What's Googlebot been sniffing?

Discussion in 'Google' started by sarahk, Sep 22, 2004.

  1. #1
    I've just been helping a friend on his site and have had all his new 404 requests emailed to me. Well, I've just got home from taking the kids to Shark Tale to find 190 dubious URLs which Googlebot has been requesting.

    Some examples
    /have.htm
    /forms.htm
    /inbus.php
    /Water.htm
    /7.4.0.1.asp


    This is a PHP site which uses mod_rewrite to serve .html pages, and these pages don't exist and never have. This looks like fishing, BUT the IP address resolves to Google.

    so....
    * is someone feeding Google crap to make it look like the site is mostly dead?
    * is someone crawling and spoofing the IP?
    * has Googlebot gone nuts?

    any ideas?

    Sarah
     
    sarahk, Sep 22, 2004 IP
  2. Dodger

    Dodger Peon

    Messages:
    1,494
    Likes Received:
    60
    Best Answers:
    0
    Trophy Points:
    0
    #2
    There was a similar case of Googlebot taking random stabs at Atom/RDF files earlier this year, in April. That seemed to have a purpose to it, though: it was during the time that Yahoo was coming online with the RSS feeds in My Yahoo.

    This looks to be really insane on the surface.

    Question, do you have the IP addresses that you can post here? Logfile entries would be nice to see also, if you have access to the raw files.
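
    For anyone wanting to pull such entries out of the raw logs, here is a minimal sketch. It assumes the server writes Apache's "combined" log format (the thread doesn't say which format is in use). The sample lines and the function name googlebot_404s are made up for illustration.

```python
import re

# Assumed layout: Apache "combined" log format. Adjust the pattern if the
# server is configured with a different LogFormat.
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" '
    r'(?P<status>\d{3}) \S+ "(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)


def googlebot_404s(lines):
    """Return (ip, path) pairs for 404s whose user agent mentions Googlebot."""
    hits = []
    for line in lines:
        match = LOG_LINE.match(line)
        if match is None:
            continue  # skip lines that do not fit the assumed format
        if match["status"] == "404" and "Googlebot" in match["agent"]:
            hits.append((match["ip"], match["path"]))
    return hits


# Hypothetical sample lines, not copied from the real logs.
sample = [
    '66.249.66.201 - - [22/Sep/2004:18:02:11 +1200] "GET /have.htm HTTP/1.1" '
    '404 312 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; '
    '+http://www.google.com/bot.html)"',
    '203.0.113.7 - - [22/Sep/2004:18:02:15 +1200] "GET /index.html HTTP/1.1" '
    '200 5120 "-" "Mozilla/4.0 (compatible; MSIE 6.0)"',
]

for ip, path in googlebot_404s(sample):
    print(ip, path)
```

    The user agent string can be faked by anyone, so a match here only says what the client claimed to be.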

    Also, have you tried a backlink check on the domain/filename.ext at any of the search engines yet, to see whether somebody is using those URLs as links? You may want to try both Yahoo and Google for this.
     
    Dodger, Sep 22, 2004 IP
  3. dkalweit

    dkalweit Well-Known Member

    Messages:
    520
    Likes Received:
    35
    Best Answers:
    0
    Trophy Points:
    150
    #3
    Is it possible that someone owned the domain before, and an external site still has backlinks to these pages? I get a bunch (a couple a day) of 404s for pages that belonged to the former owner of my primary domain, even though I've owned it for 2-3 years.


    --
    Derek
     
    dkalweit, Sep 22, 2004 IP
  4. Dodger

    Dodger Peon

    #4
    That is a possibility, but look at the extensions of those filenames. How many sites have you come across that were built with static HTML pages, PHP scripts, and ASP scripts (let alone servers that ran both ASP and PHP in tandem)? Not too many.
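
    To put numbers on the mix of extensions, a quick tally helps. This is only a sketch: the helper name extension_counts is made up, and the input is the five example URLs from the first post.

```python
from collections import Counter
from posixpath import splitext
from urllib.parse import urlsplit


def extension_counts(paths):
    """Count requested paths by lowercased file extension."""
    counts = Counter()
    for path in paths:
        # Drop any query string (e.g. "/page.asp?123") before splitting.
        ext = splitext(urlsplit(path).path)[1].lower()
        counts[ext or "(none)"] += 1
    return counts


# The five example URLs from the first post.
examples = ["/have.htm", "/forms.htm", "/inbus.php", "/Water.htm", "/7.4.0.1.asp"]
print(extension_counts(examples))
```

    Feeding it a full list of 404s would show at a glance whether the requests look like one old site or several unrelated ones.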

    One strong possibility is that it is a hosting foul-up. If this site is on an IP with multiple websites, Googlebot could be looking for pages on one of the other sites on that IP.
     
    Dodger, Sep 22, 2004 IP
  5. john_loch

    john_loch Rodent Slayer

    Messages:
    1,294
    Likes Received:
    66
    Best Answers:
    0
    Trophy Points:
    138
    #5
    Hi SarahK

    It's not going to be IP spoofing (pointless if you want anything more than a handshake).

    Also, just because an IP resolves to Google doesn't mean it's Gbot.
    I presume, though, that the agent string shows Gbot? (For example, Google's translation tools act as proxies on the G net.)
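
    One common way to check a claimed Googlebot is a reverse DNS lookup followed by a forward lookup that must return the same IP. Here is a hedged sketch of that idea. The function name and the injectable lookup parameters are my own, and the accepted hostname suffixes are an assumption that should be checked against Google's current guidance.

```python
import socket

# Assumed hostname suffixes for Google's crawlers; verify before relying on them.
GOOGLE_SUFFIXES = (".googlebot.com", ".google.com")


def is_verified_googlebot(ip, reverse_lookup=None, forward_lookup=None):
    """Forward-confirmed reverse DNS check for an IP that claims to be Googlebot.

    The two lookup callables can be replaced in tests; by default they use
    the system resolver.
    """
    if reverse_lookup is None:
        reverse_lookup = lambda addr: socket.gethostbyaddr(addr)[0]
    if forward_lookup is None:
        forward_lookup = lambda host: socket.gethostbyname_ex(host)[2]
    try:
        host = reverse_lookup(ip)
    except OSError:
        return False  # no PTR record or resolver failure
    if not host.lower().rstrip(".").endswith(GOOGLE_SUFFIXES):
        return False  # the IP's hostname is not in an accepted Google domain
    try:
        # The hostname must resolve back to the same IP, otherwise the
        # PTR record could simply be forged by whoever controls the IP block.
        return ip in forward_lookup(host)
    except OSError:
        return False
```

    Calling is_verified_googlebot("66.249.66.201") with no extra arguments would use live DNS. Even a positive result only shows the request came from Google's network, which still leaves open which Google service sent it.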

    As for the site being mostly dead, it's not going to happen either. Pages need to be cached (internally - even if it's just a response header) by G before they are considered to exist (let alone not existing).

    The only possibility here is that GBot has found these links somewhere and is simply hunting for them.

    As for a hosting stuff up, GBot ALWAYS presents the host header. If it's incorrectly resolved by the host, then your problems are far worse than a few stray links (ie multiple hostnames on the one IP).
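
    To illustrate why the host header matters on a shared IP, here is a toy model of name-based virtual hosting. It is not how any particular web server is implemented, and the hostnames and the pick_site helper are hypothetical.

```python
def pick_site(host_header, vhosts, default_site):
    """Return the site a name-based virtual host setup would serve.

    vhosts maps lowercase hostnames to sites; anything unrecognised
    falls through to default_site.
    """
    # The header may carry a port and may differ in case.
    name = host_header.split(":")[0].strip().lower()
    return vhosts.get(name, default_site)


# Hypothetical sites sharing one IP address.
vhosts = {"site-a.example": "Site A", "site-b.example": "Site B"}

print(pick_site("Site-B.example:80", vhosts, "Site A"))
print(pick_site("unknown.example", vhosts, "Site A"))
```

    If the host's mapping is wrong or missing, requests meant for another site land on the default one and show up there as 404s.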

    I've had a similar experience in the past, where Gbot was looking for one very specific page. It never found it, but the request arrived about once a day for a couple of months. Nor was I ever able to track it through queries.

    There's nil effect to this really, other than a pointless attempt at watering down relevance of anchor text. If it's anything to worry about, you'll be able to track it by searching G.

    If it persists, make use of your robots.txt file.
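
    As a rough sketch of what that could look like, the rules below disallow two of the stray paths for Googlebot and are then checked with Python's standard urllib.robotparser. The rules themselves and the example.com URLs are placeholders, not a recommendation for this particular site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content blocking two of the stray paths.
RULES = """\
User-agent: Googlebot
Disallow: /have.htm
Disallow: /publications/
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())

# example.com stands in for the real domain.
for url in (
    "http://example.com/have.htm",
    "http://example.com/publications/pubs.htm",
    "http://example.com/index.html",
):
    print(url, parser.can_fetch("Googlebot", url))
```

    Disallow rules match by path prefix, so a single directory rule covers everything beneath it.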

    Cheers,

    JL.
     
    john_loch, Sep 22, 2004 IP
  6. sarahk

    sarahk iTamer Staff

    Messages:
    28,789
    Likes Received:
    4,528
    Best Answers:
    123
    Trophy Points:
    665
    #6
    Here's a more complete list of files, but I've deleted a few over the past few days. Today's been the big day though, and the flurry seems to have ended. Some seem relevant, most are "off the wall"!

    The IP is always 66.249.66.201 but the port changes constantly.
    The user agent is Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)

    The wayback machine at http://web.archive.org/web/ has nothing prior to August 2003 when the current site was implemented. It's always been PHP.

    The fact that it fished for about.htm, about_us.htm and aboutus.htm is interesting. This site has recently implemented RSS and the feeds have been submitted to the major feed engines.

    /5.3.2.1.asp
    /5.3.3.1.asp
    /5.4.0.1.asp
    /5.5.0.1.asp
    /5.5.6.1.asp
    /5.6.0.1.asp
    /5.6.2.1.asp
    /5.7.0.1.asp
    /6.1.0.1.asp
    /6.1.0.9.asp
    /6.2.0.1.asp
    /6.3.0.1.asp
    /7.10.0.1.asp
    /7.10.0.2.asp
    /7.3.0.2.asp
    /7.4.0.1.asp
    /7.48.11.1.asp
    /7.5.0.1.asp
    /7.6.0.1.asp
    /7.78.0.2.asp
    /7.8.0.1.asp
    /7.9.0.1.asp
    /7.9.0.3.asp
    /7.9.0.5.asp
    /about.htm
    /about_us.htm
    /aboutus.htm
    /acts.html
    /adminp.php
    /archives.html
    /awards.html
    /A-ZIndex.htm
    /banner.html
    /beds.htm
    /benefits.htm
    /birdfood.htm
    /bonds.html
    /briefings.php
    /business.html
    /button1.swf
    /button2.swf
    /button3.swf
    /button4.swf
    /button5.swf
    /carrybag.htm
    /catbowls.htm
    /catfood.htm
    /cattoys.htm
    /cattreats.htm
    /chalkie.html
    /chillers.htm
    /civdef.htm
    /cmanager.htm
    /cna.html
    /cnews.htm
    /cnews2.htm
    /cnews5.htm
    /colleash.htm
    /company.html
    /comserv.htm
    /consents.html
    /contact.htm
    /contacts.html
    /contactus.asp
    /contactus.htm
    /Copyright.htm
    /cssfaq.htm
    /Default.asp
    /default.asp
    /default.htm
    /default.html
    /dfr.html
    /dogbowls.htm
    /dogfood.htm
    /dogtoys.htm
    /dogtreats.htm
    /eden.htm
    /elected.htm
    /Elections.asp
    /enquiry.html
    /envmanag.htm
    /face1.htm
    /faqs.htm
    /fbulbs.htm
    /features.html
    /feedback.html
    /fees.html
    /filters.htm
    /finance
    /fishnets.htm
    /formIE.css
    /forms.htm
    /forms.html
    /funstuff.htm
    /govtdpts.asp
    /gravel.htm
    /have.htm
    /how.htm
    /inbus.php
    /incco.php
    /incon.php
    /index.asp
    /Index.html
    /index.pr.html
    /index.shtml
    /index1.html
    /index2.html
    /index3.html
    /ineve.php
    /infaq.php
    /info.html
    /infotech.html
    /inlib.php
    /innap.php
    /innew.php
    /kids
    /latest.htm
    /Links.asp
    /locale1.htm
    /lotto
    /magazines.htm
    /mailform3.php
    /main.html
    /map.html
    /mayor.htm
    /mediak.html
    /motoring.html
    /nav.html
    /news
    /newzealand
    /nl.asp?731961
    /notices.htm
    /oddstuff.html
    /offices.html
    /Orderform.pdf
    /otherorg.html
    /ourfees.htm
    /paper.htm
    /parks.htm
    /peb_02.html
    /personalise
    /plans.htm
    /plants.htm
    /politics.html
    /post1489.html.
    /posts1718
    /privacy.html
    /profiles/firmprofile.htm
    /profiles/legalservices.html
    /project.htm
    /prosalts.htm
    /publications/employers_guides_index.htm
    /publications/employment_articles/disputes_influence_mediators.htm
    /publications/health_articles/retirementvillages_occupationrightagreement.htm
    /publications/health_articles/seniorlaw_asset_rich_and_income_poor.htm
    /publications/misc_archive/leakybuildings.htm
    /publications/misc_archive/weathertight_homes_tenyear_longstop.htm
    /publications/pdf_docs/emp_relations_aug03.pdf
    /publications/pdf_docs/emp_relations_aug04.pdf
    /publications/pdf_docs/employment_seminar_Sept04.pdf
    /publications/pdf_docs/insbrief_dec03.pdf
    /publications/pdf_docs/legal_letter_oct03f.pdf
    /publications/pdf_docs/legaltorque_aug04.pdf
    /publications/pdf_docs/terms_and_conditions.pdf
    /publications/property_articles/real_estate_what_to_look_for.dwt
    /publications/publicationsindex.htm
    /publications/trucker_guides_index.htm
    /publications/trust_articles/memorandum_wishes.htm
    /publications/trust_articles/NZ_trust_property_investment.dwt
    /publications/trust_articles/tax_treatment_trust_migration_to_nz.htm
    /publications/trusts_articles.htm
    /pubs.htm
    /quick1.htm
    /ratesinfo.htm
    /recent.htm
    /recreat.htm
    /recreat5.htm
    /reforms.html
    /register.htm
    /reports.htm
    /returns.htm
    /road.htm
    /rta.html
    /rural.html
    /search.asp
    /search.htm
    /security.htm
    /servic.htm
    /services.htm
    /shipping.htm
    /shopping
    /sister.htm
    /Site-Map.asp
    /sitemap.shtml
    /skillsedu.htm
    /sport.html
    /story1.html
    /story2.html
    /story3.html
    /tec.htm
    /terms.htm
    /testkits.htm
    /tour.htm
    /transpor.htm
    /travel.html
    /tribunal.html
    /tv_guide.html
    /vacancies/vacancies_index.htm
    /visit1.htm
    /Water.htm
    /weather
    /weather.html
    /wl.asp?37221
    /ws4.swf
     
    sarahk, Sep 23, 2004 IP
  7. Dodger

    Dodger Peon

    #7
    I ran an inurl: query on some of those filepath/names. One site that is common with those that I checked is www.fmlaw.co.nz

    The nameservers for this site are:

    Ns Name 01   alien.xtra.co.nz 
    Ns Ip4 01    202.27.184.3 
    Ns Name 02   terminator.xtra.co.nz 
    Ns Ip4 02    202.27.184.5
    Does the site you are monitoring have anything in common with this at all? If so, I think it is a hosting issue that you will need to get worked out.
     
    Dodger, Sep 23, 2004 IP
  8. Dodger

    Dodger Peon

    #8
    BTW, the 66.249.66.xxx range of IPs are the newest Googlebots. They are all over the place now.

    One of them is using an HTTP/1.1 GET that is visible in your raw logs. Not sure if that has anything to do with it though, but I would not rule it out.
     
    Dodger, Sep 23, 2004 IP
  9. sarahk

    sarahk iTamer Staff

    #9
    Hi Dodger

    Thanks for the suggestion. I've done some tests and have found that NZ pages are common too. Some are obvious - like they have NZ in the path - but others like /peb_02.html relate only to NZ sites.

    However the hosting is with different companies for each of the sites I checked.

    I guess we'll just have to wait and see...

    Sarah
     
    sarahk, Sep 23, 2004 IP
  10. Dodger

    Dodger Peon

    #10
    How did you determine who the hosting company was? The only info I could dig up of worth was the DNS info that I posted earlier.

    I usually use http://whois.sc for .net and .com sites, and it will show you just about everything you need to know, including a list of websites that are on a single IP. Unfortunately, country-specific TLDs are not included in their databases.
     
    Dodger, Sep 23, 2004 IP
  11. sarahk

    sarahk iTamer Staff

    #11
    New Zealand is pretty small.
    Tells me it's hosted at Xtra, others are at Clear and so on. We don't have a big reseller market down here, so there's no problem with thinking a box is in one place when really it's in another.

    your site has
    so I'd know you were with ogre hosting - as a starting point.

    does that make sense?

    Sarah
     
    sarahk, Sep 23, 2004 IP
  12. Dodger

    Dodger Peon

    #12
    I understand that part. But sometimes it is not that cut and dry.

    The WhoIs that I use shows how many websites are on a single IP, histories of the IP, who the hosting company is, registrar (if different), dns service (if different) etc. It is one of the best WhoIs lookups around.
     
    Dodger, Sep 23, 2004 IP
  13. sarahk

    sarahk iTamer Staff

    #13
    sounds great, care to share the url?
     
    sarahk, Sep 23, 2004 IP
  14. anthonycea

    anthonycea Banned

    Messages:
    13,378
    Likes Received:
    342
    Best Answers:
    0
    Trophy Points:
    0
    #14
    I think he gave it to you in post 10 SK.

    Hope you are doing well, I was going to tell Ronnie never to argue with you, but he will find out on his own :D
     
    anthonycea, Sep 23, 2004 IP
  15. Dodger

    Dodger Peon

    #15
    Yep, as Anthony said, it's in post #10: http://whois.sc/ It does require registration, but it is free. They have more advanced tools if you sign up for a yearly membership. For the most part, the basic free service works wonders by itself.
     
    Dodger, Sep 23, 2004 IP
  16. sarahk

    sarahk iTamer Staff

    #16
    Shucks Anthony, I'm not that bad!

    And you're right about post #10, sorry I missed that.

    It's been a while since I visited whois.sc and it's grown up a lot. I could only see 3 of the 1195 sites on the server without paying $14 a month for "silver membership", and since I'm just curious I don't feel like coughing up.

    I'll take some time to look around and see what else they've added.

    Sarah
     
    sarahk, Sep 23, 2004 IP
  17. ResaleBroker

    ResaleBroker Active Member

    Messages:
    1,665
    Likes Received:
    50
    Best Answers:
    0
    Trophy Points:
    90
    #17
    Sarah,

    I've been having the same problem myself. I've never had some of the files requested and I'm the original owner of the domain.

    One of the most common files not found is "favicon.ico". What is weird is that I HAVE a favicon.ico, so I don't know why I'm getting the 404 :confused:
     
    ResaleBroker, Sep 23, 2004 IP
  18. minstrel

    minstrel Illustrious Member

    Messages:
    15,082
    Likes Received:
    1,243
    Best Answers:
    0
    Trophy Points:
    480
    #18
    Is your favicon.ico in your root directory? It's possible to specify it as elsewhere but I suspect it may be searched for in the root directory first...
     
    minstrel, Sep 23, 2004 IP
  19. ResaleBroker

    ResaleBroker Active Member

    #19
    Yes, it is in the root directory. In fact, I suspected this was a bogus error so I recently uploaded the file to every folder [about 10]. I'm still getting the error. I can't explain it.
     
    ResaleBroker, Sep 23, 2004 IP
  20. minstrel

    minstrel Illustrious Member

    #20
    What's the URL?
     
    minstrel, Sep 23, 2004 IP