What's Googlebot been sniffing?

sarahk iTamer Staff

Messages:: 28,789

Likes Received:: 4,528

Best Answers:: 123

Trophy Points:: 665

#1

I've just been helping a friend on his site and have had all his new 404 requests emailed to me. Well, I've just got home from taking the kids to Shark Tale to find 190 dubious urls which googlebot has been requesting.

Some examples
/have.htm
/forms.htm
/inbus.php
/Water.htm
/7.4.0.1.asp

This is a PHP site that uses mod_rewrite to serve .html pages, and these pages don't exist and never have. It looks like fishing, BUT the IP address resolves to Google.

so....
* is someone feeding Google crap to make it look like the site is mostly dead?
* is someone crawling and spoofing the IP?
* has Googlebot gone nuts?

any ideas?

Sarah

sarahk, Sep 22, 2004 IP

Dodger Peon

Messages:: 1,494

Likes Received:: 60

Best Answers:: 0

Trophy Points:: 0

#2

There was a similar case of Googlebot taking random stabs at Atom/RDF files earlier this year, in April. That seemed to have a purpose to it, though: it was around the time that Yahoo was coming online with the RSS feeds in My Yahoo.

This looks to be really insane on the surface.

Question, do you have the IP addresses that you can post here? Logfile entries would be nice to see also, if you have access to the raw files.
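
If the raw files are available, something like the following could pull out the relevant entries. It is only a minimal sketch: it assumes the server writes Apache "combined" format log lines, and the sample lines (timestamps, sizes) are made up for illustration. The function name `googlebot_404s` is hypothetical.

```python
import re
from collections import Counter

# Assumed format: Apache "combined" log lines; adjust the pattern for other servers.
LOG_PATTERN = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<proto>[^"]*)" '
    r'(?P<status>\d{3}) \S+ "(?P<referer>[^"]*)" "(?P<agent>[^"]*)"$'
)

def googlebot_404s(lines):
    """Count 404 responses per (client IP, path) where the agent claims to be Googlebot."""
    hits = Counter()
    for line in lines:
        match = LOG_PATTERN.match(line.strip())
        if match is None:
            continue  # skip lines that do not fit the assumed format
        if match["status"] == "404" and "Googlebot" in match["agent"]:
            hits[(match["ip"], match["path"])] += 1
    return hits

# Hypothetical sample lines; only the IP, paths and user agent come from this thread.
SAMPLE = [
    '66.249.66.201 - - [22/Sep/2004:10:00:00 +1200] "GET /have.htm HTTP/1.0" 404 290 "-" '
    '"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    '66.249.66.201 - - [22/Sep/2004:10:00:05 +1200] "GET /index.html HTTP/1.0" 200 5120 "-" '
    '"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
]

for (ip, path), count in googlebot_404s(SAMPLE).items():
    print(ip, path, count)
```

Note that the user agent string is only a claim made by the client, so this narrows the list down but does not prove who sent the requests.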

Also, have you tried a backlink check on the domain/filename.ext at any of the search engines yet, to see whether somebody is using those URLs as links? You may want to try both Yahoo and Google for this.

Dodger, Sep 22, 2004 IP

dkalweit Well-Known Member

Messages:: 520

Likes Received:: 35

Best Answers:: 0

Trophy Points:: 150

#3

sarahk said:

I've just been helping a friend on his site and have had all his new 404 requests emailed to me. Well, I've just got home from taking the kids to Shark Tale to find 190 dubious urls which googlebot has been requesting.

Some examples
/have.htm
/forms.htm
/inbus.php
/Water.htm
/7.4.0.1.asp
<snip>
any ideas?

Sarah


Is it possible that someone owned the domain before and has backlinks to these pages on an external site? I get a bunch (a couple a day) of 404s for pages from the former owner of my primary domain, even though I've owned it for 2-3 years.

--
Derek

dkalweit, Sep 22, 2004 IP

Dodger Peon

Messages:: 1,494

Likes Received:: 60

Best Answers:: 0

Trophy Points:: 0

#4

dkalweit said:

Is it possible that someone owned the domain before and has backlinks to these pages on an external site? I get a bunch (a couple a day) of 404s for pages from the former owner of my primary domain, even though I've owned it for 2-3 years.

That is a possibility, but look at the extensions of those filenames. How many sites have you come across that were built from static HTML pages, PHP scripts, and ASP scripts all at once (let alone servers that ran both ASP and PHP in tandem)? Not too many.
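
To put a number on that mix, the requested paths can be tallied by extension. A minimal sketch, using only the five example paths from the first post; the helper name `extension_counts` is made up for illustration.

```python
import posixpath
from collections import Counter

def extension_counts(paths):
    """Tally requested paths by lowercased file extension ('' if there is none)."""
    counts = Counter()
    for path in paths:
        # Drop any query string before looking at the extension.
        bare = path.split("?", 1)[0]
        counts[posixpath.splitext(bare)[1].lower()] += 1
    return counts

# The five example paths from the first post.
EXAMPLES = ["/have.htm", "/forms.htm", "/inbus.php", "/Water.htm", "/7.4.0.1.asp"]
print(extension_counts(EXAMPLES).most_common())
# → [('.htm', 3), ('.php', 1), ('.asp', 1)]
```

Three technologies in five requests is what makes a single former site an unlikely source.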

One strong possibility is that it is a Hosting foulup. If this site is on an IP with multiple websites -- googlebot could be looking for pages on one of the other sites on that IP.

Dodger, Sep 22, 2004 IP

john_loch Rodent Slayer

Messages:: 1,294

Likes Received:: 66

Best Answers:: 0

Trophy Points:: 138

#5

Hi SarahK

It's not going to be IP spoofing (pointless if you want anything more than a handshake).

Also, just because an IP resolves to Google doesn't mean it's Gbot.
I presume the agent string shows Gbot, though? (For instance, Google's translation tools act as proxies on the G net.)

As for the site being mostly dead, it's not going to happen either. Pages need to be cached (internally - even if it's just a response header) by G before they are considered to exist (let alone not existing).

The only possibility here is that GBot has found these links somewhere and is simply hunting for them.

As for a hosting stuff up, GBot ALWAYS presents the host header. If it's incorrectly resolved by the host, then your problems are far worse than a few stray links (ie multiple hostnames on the one IP).
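
For anyone unfamiliar with why the host header matters on a shared IP, here is a rough sketch of what name-based virtual hosting does with it. The hostnames, document roots and the `document_root` function are all hypothetical; real servers do this in their own configuration, not in Python.

```python
# Hypothetical hostnames and document roots, for illustration only.
VHOSTS = {
    "site-a.example": "/var/www/site-a",
    "site-b.example": "/var/www/site-b",
}

# Assumption: unknown hosts fall back to the first configured site,
# which is how Apache treats its first listed name-based virtual host.
DEFAULT_ROOT = "/var/www/site-a"

def document_root(headers):
    """Pick a document root from the Host header, ignoring any port and letter case."""
    host = headers.get("Host", "").split(":", 1)[0].strip().lower()
    return VHOSTS.get(host, DEFAULT_ROOT)

print(document_root({"Host": "site-b.example"}))      # a known host gets its own root
print(document_root({"Host": "unknown.example:80"}))  # an unknown host gets the default
```

If a request for one site's pages ends up served (and 404ed) by another site on the same IP, the mapping itself is broken, which is the "far worse" problem mentioned above.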

I've had a similar experience in the past, where Gbot was looking for one very specific page. It never found it, but the request arrived about once a day for a couple of months. Nor was I ever able to track it through queries.

There's nil effect to this really, other than a pointless attempt at watering down relevance of anchor text. If it's anything to worry about, you'll be able to track it by searching G.

If it persists, make use of your robots.txt file.
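
As a rough illustration of that last point, the rules can be checked offline with Python's standard `urllib.robotparser` before uploading them. This is a sketch under assumptions: the two disallowed paths are simply taken from the examples in the first post, and whether blocking them is wise depends on the real site having nothing at those URLs.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules built from two of the stray paths in the first post.
# robots.txt rules are prefix matches, so "/have.htm" also covers "/have.html".
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /have.htm
Disallow: /inbus.php
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

def allowed(path):
    """Return True if the rules above let Googlebot fetch the given path."""
    return parser.can_fetch("Googlebot", path)

for path in ("/have.htm", "/inbus.php", "/index.html"):
    print(path, "allowed" if allowed(path) else "blocked")
```

With close to 200 unrelated stray URLs, listing them one by one gets unwieldy, so this is more practical when the requests share a directory prefix.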

Cheers,

JL.

john_loch, Sep 22, 2004 IP

sarahk iTamer Staff

Messages:: 28,789

Likes Received:: 4,528

Best Answers:: 123

Trophy Points:: 665

#6

Here's a more complete list of files, but I've deleted a few over the past few days. Today's been the big day though, and the flurry seems to have ended. Some seem relevant, most are "off the wall"!

The IP is always 66.249.66.201 but the port changes constantly.
The user agent is Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
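
Since the user agent string can be typed by anyone, one way to double-check the IP is a reverse lookup followed by a forward lookup of the returned name. The sketch below is only an illustration: the accepted domain suffixes are an assumption based on Google's published guidance for verifying its crawler, and the function name `is_verified_googlebot` is made up. The resolver functions are passed in as parameters so the logic can be tried without network access.

```python
import socket

# Assumption: hostnames of genuine crawlers end in one of these domains.
GOOGLE_SUFFIXES = (".googlebot.com", ".google.com")

def is_verified_googlebot(ip, reverse=socket.gethostbyaddr, forward=socket.gethostbyname_ex):
    """Reverse-resolve the IP, check the domain, then confirm the name resolves back to it."""
    try:
        hostname = reverse(ip)[0]
    except OSError:
        return False  # no reverse DNS entry at all
    if not hostname.lower().rstrip(".").endswith(GOOGLE_SUFFIXES):
        return False
    try:
        addresses = forward(hostname)[2]
    except OSError:
        return False
    # The forward lookup must include the original IP, otherwise the PTR record proves nothing.
    return ip in addresses

if __name__ == "__main__":
    # Performs live DNS lookups when run directly.
    print(is_verified_googlebot("66.249.66.201"))
```

The forward step matters because whoever controls an IP block also controls its reverse DNS and can make it claim any hostname.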

The wayback machine at http://web.archive.org/web/ has nothing prior to August 2003 when the current site was implemented. It's always been PHP.

The fact that it fished for about.htm, about_us.htm and aboutus.htm is interesting. This site has recently implemented RSS and the feeds have been submitted to the major feed engines.

/5.3.2.1.asp
/5.3.3.1.asp
/5.4.0.1.asp
/5.5.0.1.asp
/5.5.6.1.asp
/5.6.0.1.asp
/5.6.2.1.asp
/5.7.0.1.asp
/6.1.0.1.asp
/6.1.0.9.asp
/6.2.0.1.asp
/6.3.0.1.asp
/7.10.0.1.asp
/7.10.0.2.asp
/7.3.0.2.asp
/7.4.0.1.asp
/7.48.11.1.asp
/7.5.0.1.asp
/7.6.0.1.asp
/7.78.0.2.asp
/7.8.0.1.asp
/7.9.0.1.asp
/7.9.0.3.asp
/7.9.0.5.asp
/about.htm
/about_us.htm
/aboutus.htm
/acts.html
/adminp.php
/archives.html
/awards.html
/A-ZIndex.htm
/banner.html
/beds.htm
/benefits.htm
/birdfood.htm
/bonds.html
/briefings.php
/business.html
/button1.swf
/button2.swf
/button3.swf
/button4.swf
/button5.swf
/carrybag.htm
/catbowls.htm
/catfood.htm
/cattoys.htm
/cattreats.htm
/chalkie.html
/chillers.htm
/civdef.htm
/cmanager.htm
/cna.html
/cnews.htm
/cnews2.htm
/cnews5.htm
/colleash.htm
/company.html
/comserv.htm
/consents.html
/contact.htm
/contacts.html
/contactus.asp
/contactus.htm
/Copyright.htm
/cssfaq.htm
/Default.asp
/default.asp
/default.htm
/default.html
/dfr.html
/dogbowls.htm
/dogfood.htm
/dogtoys.htm
/dogtreats.htm
/eden.htm
/elected.htm
/Elections.asp
/enquiry.html
/envmanag.htm
/face1.htm
/faqs.htm
/fbulbs.htm
/features.html
/feedback.html
/fees.html
/filters.htm
/finance
/fishnets.htm
/formIE.css
/forms.htm
/forms.html
/funstuff.htm
/govtdpts.asp
/gravel.htm
/have.htm
/how.htm
/inbus.php
/incco.php
/incon.php
/index.asp
/Index.html
/index.pr.html
/index.shtml
/index1.html
/index2.html
/index3.html
/ineve.php
/infaq.php
/info.html
/infotech.html
/inlib.php
/innap.php
/innew.php
/kids
/latest.htm
/Links.asp
/locale1.htm
/lotto
/magazines.htm
/mailform3.php
/main.html
/map.html
/mayor.htm
/mediak.html
/motoring.html
/nav.html
/news
/newzealand
/nl.asp?731961
/notices.htm
/oddstuff.html
/offices.html
/Orderform.pdf
/otherorg.html
/ourfees.htm
/paper.htm
/parks.htm
/peb_02.html
/personalise
/plans.htm
/plants.htm
/politics.html
/post1489.html.
/posts1718
/privacy.html
/profiles/firmprofile.htm
/profiles/legalservices.html
/project.htm
/prosalts.htm
/publications/employers_guides_index.htm
/publications/employment_articles/disputes_influence_mediators.htm
/publications/health_articles/retirementvillages_occupationrightagreement.htm
/publications/health_articles/seniorlaw_asset_rich_and_income_poor.htm
/publications/misc_archive/leakybuildings.htm
/publications/misc_archive/weathertight_homes_tenyear_longstop.htm
/publications/pdf_docs/emp_relations_aug03.pdf
/publications/pdf_docs/emp_relations_aug04.pdf
/publications/pdf_docs/employment_seminar_Sept04.pdf
/publications/pdf_docs/insbrief_dec03.pdf
/publications/pdf_docs/legal_letter_oct03f.pdf
/publications/pdf_docs/legaltorque_aug04.pdf
/publications/pdf_docs/terms_and_conditions.pdf
/publications/property_articles/real_estate_what_to_look_for.dwt
/publications/publicationsindex.htm
/publications/trucker_guides_index.htm
/publications/trust_articles/memorandum_wishes.htm
/publications/trust_articles/NZ_trust_property_investment.dwt
/publications/trust_articles/tax_treatment_trust_migration_to_nz.htm
/publications/trusts_articles.htm
/pubs.htm
/quick1.htm
/ratesinfo.htm
/recent.htm
/recreat.htm
/recreat5.htm
/reforms.html
/register.htm
/reports.htm
/returns.htm
/road.htm
/rta.html
/rural.html
/search.asp
/search.htm
/security.htm
/servic.htm
/services.htm
/shipping.htm
/shopping
/sister.htm
/Site-Map.asp
/sitemap.shtml
/skillsedu.htm
/sport.html
/story1.html
/story2.html
/story3.html
/tec.htm
/terms.htm
/testkits.htm
/tour.htm
/transpor.htm
/travel.html
/tribunal.html
/tv_guide.html
/vacancies/vacancies_index.htm
/visit1.htm
/Water.htm
/weather
/weather.html
/wl.asp?37221
/ws4.swf

sarahk, Sep 23, 2004 IP

Dodger Peon

Messages:: 1,494

Likes Received:: 60

Best Answers:: 0

Trophy Points:: 0

#7

I ran an inurl: query on some of those filepath/names. One site that is common with those that I checked is www.fmlaw.co.nz

The nameservers for this site are:
Ns Name 01   alien.xtra.co.nz 
Ns Ip4 01    202.27.184.3 
Ns Name 02   terminator.xtra.co.nz 
Ns Ip4 02    202.27.184.5
Does the site you are monitoring have anything in common with this at all? If so, I think it is a hosting issue that you will need to get worked out.

Dodger, Sep 23, 2004 IP

Dodger Peon

Messages:: 1,494

Likes Received:: 60

Best Answers:: 0

Trophy Points:: 0

#8

BTW, the 66.249.66.xxx range of IPs belongs to the newest Googlebots. They are all over the place now.

One of them is using an HTTP/1.1 GET that is visible in your raw logs. Not sure if that has anything to do with it though, but I would not rule it out.
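
For filtering logs by that range, Python's `ipaddress` module does the job. A small sketch, assuming "66.249.66.xxx" is read as the single /24 below; that reading comes from this post, not from any official list of crawler addresses.

```python
import ipaddress

# Assumption: "66.249.66.xxx" is interpreted as this /24 network.
REPORTED_RANGE = ipaddress.ip_network("66.249.66.0/24")

def in_reported_range(ip):
    """Return True if the address falls inside the range mentioned in this post."""
    return ipaddress.ip_address(ip) in REPORTED_RANGE

print(in_reported_range("66.249.66.201"))  # the IP from Sarah's logs
```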

Dodger, Sep 23, 2004 IP

sarahk iTamer Staff

Messages:: 28,789

Likes Received:: 4,528

Best Answers:: 123

Trophy Points:: 665

#9

Hi Dodger

Thanks for the suggestion. I've done some tests and have found that NZ pages are common too. Some are obvious - like they have NZ in the path - but others like /peb_02.html relate only to NZ sites.

However the hosting is with different companies for each of the sites I checked.

I guess we'll just have to wait and see...

Sarah

sarahk, Sep 23, 2004 IP

Dodger Peon

Messages:: 1,494

Likes Received:: 60

Best Answers:: 0

Trophy Points:: 0

#10

How did you determine who the hosting company was? The only info I could dig up of worth was the DNS info that I posted earlier.

I usually use http://whois.sc for .net and .com sites, and it will show you just about everything you need to know, including a list of websites that are on a single IP. Unfortunately, country-specific TLDs are not included in their databases.

Dodger, Sep 23, 2004 IP

sarahk iTamer Staff

Messages:: 28,789

Likes Received:: 4,528

Best Answers:: 123

Trophy Points:: 665

#11

New Zealand is pretty small.

Ns Name 01 alien.xtra.co.nz

Tells me it's hosted at Xtra, others are at Clear and so on. We don't have a big reseller market down here, so there's no problem with thinking a box is in one place when really it's in another.

your site has

NS.OGREHOSTING.COM 64.246.0.38

so I'd know you were with ogre hosting - as a starting point.

does that make sense?

Sarah

sarahk, Sep 23, 2004 IP

Dodger Peon

Messages:: 1,494

Likes Received:: 60

Best Answers:: 0

Trophy Points:: 0

#12

I understand that part. But sometimes it is not that cut and dried.

The WhoIs that I use shows how many websites are on a single IP, histories of the IP, who the hosting company is, registrar (if different), dns service (if different) etc. It is one of the best WhoIs lookups around.

Dodger, Sep 23, 2004 IP

sarahk iTamer Staff

Messages:: 28,789

Likes Received:: 4,528

Best Answers:: 123

Trophy Points:: 665

#13

sounds great, care to share the url?

sarahk, Sep 23, 2004 IP

anthonycea Banned

Messages:: 13,378

Likes Received:: 342

Best Answers:: 0

Trophy Points:: 0

#14

I think he gave it to you in post 10 SK.

Hope you are doing well, I was going to tell Ronnie never to argue with you, but he will find out on his own

anthonycea, Sep 23, 2004 IP

Dodger Peon

Messages:: 1,494

Likes Received:: 60

Best Answers:: 0

Trophy Points:: 0

#15

sarahk said:

sounds great, care to share the url?

Yep, as Anthony said, post #10: http://whois.sc/ It does require registration, but it is free. They have more advanced tools if you sign up for a yearly membership. For the most part, the basic free service works wonders by itself.

Dodger, Sep 23, 2004 IP

sarahk iTamer Staff

Messages:: 28,789

Likes Received:: 4,528

Best Answers:: 123

Trophy Points:: 665

#16

anthonycea said:

I think he gave it to you in post 10 SK.

Hope you are doing well, I was going to tell Ronnie never to argue with you, but he will find out on his own

Shucks Anthony, I'm not that bad!

And you're right about post #10, sorry I missed that.

It's been a while since I visited whois.sc and it's grown up a lot. I could only see 3 of the 1195 sites on the server without paying $14 a month for "silver membership", and since I'm just curious I don't feel like coughing up.

I'll take some time to look around and see what else they've added.

Sarah

sarahk, Sep 23, 2004 IP

ResaleBroker Active Member

Messages:: 1,665

Likes Received:: 50

Best Answers:: 0

Trophy Points:: 90

#17

Sarah,

I've been having the same problem myself. I've never had some of the files requested and I'm the original owner of the domain.

One of the most common files not found is "favicon.ico". What is weird is that I HAVE a favicon.ico, so I don't know why I'm getting the 404.

ResaleBroker, Sep 23, 2004 IP

minstrel Illustrious Member

Messages:: 15,082

Likes Received:: 1,243

Best Answers:: 0

Trophy Points:: 480

#18

ResaleBroker said:

One of the most common files not found is "favicon.ico" What is weird is that I HAVE a favicon.ico so I don't know why I'm getting the 404

Is your favicon.ico in your root directory? It's possible to specify it as elsewhere but I suspect it may be searched for in the root directory first...

minstrel, Sep 23, 2004 IP

ResaleBroker Active Member

Messages:: 1,665

Likes Received:: 50

Best Answers:: 0

Trophy Points:: 90

#19

Yes, it is in the root directory. In fact, I suspected this was a bogus error so I recently uploaded the file to every folder [about 10]. I'm still getting the error. I can't explain it.

ResaleBroker, Sep 23, 2004 IP

minstrel Illustrious Member

Messages:: 15,082

Likes Received:: 1,243

Best Answers:: 0

Trophy Points:: 480

#20

What's the URL?

minstrel, Sep 23, 2004 IP
