I've just been helping a friend with his site and have had all his new 404 requests emailed to me. Well, I've just got home from taking the kids to Shark Tale to find 190 dubious URLs which Googlebot has been requesting. Some examples:

/have.htm /forms.htm /inbus.php /Water.htm /7.4.0.1.asp

This is a PHP site which uses mod_rewrite to have .html pages, and these pages don't exist and never have. This looks like fishing, BUT the IP address resolves to Google. So....

* is someone feeding Google crap to make it look like the site is mostly dead?
* is someone crawling and spoofing the IP?
* has Googlebot gone nuts?

Any ideas?

Sarah
There was a similar case of Googlebot taking random stabs at Atom/RDF files earlier this year, in April. That seemed to have a purpose to it though; it was during the time that Yahoo was coming online with the RSS feeds in My Yahoo. This looks really insane on the surface. Question: do you have the IP addresses, and can you post them here? Logfile entries would be nice to see too, if you have access to the raw files. Also, have you tried a backlink check on domain/filename.ext at any of the search engines yet, to see whether somebody is using those URLs as links? You may want to try both Yahoo and Google for this.
Is it possible that someone owned the domain before and has backlinks to these pages on an external site? I get a bunch (a couple a day) of 404s for pages belonging to the former owner of my primary domain, even though I've owned it for 2-3 years. -- Derek
That is a possibility, but look at the extensions of those filenames. How many sites have you come across that were built with static HTML pages, PHP scripts, and ASP scripts (let alone servers that ran both ASP and PHP in tandem)? Not too many. One strong possibility is that it is a hosting foul-up. If this site is on an IP shared with multiple websites, Googlebot could be looking for pages on one of the other sites on that IP.
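To put a number on that mix of extensions, here is a minimal Python sketch that tallies 404 paths by file extension. It only uses the five example paths quoted in the first post; the function name `tally_extensions` is made up for illustration, and you would feed it whatever list of paths your 404 emails or logs give you.

```python
from collections import Counter
from posixpath import splitext

# The example 404 paths quoted in the first post of this thread.
paths = ["/have.htm", "/forms.htm", "/inbus.php", "/Water.htm", "/7.4.0.1.asp"]

def tally_extensions(paths):
    """Count requested paths by lowercase file extension ('' if none)."""
    counts = Counter()
    for path in paths:
        # Drop any query string before looking at the extension.
        name = path.split("?", 1)[0]
        counts[splitext(name)[1].lower()] += 1
    return counts

for ext, count in tally_extensions(paths).most_common():
    print(ext or "(none)", count)
```

A tally that is spread across .htm, .php and .asp would fit the idea that the requests were meant for several different sites rather than one.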
Hi SarahK. It's not going to be IP spoofing (pointless if you want anything more than a handshake, since the replies go to the spoofed address rather than to the spoofer). Also, just because an IP resolves to Google doesn't mean it's Gbot (i.e. Google's translation tools act as proxies on the G net). I presume, though, that the agent string shows Gbot? As for the site being mostly dead, that's not going to happen either. Pages need to be cached by G (internally, even if it's just a response header) before they are considered to exist, let alone not to exist. The only possibility here is that Gbot has found these links somewhere and is simply hunting for them. As for a hosting stuff-up, Gbot ALWAYS presents the Host header. If the host is resolving it incorrectly (i.e. multiple hostnames on the one IP), then your problems are far worse than a few stray links. I've had a similar experience in the past, where Gbot was looking for one very specific page. It never found it, but the request arrived about once a day for a couple of months, and I was never able to track it down through queries. There's nil effect to this really, other than a pointless attempt at watering down the relevance of anchor text. If it's anything to worry about, you'll be able to track it by searching G. If it persists, make use of your robots.txt file. Cheers, JL.
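On the robots.txt suggestion: before uploading any rules, you can test them locally with Python's standard `urllib.robotparser`. This is only a sketch. The rules below are hypothetical (the Disallow paths are borrowed from the 404 examples in this thread, not from the real site's robots.txt), and `is_allowed` is a helper name invented here.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules for illustration only; adjust to the paths you
# actually want Googlebot to stop requesting.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /have.htm
Disallow: /inbus.php
"""

def is_allowed(robots_txt, user_agent, path):
    """Return True if these robots.txt rules let user_agent fetch path."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, path)

print(is_allowed(ROBOTS_TXT, "Googlebot", "/have.htm"))    # disallowed by the rules above
print(is_allowed(ROBOTS_TXT, "Googlebot", "/index.html"))  # no rule matches this path
```

Bear in mind that Disallow lines match by prefix, so a rule for /have.htm would also cover /have.html.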
Here's a more complete list of files, but I've deleted a few over the past few days. Today's been the big day though, and the flurry seems to have ended. Some seem relevant, most are "off the wall"! The IP is always 66.249.66.201 but the port changes constantly. The user agent is Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html) The wayback machine at http://web.archive.org/web/ has nothing prior to August 2003 when the current site was implemented. It's always been PHP. The fact that it fished for about.htm, about_us.htm and aboutus.htm is interesting. This site has recently implemented RSS and the feeds have been submitted to the major feed engines. /5.3.2.1.asp /5.3.3.1.asp /5.4.0.1.asp /5.5.0.1.asp /5.5.6.1.asp /5.6.0.1.asp /5.6.2.1.asp /5.7.0.1.asp /6.1.0.1.asp /6.1.0.9.asp /6.2.0.1.asp /6.3.0.1.asp /7.10.0.1.asp /7.10.0.2.asp /7.3.0.2.asp /7.4.0.1.asp /7.48.11.1.asp /7.5.0.1.asp /7.6.0.1.asp /7.78.0.2.asp /7.8.0.1.asp /7.9.0.1.asp /7.9.0.3.asp /7.9.0.5.asp /about.htm /about_us.htm /aboutus.htm /acts.html /adminp.php /archives.html /awards.html /A-ZIndex.htm /banner.html /beds.htm /benefits.htm /birdfood.htm /bonds.html /briefings.php /business.html /button1.swf /button2.swf /button3.swf /button4.swf /button5.swf /carrybag.htm /catbowls.htm /catfood.htm /cattoys.htm /cattreats.htm /chalkie.html /chillers.htm /civdef.htm /cmanager.htm /cna.html /cnews.htm /cnews2.htm /cnews5.htm /colleash.htm /company.html /comserv.htm /consents.html /contact.htm /contacts.html /contactus.asp /contactus.htm /Copyright.htm /cssfaq.htm /Default.asp /default.asp /default.htm /default.html /dfr.html /dogbowls.htm /dogfood.htm /dogtoys.htm /dogtreats.htm /eden.htm /elected.htm /Elections.asp /enquiry.html /envmanag.htm /face1.htm /faqs.htm /fbulbs.htm /features.html /feedback.html /fees.html /filters.htm /finance /fishnets.htm /formIE.css /forms.htm /forms.html /funstuff.htm /govtdpts.asp /gravel.htm /have.htm /how.htm /inbus.php /incco.php /incon.php 
/index.asp /Index.html /index.pr.html /index.shtml /index1.html /index2.html /index3.html /ineve.php /infaq.php /info.html /infotech.html /inlib.php /innap.php /innew.php /kids /latest.htm /Links.asp /locale1.htm /lotto /magazines.htm /mailform3.php /main.html /map.html /mayor.htm /mediak.html /motoring.html /nav.html /news /newzealand /nl.asp?731961 /notices.htm /oddstuff.html /offices.html /Orderform.pdf /otherorg.html /ourfees.htm /paper.htm /parks.htm /peb_02.html /personalise /plans.htm /plants.htm /politics.html /post1489.html. /posts1718 /privacy.html /profiles/firmprofile.htm /profiles/legalservices.html /project.htm /prosalts.htm /publications/employers_guides_index.htm /publications/employment_articles/disputes_influence_mediators.htm /publications/health_articles/retirementvillages_occupationrightagreement.htm /publications/health_articles/seniorlaw_asset_rich_and_income_poor.htm /publications/misc_archive/leakybuildings.htm /publications/misc_archive/weathertight_homes_tenyear_longstop.htm /publications/pdf_docs/emp_relations_aug03.pdf /publications/pdf_docs/emp_relations_aug04.pdf /publications/pdf_docs/employment_seminar_Sept04.pdf /publications/pdf_docs/insbrief_dec03.pdf /publications/pdf_docs/legal_letter_oct03f.pdf /publications/pdf_docs/legaltorque_aug04.pdf /publications/pdf_docs/terms_and_conditions.pdf /publications/property_articles/real_estate_what_to_look_for.dwt /publications/publicationsindex.htm /publications/trucker_guides_index.htm /publications/trust_articles/memorandum_wishes.htm /publications/trust_articles/NZ_trust_property_investment.dwt /publications/trust_articles/tax_treatment_trust_migration_to_nz.htm /publications/trusts_articles.htm /pubs.htm /quick1.htm /ratesinfo.htm /recent.htm /recreat.htm /recreat5.htm /reforms.html /register.htm /reports.htm /returns.htm /road.htm /rta.html /rural.html /search.asp /search.htm /security.htm /servic.htm /services.htm /shipping.htm /shopping /sister.htm /Site-Map.asp /sitemap.shtml 
/skillsedu.htm /sport.html /story1.html /story2.html /story3.html /tec.htm /terms.htm /testkits.htm /tour.htm /transpor.htm /travel.html /tribunal.html /tv_guide.html /vacancies/vacancies_index.htm /visit1.htm /Water.htm /weather /weather.html /wl.asp?37221 /ws4.swf
I ran an inurl: query on some of those filepath/names. One site that is common with those that I checked is www.fmlaw.co.nz. The nameservers for this site are:

Ns Name 01  alien.xtra.co.nz
Ns Ip4 01   202.27.184.3
Ns Name 02  terminator.xtra.co.nz
Ns Ip4 02   202.27.184.5

Does the site you are monitoring have anything in common with this at all? If so, I think it is a hosting issue that you will need to get worked out.
BTW, the 66.249.66.xxx range of IPs are the newest Googlebots. They are all over the place now. One of them is using an HTTP/1.1 GET that is visible in your raw logs. Not sure if that has anything to do with it though, but I would not rule it out.
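For anyone wanting to double-check that an address like the 66.249.66.xxx ones really belongs to Google, the usual approach is a forward-confirmed reverse DNS lookup: resolve the IP to a hostname, check the hostname's domain, then resolve that hostname back and see whether it returns the same IP. Here is a rough Python sketch using the standard `socket` module. The function name is made up, and the two domain suffixes are an assumption based on what Google publishes for its crawlers, so check them against Google's current documentation.

```python
import socket

# Assumed hostname suffixes for Google's crawlers; verify against
# Google's own documentation before relying on them.
GOOGLE_SUFFIXES = (".googlebot.com", ".google.com")

def is_verified_googlebot(ip,
                          reverse=socket.gethostbyaddr,
                          forward=socket.gethostbyname_ex):
    """Forward-confirmed reverse DNS check for an IP claiming to be Googlebot.

    `reverse` and `forward` default to the real resolvers and can be
    swapped for stubs when testing without network access.
    """
    try:
        host = reverse(ip)[0]          # IP -> hostname
    except OSError:
        return False
    if not host.endswith(GOOGLE_SUFFIXES):
        return False
    try:
        addresses = forward(host)[2]   # hostname -> list of IPs
    except OSError:
        return False
    return ip in addresses

# Example call (needs working DNS):
# print(is_verified_googlebot("66.249.66.201"))
```

The forward lookup matters because whoever controls the reverse DNS for an address block can make it claim any hostname they like.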
Hi Dodger Thanks for the suggestion. I've done some tests and have found that NZ pages are common too. Some are obvious - like they have NZ in the path - but others like /peb_02.html relate only to NZ sites. However the hosting is with different companies for each of the sites I checked. I guess we'll just have to wait and see... Sarah
How did you determine who the hosting company was? The only useful info I could dig up was the DNS info that I posted earlier. I usually use http://whois.sc for .net and .com sites, and it will show you just about everything you need to know, including a list of websites that are on a single IP. Unfortunately country-specific TLDs are not included in their databases.
New Zealand is pretty small. Tells me it's hosted at Xtra, others are at Clear and so on. We don't have a big reseller market down here, so there's no problem with thinking a box is in one place when really it's in another. your site has so I'd know you were with ogre hosting - as a starting point. does that make sense? Sarah
I understand that part. But sometimes it is not that cut and dried. The WhoIs that I use shows how many websites are on a single IP, histories of the IP, who the hosting company is, the registrar (if different), the DNS service (if different), etc. It is one of the best WhoIs lookups around.
I think he gave it to you in post 10, SK. Hope you are doing well. I was going to tell Ronnie never to argue with you, but he will find out on his own.
Yep, as Anthony said, post#10 http://whois.sc/ It does require registration though, but it is free. They have more advanced tools if you sign up for yearly memberships. For the most part, the basic free service works wonders by itself.
Shucks Anthony, I'm not that bad! And you're right about post #10, sorry I missed that. It's been a while since I visited whois.sc and it's grown up a lot. I could only see 3 of the 1195 sites on the server without paying $14 a month for "silver membership", and since I'm just curious I don't feel like coughing up. I'll take some time to look around and see what else they've added. Sarah
Sarah, I've been having the same problem myself. Some of the files being requested have never existed, and I'm the original owner of the domain. One of the most common files not found is "favicon.ico". What is weird is that I HAVE a favicon.ico, so I don't know why I'm getting the 404.
Is your favicon.ico in your root directory? It's possible to specify it as elsewhere but I suspect it may be searched for in the root directory first...
Yes, it is in the root directory. In fact, I suspected this was a bogus error so I recently uploaded the file to every folder [about 10]. I'm still getting the error. I can't explain it.