OK... here is a question/problem and I would greatly appreciate any help! Site is in sig - homesalewizard. Robots.txt is set as:

User-agent: Googlebot
Disallow: /buy/
Disallow: /sell/

Those are directories mostly for users' accounts. Googlebot continues to crawl through them... So the question is: if */buy/* is disallowed, would that automatically exclude something like */buy/savelisting.php?homeid=191*? I feel like we are in the middle of a mess with indexing. Google used to show internal pages in the SERPs and it seems like it no longer does. Thanks for the help!
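If you want to test this locally instead of waiting to see what Googlebot does, here is a minimal sketch using Python's urllib.robotparser; the hostname is just a placeholder and the rules are the ones quoted above.

from urllib.robotparser import RobotFileParser

# Rules copied from the post; www.example.com below is only a placeholder host.
rules = [
    "User-agent: Googlebot",
    "Disallow: /buy/",
    "Disallow: /sell/",
]

rp = RobotFileParser()
rp.parse(rules)

# Disallow values are path prefixes, so /buy/ should also cover deeper URLs
# such as /buy/savelisting.php?homeid=191.
for path in ("/buy/", "/buy/savelisting.php?homeid=191", "/sell/listing.php", "/about/"):
    url = "http://www.example.com" + path
    print(path, "->", "blocked" if not rp.can_fetch("Googlebot", url) else "allowed")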
Hmmmmm ... very puzzling situation. I'm not sure if I've resolved it or not, but here's some help with your problem-solving. Using the validation tool at SearchEngineWorld, I checked the syntax of your robots.txt file and found no obvious errors. I then double-checked Google's advice to webmasters on this topic. They have some helpful instructions -- all of which you seem to be following -- in their webmaster FAQs. So ... no obvious problems that I could see. Any thoughts from web gurus more experienced with these issues? Best, - James
Your robots.txt file is all messed up now... I don't know if it looked like this when James tried to validate it, but it's full of errors now. For one thing, you have invalid user-agent designations as well as comments in the user-agent lines. Your syntax for many of the Disallow lines is incorrect. And the file is HUGE! You can eliminate most of the repetition using

User-agent: *

And the file as it exists now finishes with

User-agent: *
Disallow: /

which is saying "note to ALL spiders -- do not index ANYTHING". Start over with this robots.txt file:

User-agent: *
Disallow: /buy/
Disallow: /sell/
Disallow: /message/
Disallow: /news/
Disallow: /account/

and dump everything else.
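To see concretely why that trailing block is so damaging, here is a quick sketch with Python's urllib.robotparser (the hostname is a placeholder): under "User-agent: *" with "Disallow: /", every compliant crawler is shut out of the whole site, while the suggested minimal file only blocks the listed directories.

from urllib.robotparser import RobotFileParser

# Assumed example files: the catch-all ending versus the suggested minimal file.
catch_all = ["User-agent: *", "Disallow: /"]
suggested = [
    "User-agent: *",
    "Disallow: /buy/",
    "Disallow: /sell/",
    "Disallow: /message/",
    "Disallow: /news/",
    "Disallow: /account/",
]

for name, rules in (("trailing 'Disallow: /' block", catch_all), ("suggested minimal file", suggested)):
    rp = RobotFileParser()
    rp.parse(rules)
    ok = rp.can_fetch("Googlebot", "http://www.example.com/")
    print(name, "->", "homepage allowed" if ok else "homepage blocked")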
That's absolute nonsense, sitetutor. If you meant it as some sort of satirical comment, you forgot the smiley.
They may do the opposite of what the webmaster intended to instruct, but take a look at the robots.txt file in question -- if Googlebot didn't know how to interpret that mess, you can hardly blame it. I'd like to see even a single example of Googlebot ignoring a properly constructed robots.txt file.
The majority of webmasters do NOT properly instruct ... that is who the rest of us are paying for! Not the smartest move on G's part, but that is what they are doing!
Is mine properly constructed?

www.vlead.com/robots.txt

http://www.google.com/search?q=site...&rls=org.mozilla:en-US:official&start=30&sa=N
http://www.google.com/search?q=site...&rls=org.mozilla:en-US:official&start=20&sa=N
Yes, it is. How long has that entry been there? The "Disallow: /extranet" entry, I mean. All I see there is a non-cached log-in page in the first search string. What am I supposed to be looking at with the second search string?
For almost a year now. Basically, it has been there ever since we started the extranet. The first search string was site:vlead.com and the second was site:www.vlead.com
Errors aside, there's another issue that may deserve clarification. The robots file instructs a spider not to index specific files and/or folders. However, I don't believe that means 'do not access, do not request, or do not crawl' these resources. Am I wrong? /*tom*/
Well... it seems like I've figured it out... the problem was that I used /buy/, for example, which actually should be */buy* without the trailing slash if I want to disallow all directories and file extensions within this directory. Now it works well! Minstrel, your advice is good - to use only

User-agent: *
Disallow: /buy/
Disallow: /sell/
Disallow: /message/
Disallow: /news/
Disallow: /account/

but my robots.txt is correct - it includes only well-known robots and excludes the rest to save bandwidth. I checked it through the validator http://www.searchengineworld.com/cgi-bin/robotcheck.cgi and it's fine. Again, the key was the "slash"! The correct way is:

User-agent: *
Disallow: /buy
Disallow: /sell
Disallow: /message
Disallow: /news
Disallow: /account

Thank you all! And check your robots.txt - with this hint many of us could get rid of "supplementals".
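For anyone who wants to see the difference between the two forms concretely, here is a small sketch with Python's urllib.robotparser (the host and the /buyers-guide.html path are made up for illustration). Since Disallow values are matched as path prefixes, /buy without the trailing slash is simply a broader prefix: it also catches /buy itself and anything else beginning with those characters.

from urllib.robotparser import RobotFileParser

def blocked(rules, path, agent="Googlebot"):
    # True if the given rules would stop the agent from fetching the path.
    rp = RobotFileParser()
    rp.parse(rules)
    return not rp.can_fetch(agent, "http://www.example.com" + path)

with_slash = ["User-agent: *", "Disallow: /buy/"]
without_slash = ["User-agent: *", "Disallow: /buy"]

for path in ("/buy", "/buy/", "/buy/savelisting.php?homeid=191", "/buyers-guide.html"):
    print(path, "| /buy/ blocks:", blocked(with_slash, path), "| /buy blocks:", blocked(without_slash, path))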
There are some spiders which ignore robots.txt instructions even if you set the user agent to *. There is also an application called Teleport Ultra (an offline browser) which can be instructed to ignore robots instructions when spidering a website. I think Google does indeed visit locations it is not supposed to, but it doesn't index them. I noticed on my message board, though, that Google tries to spider the excluded directories, but the bot receives a "no permission" message because the excluded dirs aren't accessible to the IIS guest user.
http://www.robotstxt.org/wc/norobots.html
http://www.robotstxt.org/wc/exclusion-admin.html

Your robots.txt file still contains numerous invalid user-agent identifiers.
http://www.searchengineworld.com/robots/robots_tutorial.htm See also http://www.searchengineworld.com/misc/robots_txt_crawl.htm for common errors.
Note that the SearchEngineWorld validator does not check for invalid user-agent designations. Your robots.txt file does indeed "validate" according to that script, but as an example, the partial output from the validator contains invalid user-agent lines. Since the validator script doesn't check those lines, it "passes" them, but they are not valid.
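Since the validator passes those lines silently, one way to review them is to pull the User-agent values out yourself and check each name against the robots database at robotstxt.org. A rough sketch in Python (the URL is a placeholder and list_user_agents is just an illustrative helper):

import urllib.request

def list_user_agents(url):
    # Fetch a live robots.txt and return every User-agent value for manual review.
    with urllib.request.urlopen(url) as resp:
        text = resp.read().decode("utf-8", errors="replace")
    agents = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments
        if line.lower().startswith("user-agent:"):
            agents.append(line.split(":", 1)[1].strip())
    return agents

print(list_user_agents("http://www.example.com/robots.txt"))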
Thanks, Minstrel! Well... after changing "/buy/" to "/buy", Googlebot stopped crawling the directory and all the files. You might be correct about the "validator pass" issue. Changed it. Let's see how it works out. Thanks again!
usandr: note that those are not the only user-agent errors -- just three examples of problem entries. There are several others.