Gbot is crawling disallowed directories...

Discussion in 'Google' started by usandr, Feb 9, 2005.

  1. #1
    OK... here is a question/problem and I would greatly appreciate any help!

    Site is in sig - homesalewizard.

    Robots.txt is set as:

    User-agent: Googlebot
    Disallow: /buy/
    Disallow: /sell/

    Those directories are mostly for users' accounts.
    Googlebot continues to crawl through them...


    So the question is - if */buy/* is disallowed, would it automatically
    exclude something like */buy/savelisting.php?homeid=191*?
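
    Here is how I have been testing what I mean, using Python's built-in
    urllib.robotparser (just a rough sketch -- example.com stands in for my
    domain, and I can't say for sure it matches Googlebot's own logic):
    # Rough check of robots.txt prefix matching using Python's standard library.
    # example.com is only a placeholder for my site, and this parser follows the
    # original robots.txt standard, so it may not mirror Googlebot exactly.
    from urllib.robotparser import RobotFileParser

    parser = RobotFileParser()
    parser.parse([
        "User-agent: Googlebot",
        "Disallow: /buy/",
        "Disallow: /sell/",
    ])

    # Both print False (disallowed), because /buy/ works as a prefix:
    # any path that starts with /buy/ is covered by the rule.
    print(parser.can_fetch("Googlebot", "http://www.example.com/buy/"))
    print(parser.can_fetch("Googlebot", "http://www.example.com/buy/savelisting.php?homeid=191"))
    Code (markup):
    By those standard prefix rules the URL should already be excluded, which is exactly why the continued crawling has me confused.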

    I feel like we are in the middle of an indexing mess. Google used to show
    internal pages in the SERPs, and it seems like it no longer does.

    Thanks for any help!
     
    usandr, Feb 9, 2005 IP
  2. DVDsPlusMore

    DVDsPlusMore Guest

    #2
    Hmmmmm ... very puzzling situation. I'm not sure if I've resolved it or not, but here's some help with your problem-solving.

    Using the validation tool at SearchEngineWorld, I checked the syntax of your robots.txt file and found no obvious errors.

    I then double-checked Google's advice to webmasters on this topic. They have some helpful instructions -- all of which you seem to be following -- in their webmaster FAQs.

    So ... no obvious problems that I could see. Any thoughts from web gurus more experienced with these issues?

    Best,

    - James
     
    DVDsPlusMore, Feb 10, 2005 IP
  3. minstrel

    minstrel Illustrious Member

    #3
    Your robots.txt file is all messed up now... I don't know if it looked like this when James tried to validate it, but it's full of errors now.

    For one thing, you have invalid user-agent designations as well as comments in the user-agent lines. Your syntax for many of the Disallow lines is incorrect. And the file is HUGE! You can eliminate most of the repetition using
    User-agent: *
    Code (markup):
    And the file as it exists now finishes with
    User-agent: *
    Disallow: / 
    Code (markup):
    which is saying "note to ALL spiders -- do not index ANYTHING".

    Start over with this robots.txt file:
    User-agent: *
    Disallow: /buy/
    Disallow: /sell/
    Disallow: /message/
    Disallow: /news/
    Disallow: /account/
    
    Code (markup):
    and dump everything else.
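
    If you want to sanity-check a file like that before you upload it, here is a
    quick sketch with Python's built-in urllib.robotparser (the paths are made-up
    examples, and this parser implements the original robots.txt standard rather
    than anything Googlebot-specific):
    # Quick sanity check of the suggested robots.txt using Python's standard library.
    # The paths tested below are made-up examples.
    from urllib.robotparser import RobotFileParser

    rp = RobotFileParser()
    rp.parse([
        "User-agent: *",
        "Disallow: /buy/",
        "Disallow: /sell/",
        "Disallow: /message/",
        "Disallow: /news/",
        "Disallow: /account/",
    ])

    for path in ("/", "/index.php", "/buy/savelisting.php?homeid=191", "/account/login.php"):
        allowed = rp.can_fetch("Googlebot", "http://www.example.com" + path)
        print(path, "allowed" if allowed else "disallowed")
    Code (markup):
    The first two paths should come back allowed and the last two disallowed; if they don't, something in the file is still off.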
     
    minstrel, Feb 10, 2005 IP
  4. Blogmaster

    Blogmaster Blood Type Dating Affiliate Manager

    #4
    Google completely disregards robot instructions ... they don't want to be told what to do
     
    Blogmaster, Feb 10, 2005 IP
  5. minstrel

    minstrel Illustrious Member

    #5
    That's absolute nonsense, sitetutor.

    If you meant it as some sort of satirical comment, you forgot the smiley.
     
    minstrel, Feb 10, 2005 IP
  6. Blogmaster

    Blogmaster Blood Type Dating Affiliate Manager

    #6
    I have seen examples where they do the opposite of what they were instructed to do.
     
    Blogmaster, Feb 10, 2005 IP
  7. minstrel

    minstrel Illustrious Member

    #7
    They may do the opposite of what the webmaster intended to instruct, but take a look at the robots.txt file in question -- if Googlebot didn't know how to interpret that mess, you can hardly blame it.

    I'd like to see even a single example of Googlebot ignoring a properly constructed robots.txt file.
     
    minstrel, Feb 10, 2005 IP
  8. Blogmaster

    Blogmaster Blood Type Dating Affiliate Manager

    #8
    The majority of webmasters do NOT properly instruct ... that is who the rest is paying for! Not the smartest move on G's part, but that is what they are doing!
     
    Blogmaster, Feb 10, 2005 IP
  9. minstrel

    minstrel Illustrious Member

    #9
    :confused:

    who is paying for what? what "move on G's part" isn't smart?
     
    minstrel, Feb 10, 2005 IP
  10. vlead

    vlead Peon

    #10
    vlead, Feb 10, 2005 IP
  11. minstrel

    minstrel Illustrious Member

    #11
    Yes it is. How long has that entry been there? The "Disallow: /extranet" entry, I mean.

    All I see there is a non-cached log-in page in the first search string.

    What am I supposed to be looking at with the second search string?
     
    minstrel, Feb 10, 2005 IP
  12. vlead

    vlead Peon

    #12
    For almost a year now. Basically, it has been there ever since we started the extranet.

    The first search string was site:vlead.com and the second was site:www.vlead.com
     
    vlead, Feb 10, 2005 IP
  13. longcall911

    longcall911 Peon

    #13
    Errors aside, there's another issue that may deserve clarification. The robots file instructs a spider not to index specific files and/or folders.

    However, I don't believe that means 'do not access, do not request, or do not crawl' these resources.

    Am I wrong?

    /*tom*/
     
    longcall911, Feb 11, 2005 IP
  14. usandr

    usandr Germes

    #14
    Well... it seems like I've figured it out...

    The problem was that I used */buy/*, which actually should be */buy*, without the trailing slash, if I want to disallow all the subdirectories and files within that directory.

    Now it works!

    Minstrel, your advice is good - to use only
    User-agent: *
    Disallow: /buy/
    Disallow: /sell/
    Disallow: /message/
    Disallow: /news/
    Disallow: /account/

    but my robots.txt is correct - it includes only well-known robots and excludes the rest to save bandwidth.

    I checked it through validator
    http://www.searchengineworld.com/cgi-bin/robotcheck.cgi

    and it's fine.

    Again, the key was the slash!

    Correct way is:

    User-agent: *
    Disallow: /buy
    Disallow: /sell
    Disallow: /message
    Disallow: /news
    Disallow: /account
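
    Here is a little sketch of the difference the slash makes, at least under plain
    prefix matching with Python's urllib.robotparser (the paths are made-up, and I
    can't promise Googlebot behaves identically):
    # Compare "Disallow: /buy/" with "Disallow: /buy" under plain prefix matching.
    # The paths are made-up examples; this is the standard parser, not Googlebot.
    from urllib.robotparser import RobotFileParser

    def allowed(disallow_rule, path):
        rp = RobotFileParser()
        rp.parse(["User-agent: *", disallow_rule])
        return rp.can_fetch("Googlebot", "http://www.example.com" + path)

    for path in ("/buy", "/buy/", "/buy/savelisting.php?homeid=191", "/buyers-guide.html"):
        print(path,
              "| /buy/ :", "allowed" if allowed("Disallow: /buy/", path) else "disallowed",
              "| /buy :", "allowed" if allowed("Disallow: /buy", path) else "disallowed")
    Code (markup):
    With the trailing slash, */buy* itself stays allowed but everything inside */buy/* is blocked; without the slash, */buy* is blocked too, but so is anything else that merely starts with those characters, like */buyers-guide.html*.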

    Thank you all! And check your own robots.txt --
    with this hint many of us could get rid of "supplementals" ;)
     
    usandr, Feb 11, 2005 IP
  15. Chrissicom

    Chrissicom Guest

    #15
    There are some spiders which ignore robots.txt instructions even if you set the user-agent to *. There is also an application called Teleport Ultra (an offline browser) which can be instructed to ignore robots instructions when spidering a website. I think Google does indeed visit locations it is not supposed to, but it doesn't index them. I've noticed on my message board that Google tries to spider excluded directories, but the bot receives a "no permission" message because the excluded dirs aren't accessible to the IIS guest user.
     
    Chrissicom, Feb 11, 2005 IP
  16. minstrel

    minstrel Illustrious Member

    #16
    http://www.robotstxt.org/wc/norobots.html
    http://www.robotstxt.org/wc/exclusion-admin.html
    Your robots.txt file still contains numerous invalid user-agent identifiers.
     
    minstrel, Feb 11, 2005 IP
  17. minstrel

    minstrel Illustrious Member

    #17
    http://www.searchengineworld.com/robots/robots_tutorial.htm
    See also http://www.searchengineworld.com/misc/robots_txt_crawl.htm for common errors.
     
    minstrel, Feb 11, 2005 IP
  18. minstrel

    minstrel Illustrious Member

    #18
    Note that the SearchEngineWorld validator does not check for invalid user-agent designations. Your robots.txt file does indeed "validate" according to that script, but as an example look at this:

    This is partial output from the validator containing invalid user-agent lines. Since the validator script doesn't check those lines, it "passes" them, but they are not valid.
     
    minstrel, Feb 11, 2005 IP
  19. usandr

    usandr Germes

    #19
    Thanks, Minstrel!
    Well... after changing from "/buy/" to "/buy", Googlebot stopped crawling the directory and all its files.
    You might be right about the "validator pass" issue. I've changed it.
    Let's see how it works out.

    Thanks again!
     
    usandr, Feb 11, 2005 IP
  20. minstrel

    minstrel Illustrious Member

    #20
    usandr: note that those are not the only user-agent errors -- just three examples of problem entries. There are several others.
     
    minstrel, Feb 11, 2005 IP