robots.txt issues

Discussion in 'Search Engine Optimization' started by mad4, Dec 7, 2006.

  1. #1
    Google is indexing URLs that I have blocked using robots.txt. :confused:

    If I use Disallow: /page.php, should that also stop page.php?id=123? It's not working.
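
    For reference, this is the sort of rule I mean (a minimal sketch; page.php just stands in for my real script name):

        User-agent: *
        Disallow: /page.php

    As far as I know robots.txt rules are prefix matches, so /page.php should cover /page.php?id=123 as well.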

    Google Sitemaps says "URLs restricted by robots.txt: 173", and yet all 173 of those pages are indexed.

    Anyone else noticing this?

     
    mad4, Dec 7, 2006 IP
  2. dilute
    #2
    If they were indexed once, back before they were mentioned in robots.txt, it takes a long time for them to be removed, and they resurface occasionally. Use the Google form to request removal of each page. Also, you may want to try Disallow: /page.php? (which Google uses in its own robots.txt) or even /page.php?*
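
    Something like this (a sketch only; note the trailing ?, and remember that * wildcards are a Googlebot extension, not standard robots.txt):

        User-agent: *
        # blocks /page.php?id=123 etc. but leaves /page.php itself crawlable
        Disallow: /page.php?

    Since rules are prefix matches anyway, Disallow: /page.php?* means the same thing to Googlebot.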
     
    dilute, Dec 7, 2006 IP
  3. mad4
    #3
    Thanks. I'm sure Google can't have grabbed these URLs before they were blocked, though.
     
    mad4, Dec 7, 2006 IP
  5. hans
    #5
    that is actually a pretty old issue with Google
    entire folders excluded by robots.txt still show up, at least as bare URLs, when searching G - for example cgi-bin and other folders I have had excluded for MANY years.

    the URL may live on in the G index forever, but the page itself will most likely never appear in the G cache

    to REMOVE URLs from the G index you have to use the remove URL procedure at G, BUT the remove URL procedure at G requires a 404 to be served in response to a request!

    other excluded URLs reappear again and again for as long as other sites on the web link to them
    G's position on this is strict - I recently had a mail exchange with G about NON-existing folders: G refused to remove the tens of thousands of URLs returning a 404, because half a dozen backlinks to them still exist out there ... in such cases the remove URL procedure is only a temporary fix, and a few months later the non-existing folder reappears (it did so this year), for as long as a single backlink to that non-existing file or folder exists
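
    in other words, before filing the removal request, check that the URL really answers with a 404 - something like this exchange (example.com standing in for your own domain):

        GET /page.php?id=123 HTTP/1.1
        Host: www.example.com

        HTTP/1.1 404 Not Found

    if the server answers 200 OK instead (as most script-driven pages do, even for IDs that no longer exist), the remove URL procedure above will not accept the URL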
     
    hans, Dec 7, 2006 IP
  6. mad4
    #6
    Having done some research, this is an old issue; it has just never happened to me before.

    To summarise: if you block Google's access to certain URLs it will not crawl them. This apparently does not mean they won't be indexed. Google will index them with as much information as it already knows. In most cases that is just the URL, but it can also be a DMOZ title if it's the site's main page.

    Seems pretty stupid to me.
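
    From what I've read, the only real workaround (assuming you can edit the pages) is the opposite approach: let Google crawl the page and put a robots meta tag in its head instead of blocking it in robots.txt:

        <meta name="robots" content="noindex">

    A page blocked by robots.txt never gets crawled, so Google never sees a noindex tag on it anyway.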
     
    mad4, Dec 7, 2006 IP
  7. hans
    #7
    I agree, mad4
    it's stupid from the site owner's point of view

    but it is a smart GIFT TO hackers from G
    often all they need is to find the URLs of certain tools / installed SW, and G happily offers exactly that info to every hacker ( early this year I had repeated hacker intrusions as a direct result of such robots.txt-excluded URLs turning up in G search )

    the MORE URLs a SE has, the higher the stock value of that SE, I guess - and that might be all that matters to G as well ... $
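
    a made-up example of what I mean - a robots.txt like this hands an attacker a map of exactly what SW is installed and where ( the paths here are invented, but typical ):

        User-agent: *
        Disallow: /phpmyadmin/
        Disallow: /admin/login.php

    robots.txt itself is publicly readable, and on top of that G will list those "blocked" URLs in its index if anything links to them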
     
    hans, Dec 7, 2006 IP