Googlebot Not Obeying Robots.txt

Discussion in 'Google' started by xml, May 23, 2004.

  1. #1
    I recently finished work on XMLTraining.com, and it is being spidered well by Google. Googlebot has been obeying everything in the robots.txt file except for one JavaScript source file.

    I have coded my own textual advertising system which matches search queries to advertisements:

    <script src="/thirdparty/?q=xml&x=87436349"></script>

    Google, however, is indexing these JS files, even though my robots.txt file contains the following:

    User-agent: *
    Disallow: /thirdparty/

    I have NO idea why. Could it be that, because a script tag uses SRC rather than HREF, URLs in SRC attributes are not checked against the robots.txt file?
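
    As a sanity check on the rule itself, here is a minimal sketch using Python's standard urllib.robotparser to test whether the two robots.txt lines above cover the script URL. The helper name is_allowed is made up for illustration, and this only shows how a parser that follows the basic robots.txt rules reads the file, not how Googlebot actually behaves.

```python
from urllib.robotparser import RobotFileParser

# The two rules quoted in the post above.
ROBOTS_TXT = """\
User-agent: *
Disallow: /thirdparty/
"""


def is_allowed(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if robots_txt permits user_agent to fetch url.

    Hypothetical helper for illustration; it parses the rules from a
    string instead of downloading them from a live site.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # The script URL from the post, and a page outside /thirdparty/.
    print(is_allowed("Googlebot", "/thirdparty/?q=xml&x=87436349"))
    print(is_allowed("Googlebot", "/index.html"))
```

    If the first check comes back disallowed, the rule is written correctly and the cause lies with whoever is fetching the file, not with the robots.txt syntax.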
     
    xml, May 23, 2004 IP
  2. nohaber

    nohaber Well-Known Member

    #2
    Google uses cached copies of robots.txt and refreshes them from time to time. It might still be using an old copy of your robots.txt. Check out their FAQ section.
     
    nohaber, May 23, 2004 IP
  3. xml

    xml Peon

    #3
    The robots.txt has been unedited the whole time it has been online.
     
    xml, May 23, 2004 IP
  4. Voyager

    Voyager Guest

    #4
    Could it be that the src directive is handled server side and is being executed by your web server without Googlebot being aware of it?

    This is the behavior with the include directive. You can ban Googlebot from seeing your includes, but it does no good, because the files are included by the web server before Googlebot knows what is going on.
     
    Voyager, May 23, 2004 IP
  5. Owlcroft

    Owlcroft Peon

    #5
    There seems to be a strong current of feeling that JavaScript links will eventually all be visible to Google. A simple alternative that ought to work forever is to use a PHP redirection script and forbid *that* to Google (or any robot).

    Since the actual link is to the referrer, the robot can be stopped (assuming it is one that honors robots.txt files).

    You can find a working example available for free download at http://seo-toys.com (the "Via" toy).
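
    A rough sketch of the redirection idea follows. The suggestion above is a PHP script; this stand-in is written in Python as a tiny WSGI app, and it is not the "Via" toy itself. The /via/ path, the url parameter, and the name via_app are all assumptions made for illustration.

```python
from urllib.parse import parse_qs

# Assumed companion rule in robots.txt, so that well-behaved robots
# never request the redirector at all:
#
#   User-agent: *
#   Disallow: /via/


def via_app(environ, start_response):
    """Hypothetical redirector: /via/?url=<target> answers with a 302."""
    params = parse_qs(environ.get("QUERY_STRING", ""))
    target = params.get("url", [""])[0]

    # Only redirect to absolute http(s) URLs; reject everything else.
    if not target.startswith(("http://", "https://")):
        start_response("400 Bad Request", [("Content-Type", "text/plain")])
        return [b"missing or invalid url parameter\n"]

    start_response("302 Found", [("Location", target)])
    return [b""]


if __name__ == "__main__":
    # Serve locally for a quick manual test.
    from wsgiref.simple_server import make_server

    make_server("127.0.0.1", 8000, via_app).serve_forever()
```

    Pages then link to /via/?url=... instead of the real destination. A real deployment would also want a whitelist of allowed targets so the script cannot be abused as an open redirect.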
     
    Owlcroft, May 24, 2004 IP
  6. disgust

    disgust Guest

    #6
    Some people have claimed that Google already crawls through JS links, but I don't think this is the case.

    We had some pages up for almost a year and a half that ONLY used JS links. The main/gate page was a PR4. None of the links inside it were cached.
     
    disgust, May 24, 2004 IP
  7. mxlabs

    mxlabs Peon

    #7
    It's more likely that somebody just changed their user agent to go fishing for cloaked pages or something... Especially since you said "except a JavaScript source file". And I still don't believe that Googlebot already spiders JS links.
     
    mxlabs, May 28, 2004 IP
  8. xml

    xml Peon

    #8
    mxlabs, I don't get what you're saying.

    The JavaScript source is in the Google index, as in you can search for it. Since it is blocked in robots.txt, it should not appear in the index.
     
    xml, May 29, 2004 IP
  9. mxlabs

    mxlabs Peon

    #9
    Oh, so Google is actually indexing the JS itself? I didn't get that part.

    In that case I guess it might be because of the SRC instead of HREF, as you already mentioned. I'm quite astonished that Googlebot can read those JS parts.
     
    mxlabs, May 29, 2004 IP
  10. North Carolina SEO

    North Carolina SEO Well-Known Member

    #10
    This surprises me as well. I've had JS links on sites and have never seen them indexed on Google as yet. This is well worth noting...
     
    North Carolina SEO, Jun 1, 2004 IP
  11. digitalpoint

    digitalpoint Overlord of no one Staff

    #11
    Google is grabbing external JavaScript files with a user agent of "Googlebot/Test". There is more info on it over here.
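
    To confirm this on your own server, something like the sketch below can pull the matching requests out of an access log. It assumes the common Apache-style "combined" log format, which may not match your server's configuration. The function name find_agent_hits is made up for illustration; the default path prefix is the /thirdparty/ directory from the first post.

```python
import re

# Assumed "combined" log format:
# ip ident user [time] "METHOD path PROTO" status bytes "referer" "agent"
LOG_LINE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) \S+ "(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)


def find_agent_hits(lines, agent_substring="Googlebot/Test",
                    path_prefix="/thirdparty/"):
    """Return (time, path, agent) for each request whose user agent
    contains agent_substring and whose path starts with path_prefix.

    Lines that do not match the assumed log format are skipped.
    """
    hits = []
    for line in lines:
        match = LOG_LINE.match(line)
        if match is None:
            continue
        if (agent_substring in match["agent"]
                and match["path"].startswith(path_prefix)):
            hits.append((match["time"], match["path"], match["agent"]))
    return hits
```

    Pass it any iterable of log lines, such as an open file object for your access log. Any hits would show that agent requesting files under the disallowed directory.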
     
    digitalpoint, Jun 1, 2004 IP
  12. xml

    xml Peon

    #12
    Interesting....

    Cheers!
     
    xml, Jun 2, 2004 IP
  13. Alahad

    Alahad Peon

    #13
    Yes... interesting...
     
    Alahad, Jul 31, 2009 IP