Recently I finished work on XMLTraining.com, and it is being spidered well by Google. Googlebot has been obeying everything in the robots.txt file except a JavaScript source file. I have coded my own textual advertising system, which matches search queries to advertisements:

<script src="/thirdparty/?q=xml&x=87436349"></script>

Google, however, is indexing these JS files, even though the following is in my robots.txt file:

User-agent: *
Disallow: /thirdparty/

I have NO idea why. My best guess is that because the script tag uses SRC rather than HREF, SRC attributes are not checked against the robots.txt file?
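For reference, my understanding is that robots.txt Disallow rules match any URL that starts with the given value, so the rule should already cover the script URL, query string and all (rough illustration; the last URL is made up to show a non-match):

Disallow: /thirdparty/
  /thirdparty/                      blocked
  /thirdparty/?q=xml&x=87436349     blocked (starts with /thirdparty/)
  /thirdparty.html                  not blocked (does not start with /thirdparty/)

So it does not look like a pattern-matching mistake on my end.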
Google uses cached copies of robots.txt and refreshes them from time to time. It might still be using an old copy of yours. Check out their FAQ section.
Could it perhaps be that the src is handled server side, and is being resolved by your web server without Googlebot being aware of it? This is the behavior with the include directive: you can ban Googlebot from seeing your includes, but it does no good, because the files are included by the web server before Googlebot ever sees the page.
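To illustrate the include behavior I mean (a rough sketch; the file name is made up):

<!--#include virtual="/includes/ad.html" -->

The web server merges that file into the HTML before it is sent, so Googlebot never requests /includes/ad.html itself and a Disallow line for it is pointless. The question is whether your server is doing something similar with that /thirdparty/ URL before the page goes out.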
There seems to be a strong current of feeling that JavaScript links will eventually all be visible to Google. A simple alternative that ought to work forever is to use a PHP redirection script and forbid *that* to Google (or any robot), roughly as sketched below. Since the actual link points to the redirect script, the robot can be stopped (assuming it is one that honors robots.txt files). You can find a working example available for free download at http://seo-toys.com (the "Via" toy).
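The basic idea looks something like this (a bare-bones sketch, not the Via toy itself; the /go/ path, the 'to' parameter, and the target list are made up for illustration):

robots.txt:
User-agent: *
Disallow: /go/

/go/index.php:
<?php
// Minimal redirect script: look up the real destination from a short
// whitelist so the query string never becomes an open redirect.
$targets = array(
    'xml' => 'http://www.example.com/xml-course/',
);
$key = isset($_GET['to']) ? $_GET['to'] : '';
if (isset($targets[$key])) {
    header('Location: ' . $targets[$key]);
} else {
    header('HTTP/1.0 404 Not Found');
}
exit;
?>

Your pages then link to /go/?to=xml instead of the destination directly, and any robot that honors robots.txt never follows it.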
Some people have claimed that Google already crawls through JS links, but I don't think this is the case. We had some pages up for almost a year and a half that ONLY used JS links; the main/gate page was a PR4, and none of the pages it linked to were ever cached.
It's more likely that somebody just changed their user agent to go fishing for cloaked pages or something, especially since you said "except a JavaScript source file". And I still don't believe that Googlebot spiders JS links yet.
mxlabs, I don't get what you're saying. The JavaScript source is in the Google index, as in you can search for it. If it is blocked in robots.txt, as it is, it should not appear in the index.
Oh, so Google is actually indexing the JS itself? I didn't get that part. In that case I guess it might be because of the SRC instead of HREF, as you already mentioned. I'm quite astonished that Googlebot can read those JS parts.
This surprises me as well. I've had JS links on sites and have never seen them indexed by Google as yet. This is well worth noting...
Google is grabbing external JavaScript files with a user agent of "Googlebot/Test". There is more info on it over here.