Does anyone know how to stop Google from indexing JavaScript files? The folder that all of my JavaScript files are in has been explicitly disallowed in my robots.txt since day one. However, robots.txt is technically being respected, because Google isn't actually accessing my script files; it is merely indexing them, having found their URLs by looking at the src attribute of my script tags. But I want it to stop, because it's quite ridiculous. How does one prevent this? I can use Webmaster Tools to remove them after they have shown up, but that's not very helpful, especially if I add more script files in the future. Ideas?
vpguy, interesting. I am not quite sure I follow, but I think you are saying that Google is listing the URLs of the JavaScript files when you do some particular search on Google? If so, first make sure the folder really is being properly disallowed in the robots.txt file, and make sure there are no errors in the file by running it through a validator (do a search on Google for "robots.txt validator").

I find it interesting that the search is producing results for the URL. If your JavaScript files have unique names, and there are no partial matches for that phrase in any other directory or file on your site, you can try blocking the filename itself by disallowing it in the robots.txt; you don't have to include the .js, just part of the filename. But be careful with this technique, and make sure there are no other patterns on your site that would match.

The other thing that may work is to name the files something nonsensical, like a lowercase hexadecimal-style pattern. Basically, non-language filenames that could never be searched on, like 6y2wz6qxcmn.js, something strange enough that no query on Google would produce them as a result.

That is all I can think of at the moment, other than cloaking, which I don't recommend, whether it be user-agent, IP, or object detection. That's best left to the pros, because it can cause problems if discovered. If you like, be more specific and perhaps more ideas will come to mind.
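To make the filename approach concrete, here is a rough sketch (the /scripts/ folder and the menu-nav name are made up for illustration; the second entry just reuses the nonsense name from above):

    User-agent: *
    Disallow: /scripts/menu-nav
    Disallow: /scripts/6y2wz6qxcmn

Each Disallow value is matched as a prefix of the URL path, so /scripts/menu-nav also covers /scripts/menu-nav.js and anything else beginning with that string, which is exactly why you need to check for partial matches elsewhere on the site first.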
My robots.txt passes the validators (as it should; it's very small and simple). The two relevant lines just disallow the /gui/ folder, where my CSS and JavaScript files are located (they're sketched at the end of this post), and that folder has been disallowed since I launched the site. The JavaScript files have only recently started appearing in the index, and to the best of my knowledge they only show up when doing a site:mydomain.com type of search.

It is irritating because I went to great lengths to make sure that none of my pages have identical titles/metadata/content, specifically to avoid the duplicate-content message: "In order to show you the most relevant results, we have omitted some entries very similar to the ones already displayed. If you like, you can repeat the search with the omitted results included." Now this message appears (when doing a site:mydomain.com search), because the JavaScript files are showing up as URLs only. Two or more pages which have no content (because they were not crawled) are considered to be duplicates, thereby triggering this message.

I guess I have two problems with this:

1. Restricting files (and folders) in robots.txt does not prevent things from being indexed, it only prevents them from being crawled. Doesn't that make robots.txt about 50% useless?

2. The fact that Google thinks that URLs found within <script> tags need to be indexed - do script files normally contain really juicy content?
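For reference, here is the sketch of those two lines mentioned above (the wildcard user-agent is an assumption; only the /gui/ disallow is certain):

    User-agent: *
    Disallow: /gui/

That rule stops the disallowed URLs from being fetched, but as described above, it does not stop the bare URLs from showing up in a site: search.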
Actually, Google may crawl and parse any file it wishes; robots.txt does not technically stop them from doing so. What a compliant crawler is supposed to abide by is not fetching the disallowed URLs, which, as you have found, is not the same thing as keeping those URLs out of the index. And no, JavaScript isn't supposed to be juicy.