Hi. If I add the following robots.txt file, will Google exclude all the URLs starting with index.php inside "dir1", or only the single file index.php inside the "dir1" directory?

User-agent: *
Disallow: /dir1/index.php
Allow: /

I mean, will this disallow only the URL www.site/dir1/index.php, or does it mean Google will also disallow all the URLs that start with index.php inside the "dir1" directory? For example: www.site/dir1/index.php-product_IDx.html

Thanks
I'm not an expert on this either, but I've read up on it and can answer you. Use this for your robots.txt:

User-agent: *
Disallow: /dir1/index.php*
Allow: /

The * symbol means the rule matches index.php itself as well as all URLs that start with index.php, such as index.php-abc-xyz or index.php-abc-456-hght.html.
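To picture what that rule matches, here is a quick sketch with example URLs in the comments (the product URL is the one from the question; the others are hypothetical):

User-agent: *
# Blocked:     /dir1/index.php
# Blocked:     /dir1/index.php-product_IDx.html
# Not blocked: /dir1/page.html  (hypothetical; does not start with index.php)
# Not blocked: /dir2/index.php  (hypothetical; different directory)
Disallow: /dir1/index.php*
Allow: /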
It will disallow crawlers from crawling URLs that start with /dir1/index.php, and allow all other pages to be crawled.
Unfortunately this is complete nonsense. The wildcard (*) on User-agent affects all bots obeying the robots.txt protocol. There is no wildcard for the Disallow directive! Learn about robots.txt and the robots exclusion standard here: http://rield.com/cheat-sheets/robots-exclusion-standard-protocol

Also, if you - MyArtGallery - want to get rid of the index.php files appearing in Google or to users, you should write a rewrite rule for that. Using robots.txt to tackle that issue will not be very effective. What you want to do is to canonicalize your directory index. I wrote a how-to on that, too: http://rield.com/how-to/directoryindex-canonicalization

All examples there work for index.php files as well. Just change .html to .php in both lines of your .htaccess file. I hope this helps to solve your initial problem.
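For anyone who wants the general shape without following the link, here is a minimal sketch of that kind of directory-index canonicalization, assuming Apache with mod_rewrite enabled (the exact rules on the linked how-to may differ):

RewriteEngine On
# Match only external client requests for index.php; THE_REQUEST is the raw
# request line, so the internal DirectoryIndex subrequest will not loop
RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /.*index\.php
# 301-redirect /dir1/index.php to /dir1/
RewriteRule ^(.*)index\.php$ /$1 [R=301,L]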
I put up a robots.txt file with a * for PDF files. When I checked with my host, they said it wouldn't work because you have to specify each file individually, so it seems there are some different opinions on this. So far, using * hasn't worked: the file is still indexed.
It takes some time to deindex an already indexed URL. Make sure to give it enough time to update, and for search engines to apply your robots.txt directive. According to Google's webmaster help, to block all .pdf files you need something like this:

User-agent: *
Disallow: /*.pdf$
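The $ matters here: it anchors the pattern to the end of the URL, so only URLs ending in .pdf are blocked. A quick sketch with hypothetical example URLs in the comments:

User-agent: *
# Blocked:     /files/report.pdf       (ends in .pdf)
# Not blocked: /files/report.pdf?dl=1  (does not end in .pdf)
Disallow: /*.pdf$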
For Googlebot, yes, there is a wildcard for Disallow. For others I can't say for sure. http://www.google.com/support/webmasters/bin/answer.py?answer=156449
Since you asked about Google specifically (62 days ago....), according to their webmaster help pages, the rule should be what prince@life previously mentioned:

User-agent: *
Disallow: /dir1/index.php*

I haven't tested it, but their help pages say so. I posted the link to the Google help page in my previous post.
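One side note, for what it's worth: Google documents Disallow values as prefix matches, so the trailing * should be redundant; a sketch without it ought to block the same URLs:

User-agent: *
# Prefix match: blocks /dir1/index.php and /dir1/index.php-product_IDx.html
Disallow: /dir1/index.php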