Hi. If I add the following robots.txt file, will Google exclude all the URLs starting with index.php inside "dir1", or only the single file index.php inside the "dir1" directory?

User-agent: *
Disallow: /dir1/index.php
Allow: /

I mean, will this disallow only the URL www.site/dir1/index.php, or does it mean Google will also disallow all the URLs that start with index.php inside the "dir1" directory? For example: www.site/dir1/index.php-product_IDx.html

Thanks
I'm not an expert on this either, but I've read up on it and can answer you. Use this for your robots.txt:

User-agent: *
Disallow: /dir1/index.php*
Allow: /

The * symbol means the rule matches index.php itself as well as all URLs that start with index.php, such as index.php-abc-xyz or index.php-abc-456-hght.html.
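To picture what that rule matches, here is a quick sketch with example URLs in the comments (the product URL is the one from the question; the others are hypothetical):

User-agent: *
# Blocked:     /dir1/index.php
# Blocked:     /dir1/index.php-product_IDx.html
# Not blocked: /dir1/page.html  (hypothetical; does not start with index.php)
# Not blocked: /dir2/index.php  (hypothetical; different directory)
Disallow: /dir1/index.php*
Allow: /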
It will disallow crawlers from crawling URLs that start with /dir1/index.php, and allow all other pages to be crawled.
Unfortunately this is complete nonsense. The wildcard (*) on User-agent affects all bots obeying the robots.txt protocol. There is no wildcard for the Disallow directive! Learn about robots.txt and the robots exclusion standard here: http://rield.com/cheat-sheets/robots-exclusion-standard-protocol

Also, if you - MyArtGallery - want to get rid of the index.php files appearing in Google or to users, you should write a rewrite rule for that. Using robots.txt to tackle that issue will not be very effective. What you want to do is to canonicalize your directory index. I wrote a how-to on that, too: http://rield.com/how-to/directoryindex-canonicalization

All examples there work for index.php files as well. Just change .html to .php in both lines of your .htaccess file. I hope this helps to solve your initial problem.
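For anyone who wants the general shape without following the link, here is a minimal sketch of that kind of directory-index canonicalization, assuming Apache with mod_rewrite enabled (the exact rules on the linked how-to may differ):

RewriteEngine On
# Match only external client requests for index.php; THE_REQUEST is the raw
# request line, so the internal DirectoryIndex subrequest will not loop
RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /.*index\.php
# 301-redirect /dir1/index.php to /dir1/
RewriteRule ^(.*)index\.php$ /$1 [R=301,L]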
I put up a robots.txt file with a * for PDF files. When I checked with my host, they said it wouldn't work because you have to specify each file individually, so it seems there are some different opinions on this. So far, using * hasn't worked: the file is still indexed.
It takes some time to deindex an already indexed URL. Make sure to give it enough time to update, and for search engines to apply your robots.txt directive. According to Google's webmaster help, to block all .pdf files you need something like this:

User-agent: *
Disallow: /*.pdf$
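The $ matters here: it anchors the pattern to the end of the URL, so only URLs ending in .pdf are blocked. A quick sketch with hypothetical example URLs in the comments:

User-agent: *
# Blocked:     /files/report.pdf       (ends in .pdf)
# Not blocked: /files/report.pdf?dl=1  (does not end in .pdf)
Disallow: /*.pdf$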
For Googlebot, yes, there is a wildcard for Disallow. For others I can't say for sure. http://www.google.com/support/webmasters/bin/answer.py?answer=156449
Since you asked about Google specifically (62 days ago....), according to their webmaster help pages, the rule should be what prince@life previously mentioned:

User-agent: *
Disallow: /dir1/index.php*

I haven't tested it, but their help pages say so. I posted the link to the Google help page in my previous post.
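One side note, for what it's worth: Google documents Disallow values as prefix matches, so the trailing * should be redundant; a sketch without it ought to block the same URLs:

User-agent: *
# Prefix match: blocks /dir1/index.php and /dir1/index.php-product_IDx.html
Disallow: /dir1/index.php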