Gbot is crawling disallowed directories...

usandr Germes

Messages:: 57

Likes Received:: 0

Best Answers:: 0

Trophy Points:: 0

#1

OK... here is a a question/problem and I would greatly appreciate any
help!

Site is in sig - homesalewizard.

Robots.txt is set as:

User-agent: Googlebot
Disallow: /buy/
Disallow: /sell/

Those are directories mostly for user's accounts.
Googlebot continues to crawl through them...

66.249.71.18 - - [07/Feb/2005:23:58:23 -0600] "GET /robots.txt
HTTP/1.0" 200 8408 "-" "Googlebot/2.1
(+http://www.google.com/bot.html)"
66.249.71.18 - - [07/Feb/2005:23:58:24 -0600] "GET /buy/ HTTP/1.0" 200
24336 "-" "Googlebot/2.1 (+http://www.google.com/bot.html)"
66.249.66.76 - - [08/Feb/2005:00:11:18 -0600] "GET
/message.php?userid=&homeid=3117&photoid=6452&menu=news&menu_open=1&agentid=725&active=email
HTTP/1.1" 200 2038 "-" "Mozilla/5.0 (compatible; Googlebot/2.1;
+http://www.google.com/bot.html)"
66.249.64.33 - - [08/Feb/2005:00:26:13 -0600] "GET
/about.php?active=tou HTTP/1.0" 200 28393 "-" "Googlebot/2.1
(+http://www.google.com/bot.html)"
66.249.66.76 - - [08/Feb/2005:00:41:00 -0600] "GET
/buy/rclick.php?i=2&a=756656420 HTTP/1.1" 302 16313 "-"
"Mozilla/5.0 (compatible; Googlebot/2.1;
+http://www.google.com/bot.html)"
66.249.66.76 - - [08/Feb/2005:00:41:01 -0600] "GET
/homes-for-sale/listing/.html HTTP/1.1" 404 24611 "-" "Mozilla/5.0
(compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
66.249.66.76 - - [08/Feb/2005:00:41:27 -0600] "GET
/buy/rclick.php?i=1&a=363681701 HTTP/1.1" 302 16375 "-"
"Mozilla/5.0 (compatible; Googlebot/2.1;
+http://www.google.com/bot.html)"
66.249.66.76 - - [08/Feb/2005:00:42:09 -0600] "GET
/buy/savelisting.php?homeid=191 HTTP/1.1" 200 16222 "-"
"Mozilla/5.0 (compatible; Googlebot/2.1;
+http://www.google.com/bot.html)"
66.249.66.76 - - [08/Feb/2005:00:42:33 -0600] "GET
/buy/savelisting.php?homeid=542 HTTP/1.1" 200 16270 "-"
"Mozilla/5.0 (compatible; Googlebot/2.1;
+http://www.google.com/bot.html)"
Click to expand...

So the question is - if */buy/* disallowed would it automatically
exclude something like */buy/savelisting.php?homeid=191* ?

I feel like we are in the middle of mess with indexing. G's used to show
internal pages in SERP and seems like it's not any longer.

Thanks for help!

usandr, Feb 9, 2005 IP

DVDsPlusMore Guest

Messages:: 14

Likes Received:: 0

Best Answers:: 0

Trophy Points:: 0

#2

Hmmmmm ... very puzzling situation. I'm not sure if I've resolved it or not, but here's some help with your problem-solving.

Using the validation tool at SearchEngineWorld, I checked the syntax of your robots.txt file and found no obvious errors.

I then doublechecked Google's advice to webmasters on this topic. They have some helpful instructions -- all of which you seem to be following -- in their webmaster FAQs.

So ... no obvious problems that I could see. Any thoughts from web gurus more experienced with these issues?

Best,

- James

DVDsPlusMore, Feb 10, 2005 IP

minstrel Illustrious Member

Messages:: 15,082

Likes Received:: 1,243

Best Answers:: 0

Trophy Points:: 480

#3

Your robots.txt file is all messed up now... I don't know if it looked like this when Jamews tried to validate it but it's full of errors now.

For one thing, you have invalid user-agent designations as well as comments in the user-agent lines. Your syntax for many of the the Disallow lines is incorrect. And the file is HUGE! You can eliminate most of the repetition using
User-agent: *
Code (markup):
And the file as it exists now finishes with
User-agent: *
Disallow: / 
Code (markup):
which is saying "note to ALL spiders -- do not index ANYTHING".

Start over with this robots.txt file:
User-agent: *
Disallow: /buy/
Disallow: /sell/
Disallow: /message/
Disallow: /news/
Disallow: /account/
Code (markup):
and dump everything else.

minstrel, Feb 10, 2005 IP

Blogmaster Blood Type Dating Affiliate Manager

Messages:: 25,924

Likes Received:: 1,354

Best Answers:: 0

Trophy Points:: 380

#4

Google completely disregards robot instructions ... they don't want to be told what to do

Blogmaster, Feb 10, 2005 IP

minstrel Illustrious Member

Messages:: 15,082

Likes Received:: 1,243

Best Answers:: 0

Trophy Points:: 480

#5

sitetutor said:

Google completely disregards robot instructions ... they don't want to be told what to do
Click to expand...

That's absolute nonsense, sitetutor.

If you meant it as some sort of satirical comment, you forgot the smiley.

minstrel, Feb 10, 2005 IP

Blogmaster Blood Type Dating Affiliate Manager

Messages:: 25,924

Likes Received:: 1,354

Best Answers:: 0

Trophy Points:: 380

#6

I have seen examples of where they do the opposite of what they were instructed

Blogmaster, Feb 10, 2005 IP

minstrel Illustrious Member

Messages:: 15,082

Likes Received:: 1,243

Best Answers:: 0

Trophy Points:: 480

#7

They may do the opposite of what the webmaster intended to instruct but take a look at the robots.txt file in question -- if Googlebot didn't know how to interpret that mess, you can hardly blame it.

I'd like to see even a single example of Googlebot ignoring a properly constructed robots.txt file.

minstrel, Feb 10, 2005 IP

Blogmaster Blood Type Dating Affiliate Manager

Messages:: 25,924

Likes Received:: 1,354

Best Answers:: 0

Trophy Points:: 380

#8

the mayority of webmasters does NOT properly instruct ... that is who the rest is paying for! not the smartest move on G's part but that is what they are doing!

Blogmaster, Feb 10, 2005 IP

minstrel Illustrious Member

Messages:: 15,082

Likes Received:: 1,243

Best Answers:: 0

Trophy Points:: 480

#9

sitetutor said:

the mayority of webmasters does NOT properly instruct ... that is who the rest is paying for! not the smartest move on G's part but that is what they are doing!
Click to expand...

who is paying for what? what "move on G's part" isn't smart?

minstrel, Feb 10, 2005 IP

vlead Peon

Messages:: 215

Likes Received:: 1

Best Answers:: 0

Trophy Points:: 0

#10

minstrel said:

I'd like to see even a single example of Googlebot ignoring a properly constructed robots.txt file.
Click to expand...

Is mine properly constructed?
www.vlead.com/robots.txt

http://www.google.com/search?q=site...&rls=org.mozilla:en-US:official&start=30&sa=N

http://www.google.com/search?q=site...&rls=org.mozilla:en-US:official&start=20&sa=N

vlead, Feb 10, 2005 IP

minstrel Illustrious Member

Messages:: 15,082

Likes Received:: 1,243

Best Answers:: 0

Trophy Points:: 480

#11

Yes it is. How long has that entry been there? The "Disallow: /extranet" I mean.

All I see there is a non-cached log-in page in the first search string.

What am I supposed to be looking at with the second search string?

minstrel, Feb 10, 2005 IP

vlead Peon

Messages:: 215

Likes Received:: 1

Best Answers:: 0

Trophy Points:: 0

#12

For almost an year now. Basically, it has been there ever since we started the extranet.

The first search string was site:vlead.com and the second was site:www.vlead.com

vlead, Feb 10, 2005 IP

longcall911 Peon

Messages:: 1,672

Likes Received:: 87

Best Answers:: 0

Trophy Points:: 0

#13

Errors aside, there's another issue that may deserve clarification. The robots file instructs a spider not to index specific files and/or folders.

However, I don't believe that means 'do not access, do not request, or do not crawl' these resources.

Am I wrong?

/*tom*/

longcall911, Feb 11, 2005 IP

usandr Germes

Messages:: 57

Likes Received:: 0

Best Answers:: 0

Trophy Points:: 0

#14

Well... seems like I understood it...

the problem was I used /buy/ for example wich actually should be */buy * without slash if I want to disallow alll directories and file exstentions within this directory.

Now it work well!

Minstrel, your advice is good - to use only
User-agent: *
Disallow: /buy/
Disallow: /sell/
Disallow: /message/
Disallow: /news/
Disallow: /account/

but my robots.txt is correct - it includes only well known robots and excludes the rest to save BW.

I checked it through validator
http://www.searchengineworld.com/cgi-bin/robotcheck.cgi

and it's fine.

Again, the key was "slash"!

Correct way is:

User-agent: *
Disallow: /buy
Disallow: /sell
Disallow: /message
Disallow: /news
Disallow: /account

Thank you all! And check your robots.txt
With this hint many of us could get rid of "supplementals"

usandr, Feb 11, 2005 IP

Chrissicom Guest

Messages:: 261

Likes Received:: 8

Best Answers:: 0

Trophy Points:: 0

#15

There are some spiders which ignore robots instructions even if you set the user agent to *. There is also an application called Teleport Ultra (which is an offline browser) which can be instructed to ignore robots instructions when spidering a website. I think Google does indeed visit locations which it is not supposed to but it doesn't index them. I noticed it on my message board though that Google tries to spider excluded directories, but the bot receives a no permission message because the excluded dirs aren't accessible by the IIS guest user.

Chrissicom, Feb 11, 2005 IP

minstrel Illustrious Member

Messages:: 15,082

Likes Received:: 1,243

Best Answers:: 0

Trophy Points:: 480

#16

http://www.robotstxt.org/wc/norobots.html

The following example "/robots.txt" file specifies that no robots should visit any URL starting with "/cyberworld/map/" or "/tmp/", or /foo.html:
--------------------------------------------------------------------------------

# robots.txt for http://www.example.com/

User-agent: *
Disallow: /cyberworld/map/
Disallow: /tmp/
Disallow: /foo.html
Click to expand...

http://www.robotstxt.org/wc/exclusion-admin.html

User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /~joe/

In this example, three directories are excluded.

To exclude all robots from the entire server
User-agent: *
Disallow: /

To allow all robots complete access
User-agent: *
Disallow:

Or create an empty "/robots.txt" file.

To exclude all robots from part of the server
User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /private/

To exclude a single robot
User-agent: BadBot
Disallow: /

To allow a single robot
User-agent: WebCrawler
Disallow:

User-agent: *
Disallow: /

To exclude all files except one
This is currently a bit awkward, as there is no "Allow" field. The easy way is to put all files to be disallowed into a separate directory, say "docs", and leave the one file in the level above this directory:
User-agent: *
Disallow: /~joe/docs/

Alternatively you can explicitly disallow all disallowed pages:
User-agent: *
Disallow: /~joe/private.html
Disallow: /~joe/foo.html
Disallow: /~joe/bar.html
Click to expand...

Your robots.txt file still contains numerous invalid user-agent identifiers.

minstrel, Feb 11, 2005 IP

minstrel Illustrious Member

Messages:: 15,082

Likes Received:: 1,243

Best Answers:: 0

Trophy Points:: 480

#17

http://www.searchengineworld.com/robots/robots_tutorial.htm

User-agent
The User-agent line specifies the robot. For example:

User-agent: googlebot

You may also use the wildcard charcter "*" to specify all robots:

User-agent: *

You can find user agent names in your own logs by checking for requests to robots.txt. Most major search engines have short names for their spiders.
Disallow:
The second part of a record consists of Disallow: directive lines. These lines specify files and/or directories. For example, the following line instructs spiders that it can not download email.htm:

Disallow: email.htm

You may also specify directories:
Disallow: /cgi-bin/

Which would block spiders from your cgi-bin directory.

There is a wildcard nature to the Disallow directive. The standard dictates that /bob would disallow /bob.html and /bob/index.html (both the file bob and files in the bob directory will not be indexed).
Click to expand...

See also http://www.searchengineworld.com/misc/robots_txt_crawl.htm for common errors.

minstrel, Feb 11, 2005 IP

minstrel Illustrious Member

Messages:: 15,082

Likes Received:: 1,243

Best Answers:: 0

Trophy Points:: 480

#18

Note that the SearchEngineWorld validator does not check for invalid user-agent designations. Your robots.txt file does indeed "validate" according to that script but as an example look at this:

27 User-agent: Due to a deficiency in Java it"s not currently possible to set the User-agent.
28 Disallow:
29
30 User-agent: no
31 Disallow:
32
33 User-agent: "Ahoy! The Homepage Finder"
34 Disallow:
Click to expand...

This is partial output from the validator containing invalid user-agent lines. Since the validator script doesn't check those lines, it "passes" them but they are not valid.

minstrel, Feb 11, 2005 IP

usandr Germes

Messages:: 57

Likes Received:: 0

Best Answers:: 0

Trophy Points:: 0

#19

minstrel said:

Note that the SearchEngineWorld validator does not check for invalid user-agent designations. Your robots.txt file does indeed "validate" according to that script but as an example look at this: .....

This is partial output from the validator containing invalid user-agent lines. Since the validator script doesn't check those lines, it "passes" them but they are not valid.
Click to expand...

Thanks, Minstrel!
Well... after changes from "/buy/" to "buy" Gb stopped crawling directory and all files.
You might be correct with "validator pass" issue. Changed it.
Let's see how it will work out.

Thanks again!

usandr, Feb 11, 2005 IP

minstrel Illustrious Member

Messages:: 15,082

Likes Received:: 1,243

Best Answers:: 0

Trophy Points:: 480

#20

usandr: note that those are not the only user-agent errors -- just three examples of problem entries. There are several others.

minstrel, Feb 11, 2005 IP

Log in or Sign up

Gbot is crawling disallowed directories...

usandr Germes

DVDsPlusMore Guest

minstrel Illustrious Member

Blogmaster Blood Type Dating Affiliate Manager

minstrel Illustrious Member

Blogmaster Blood Type Dating Affiliate Manager

minstrel Illustrious Member

Blogmaster Blood Type Dating Affiliate Manager

minstrel Illustrious Member

vlead Peon

minstrel Illustrious Member

vlead Peon

longcall911 Peon

usandr Germes

Chrissicom Guest

minstrel Illustrious Member

minstrel Illustrious Member

minstrel Illustrious Member

usandr Germes

minstrel Illustrious Member

Useful Searches