robots.txt pointing to index page

NewComputer Well-Known Member

Messages:: 2,021

Likes Received:: 68

Best Answers:: 0

Trophy Points:: 188

#1

Would this be enough to get you banned for spam or for some other reason? Say your site is www.mydomain.com and www.mydomain.com/robots.txt points to your index page, so that anyone who types it in sees your home page. Is this bad?

NewComputer, Sep 26, 2004

Smyrl Tomato Republic Staff

Messages:: 13,740

Likes Received:: 1,702

Best Answers:: 78

Trophy Points:: 510

#2

Who knows about robots.txt files other than webmasters or search engines? What would be the advantage of such?

Shannon

Smyrl, Sep 26, 2004

NewComputer Well-Known Member

#3

I am thinking it was done by mistake, but when you enter the domain plus /robots.txt you see the index page.

NewComputer, Sep 26, 2004

NewComputer Well-Known Member

#4

No one has any input? I thought someone would have an idea here for sure....

NewComputer, Sep 26, 2004

dazzlindonna Peon

Messages:: 553

Likes Received:: 21

Best Answers:: 0

Trophy Points:: 0

#5

It may be confusing the bots, and in that case, then yes, it would be bad.

dazzlindonna, Sep 26, 2004

NewComputer Well-Known Member

#6

Thanks, Donna. Worst case scenario is we change it and it does not change anything.

NewComputer, Sep 26, 2004

minstrel Illustrious Member

Messages:: 15,082

Likes Received:: 1,243

Best Answers:: 0

Trophy Points:: 480

#7

Smyrl said:

What would be the purpose?

That is indeed the question...

I did participate in a forum investigation a few months back (another forum) of a website which "vanished" from Google for no obvious reason other than that the robots.txt file was a copy of the index page (or redirected to the index page). I don't know that the "funny" robots.txt file was the problem but it was the only obviously strange thing about the site.

So go back to Smyrl's question: I don't see what you have to gain by doing this, and it may have a disadvantage.

So: no real advantage -- uncertain but possible disadvantage.

Why take the chance?

minstrel, Sep 26, 2004

NewComputer Well-Known Member

#8

What happened was they removed the robots.txt file, so if you type the URL in you are redirected. What I am now wondering is whether this would in fact cause spiders to skip off. Do they come in looking for www.mydomain.com/robots.txt, or would they already be on the server, look for the robots.txt file, and continue to crawl when they don't find one?

NewComputer, Sep 27, 2004

minstrel Illustrious Member

#9

I don't know -- spiders aren't very bright -- really, all they know how to do is follow links.

On the other hand, look at that answer: "I don't know". So ask again, "Do I want to take the risk?"

If you no longer want a robots.txt file, for whatever reason, you can delete it or have a file that just says

User-agent: *
Disallow:

That way, there is no risk.

Also, when you say the robots.txt file "redirects" to the home page, how exactly is that done? As an htaccess redirect? or from the robots.txt file itself? Is there still a file titled "robots.txt" in the root directory?

minstrel, Sep 27, 2004

NewComputer Well-Known Member

#10

Thanks Minstrel,

There is no robots.txt file in the directory. When you type www.mydomain.com/robots.txt you are sent to what is a copy of the homepage. I believe this is done by mod_rewrite. I am not exactly sure how it is done, but you can type anything after the .com and it will take you to that page, so I am assuming mod_rewrite. I am not the webmaster, just trying to help out. I have let them know that they need to put the robots.txt back, but they said they removed it because it was causing one of the big three bots to hit the .txt file and then move on. They said there was nothing in there banning anything, and here is what I believe it was:

#User-agent: lycra
#Disallow: /

#User-agent: *
#Disallow: /tmp
#Disallow: /logs

User-agent: *
Disallow:

So, not too sure why they would turn away from that. Maybe someone here can help.

NewComputer, Sep 27, 2004

Mel Peon

Messages:: 369

Likes Received:: 14

Best Answers:: 0

Trophy Points:: 0

#11

Sounds to me like you may have a custom 404 page which redirects to the homepage, so when the bots come by and attempt to read the robots file they are served your home page instead. At best that is going to confuse them, and they will request the robots file every time they come to your site.

There is no reason why a generic robots.txt file which allows spidering of everything would cause any spider to leave, or for that matter a blank robots.txt file.

But since it is not a mindboggling exercise to put in a good robots.txt file which may well save you from having to entertain all the email harvesters who come by, why not set things up right?
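One way to spot the situation described above is to look at what the server actually sends back for /robots.txt and check whether it is a rules file or an HTML page. A minimal sketch in Python, assuming you already have the response body as text; the helper name `looks_like_html` and both sample bodies are made up for illustration, and the check is only a rough heuristic:

```python
def looks_like_html(body: str) -> bool:
    """Heuristic: True if text served as robots.txt looks like an HTML page."""
    # Only inspect the start of the body; a home page announces itself early.
    head = body.lstrip().lower()[:512]
    if head.startswith(("<!doctype", "<html")):
        return True
    return "<head" in head or "<body" in head


# Hypothetical sample responses, not taken from any real site.
real_rules = "User-agent: *\nDisallow:\n"
home_page = (
    "<!DOCTYPE html>\n"
    "<html><head><title>Home</title></head><body>Welcome</body></html>"
)

print(looks_like_html(real_rules))
print(looks_like_html(home_page))
```

If the second case is what the site returns, the rewrite or custom 404 handling is catching /robots.txt along with every other missing URL.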

Mel, Sep 27, 2004

minstrel Illustrious Member

#12

#User-agent: lycra
#Disallow: /

#User-agent: *
#Disallow: /tmp
#Disallow: /logs

User-agent: *
Disallow:

is not in the recommended order... should have been

#User-agent: *
#Disallow: /tmp
#Disallow: /logs

#User-agent: lycra
#Disallow: /

That, by the way, is telling "lycra" to go away and not have access to anything -- was that what they wanted?

Honestly, I would tell them to delete the redirect and either correct the robots.txt file to do what they want it to do or just delete it -- the only ones getting 404 errors would be spiders and they don't care -- if they don't find an instruction to "not spider" they will spider.
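The point about a missing file can be sketched as a small decision table keyed on the HTTP status of the /robots.txt request. This follows the behaviour later written down in RFC 9309 (the Robots Exclusion Protocol); individual spiders in 2004 may have behaved differently, and the function name `crawl_policy` is made up for illustration:

```python
def crawl_policy(status: int) -> str:
    """Map the HTTP status of a /robots.txt request to a crawl decision."""
    if 200 <= status < 300:
        return "parse rules"      # a file was served: read and obey it
    if 300 <= status < 400:
        return "follow redirect"  # look for the rules at the new location
    if 400 <= status < 500:
        return "allow all"        # no file available: no restrictions apply
    if 500 <= status < 600:
        return "disallow all"     # server trouble: assume nothing may be crawled
    return "retry later"


for status in (200, 301, 404, 503):
    print(status, crawl_policy(status))
```

Under this reading a plain 404 is harmless, while a redirect to the home page leaves the spider trying to parse HTML as rules.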

minstrel, Sep 27, 2004

NewComputer Well-Known Member

#13

Isn't the # put there to indicate do not read?

NewComputer, Sep 27, 2004

minstrel Illustrious Member

#14

Doh! Sharp eyes, there, NewComputer -- I didn't even see it.

Yep. The "#" is a comment, so all of that should be ignored, except for

User-agent: *
Disallow:

which is fine because it says "spider everything".
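This reading can be checked with the robots.txt parser in Python's standard library. A minimal sketch, assuming the file really was as quoted in post #10; the agent names and paths tested are only examples, and other parsers may differ in detail:

```python
from urllib.robotparser import RobotFileParser

# The file as quoted in post #10: two commented-out groups, one live group.
ROBOTS_TXT = """\
#User-agent: lycra
#Disallow: /

#User-agent: *
#Disallow: /tmp
#Disallow: /logs

User-agent: *
Disallow:
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Lines starting with "#" are ignored, so only the final group applies.
checks = [("Slurp", "/"), ("lycra", "/"), ("Googlebot", "/tmp/cache.html")]
for agent, path in checks:
    allowed = parser.can_fetch(agent, "http://www.mydomain.com" + path)
    print(agent, path, allowed)
```

Every check comes back allowed, including "lycra" and the /tmp path that the commented-out rules would have blocked.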

minstrel, Sep 27, 2004

NewComputer Well-Known Member

#15

Yeah, but this still does not explain why Yahoo! (Inktomi/Slurp) is hitting the robots.txt file and then leaving without spidering. Can anyone think of an answer?

NewComputer, Sep 27, 2004

SEbasic Peon

Messages:: 6,317

Likes Received:: 318

Best Answers:: 0

Trophy Points:: 0

#16

Might just be waiting before they do a big crawl... They don't always do the whole lot in one go.

If the robots file was the problem, it could take a few hours/days before they start crawling the whole site again.

SEbasic, Sep 27, 2004

NewComputer Well-Known Member

#17

Thanks SE, the robots.txt file was removed. This leads me to believe that the bots will look for the robots.txt file and not find it; when they don't, they will see in the meta tags to spider everything and they should resume. We'll see. No action yet today.

NewComputer, Sep 27, 2004

SEbasic Peon

#18

Well, good luck with it.

SEbasic, Sep 27, 2004

minstrel Illustrious Member

#19

1. Slurp is weird. Even if it does spider your site, it won't necessarily make a lot of difference in terms of showing up in Yahoo.

2. Is your site updated/modified regularly? If not, it may be that Slurp comes in, checks for robots.txt exclusions, and then gets the headers looking for last modified date. If it hasn't changed, it will go away (this isn't specific to Slurp, by the way).
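The last-modified check in point 2 can be sketched as a comparison between the page's Last-Modified header and the time of the spider's previous visit. A rough illustration only, not how Slurp or any other spider is known to work; the function name `changed_since` and all dates are made up:

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def changed_since(last_modified_header: str, last_crawl: datetime) -> bool:
    """True if the Last-Modified header is newer than the previous crawl."""
    modified = parsedate_to_datetime(last_modified_header)
    return modified > last_crawl


# Hypothetical previous visit by the spider.
last_crawl = datetime(2004, 9, 20, tzinfo=timezone.utc)

# A page last edited before the visit: nothing new to fetch.
print(changed_since("Mon, 13 Sep 2004 08:00:00 GMT", last_crawl))
# A page edited after the visit: worth crawling again.
print(changed_since("Mon, 27 Sep 2004 08:00:00 GMT", last_crawl))
```

In practice a crawler can let the server do this comparison by sending an If-Modified-Since request header and treating a 304 Not Modified response as "nothing changed".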

minstrel, Sep 27, 2004

gullam18 Peon

Messages:: 4

Likes Received:: 0

Best Answers:: 0

Trophy Points:: 0

#20

Hi all,

My site is http://www.adps-domain.com and the pages below are not indexed by Google:

http://adps-domain.com/osComm/domain_name_registration.php

http://www.adps-domain.com/osComm/sitemap.php

Google is indexing my top page only; it omits the deeper pages.

Is there anything special I need to do to get the deeper pages indexed in Google?

Please help. Thanks in advance.

gullam18, Oct 29, 2004