Would this be enough to get you banned for spam or for another reason? Say your site was www.mydomain.com and you had www.mydomain.com/robots.txt pointing to your index page, so that anyone who typed it in would see your home page. Is this bad?
Who knows about robots.txt files other than webmasters or search engines? What would be the advantage of such? Shannon
I am thinking it was done by mistake, but when you enter the domain plus the robots.txt you see the index page.
That is indeed the question... I did participate in a forum investigation a few months back (another forum) of a website which "vanished" from Google for no obvious reason other than that the robots.txt file was a copy of the index page (or redirected to the index page). I don't know that the "funny" robots.txt file was the problem, but it was the only obviously strange thing about the site. So go back to Smyrl's question: I don't see what you have to gain by doing this, and it may have a disadvantage. So: no real advantage -- uncertain but possible disadvantage. Why take the chance?
What happened was they removed the robots.txt file, so if you type the URL in you are redirected. What I am now wondering is whether this would in fact cause spiders to skip off. Do they come in looking for www.mydomain.com/robots.txt, or would they already be on the server and just look for the robots.txt file, and when they don't find one, continue to crawl?
I don't know -- spiders aren't very bright -- really, all they know how to do is follow links. On the other hand, look at that: "I don't know". And again ask, "do I want to take the risk?". If you no longer want a robots.txt file, for whatever reason, you can delete it or have a file that just says

    User-agent: *
    Disallow:

That way, there is no risk. Also, when you say the robots.txt file "redirects" to the home page, how exactly is that done? As an htaccess redirect? Or from the robots.txt file itself? Is there still a file titled "robots.txt" in the root directory?
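To illustrate the "no risk" point, here is a minimal sketch using Python's standard `urllib.robotparser` to see how a parser reads the allow-everything file versus an empty one. The helper name `allowed` and the example URL are made up for illustration, and real spiders may use their own parsers, so treat this as an approximation of their behaviour rather than a guarantee.

```python
from urllib.robotparser import RobotFileParser

# The "allow everything" file suggested above.
ALLOW_ALL = "User-agent: *\nDisallow:\n"

# An empty robots.txt, for comparison.
EMPTY = ""

# A file that really does block everything, as a contrast.
BLOCK_ALL = "User-agent: *\nDisallow: /\n"


def allowed(robots_txt: str, agent: str, url: str) -> bool:
    """Return True if `agent` may fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


page = "http://www.mydomain.com/some/page.html"  # hypothetical URL
for name, text in [("allow-all", ALLOW_ALL), ("empty", EMPTY), ("block-all", BLOCK_ALL)]:
    print(name, allowed(text, "Slurp", page))
```

Under this parser, the allow-all file and the empty file both permit crawling, and only the `Disallow: /` version blocks it.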
Thanks Minstrel. There is not a robots.txt file in the directory. When you type www.mydomain.com/robots.txt you are sent to what is a copy of the homepage. This is done by mod_rewrite, I believe. I am not exactly sure how it is done, but you can type anything after the .com and it will take you to that page, so I am assuming mod_rewrite. I am not the webmaster, just trying to help out. I have let them know that they need to put the robots.txt back, but they said they removed it because it was causing one of the big three bots to hit the .txt file and then move on. They said there was nothing in there banning anything, and here is what I believe it was:

    #User-agent: lycra
    #Disallow: /
    #User-agent: *
    #Disallow: /tmp
    #Disallow: /logs
    User-agent: *
    Disallow:

So, not too sure why they would turn away from that. Maybe someone here can help.
Sounds to me like you may have a custom 404 page which redirects to the homepage, so when the bots come by and attempt to read the robots file instead they are served your home page, which at best is going to be confusing to them, and they will request the robots file every time they come to your site. There is no reason why a generic robots.txt file which allows spidering of everything would cause any spider to leave, or for that matter a blank robots.txt file. But since it is not a mindboggling exercise to put in a good robots.txt file which may well save you from having to entertain all the email harvesters who come by, why not set things up right?
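One way to check for this situation is to look at what the server actually returns for /robots.txt. The sketch below is only a rough heuristic, and the function names `looks_like_robots_txt` and `check_site` are hypothetical. It treats anything other than a 200 response with a non-HTML body as "not a real robots.txt". Note that `urlopen` follows redirects, so a redirect to the homepage would show up as a 200 with an HTML body.

```python
import urllib.error
import urllib.request


def looks_like_robots_txt(status: int, content_type: str, body: str) -> bool:
    """Heuristic: a real robots.txt is a 200 response that is not an HTML page."""
    if status != 200:
        return False
    if "html" in content_type.lower():
        return False
    head = body.lstrip().lower()
    # A homepage served in place of robots.txt usually starts with markup.
    if head.startswith("<!doctype") or head.startswith("<html"):
        return False
    return True


def check_site(base_url: str) -> bool:
    """Fetch <base_url>/robots.txt and apply the heuristic (needs network access)."""
    url = base_url.rstrip("/") + "/robots.txt"
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            return looks_like_robots_txt(
                resp.status,
                resp.headers.get("Content-Type", ""),
                resp.read(512).decode("utf-8", "replace"),
            )
    except urllib.error.HTTPError as err:
        # A plain 404 lands here; there is no robots.txt, but nothing confusing either.
        return looks_like_robots_txt(err.code, "", "")


# Offline examples of the two cases discussed in this thread:
print(looks_like_robots_txt(200, "text/plain", "User-agent: *\nDisallow:\n"))
print(looks_like_robots_txt(200, "text/html", "<html><body>Home page</body></html>"))
```

Calling `check_site("http://www.mydomain.com")` on the site in question would show whether the rewrite rule is handing the homepage to anything that asks for robots.txt.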
    #User-agent: lycra
    #Disallow: /
    #User-agent: *
    #Disallow: /tmp
    #Disallow: /logs
    User-agent: *
    Disallow:

is not in the recommended order... should have been

    #User-agent: *
    #Disallow: /tmp
    #Disallow: /logs
    #User-agent: lycra
    #Disallow: /

That, by the way, is telling "lycra" to go away and not have access to anything -- was that what they wanted? Honestly, I would tell them to delete the redirect and either correct the robots.txt file to do what they want it to do or just delete it -- the only ones getting 404 errors would be spiders and they don't care -- if they don't find an instruction to "not spider" they will spider.
Doh! Sharp eyes, there, NewComputer -- I didn't even see it. Yep. The "#" is a comment, so all of that should be ignored, except for

    User-agent: *
    Disallow:

which is fine because it says "spider everything".
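The effect of the "#" characters can be checked with a short sketch, again using Python's standard `urllib.robotparser` as a stand-in for a real spider's parser (an assumption, since each search engine has its own). The `UNCOMMENTED` variant and the test URLs are made up here to show what the file would do if the comment markers were removed.

```python
from urllib.robotparser import RobotFileParser

# The file as quoted in the thread: every rule but the last pair is commented out.
QUOTED = """\
#User-agent: lycra
#Disallow: /
#User-agent: *
#Disallow: /tmp
#Disallow: /logs
User-agent: *
Disallow:
"""

# Hypothetical variant with the "#" markers removed, for contrast.
UNCOMMENTED = """\
User-agent: *
Disallow: /tmp
Disallow: /logs

User-agent: lycra
Disallow: /
"""


def allowed(robots_txt: str, agent: str, url: str) -> bool:
    """Return True if `agent` may fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


home = "http://www.mydomain.com/index.html"  # hypothetical URLs
tmp = "http://www.mydomain.com/tmp/file.html"

print(allowed(QUOTED, "lycra", home), allowed(QUOTED, "Slurp", tmp))
print(allowed(UNCOMMENTED, "lycra", home), allowed(UNCOMMENTED, "Slurp", tmp))
```

With the comments in place nothing is blocked for anyone. Only the uncommented variant shuts out "lycra" and keeps other agents out of /tmp and /logs.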
Yeah, but this still does not explain why Yahoo! (Inktomi/Slurp) is hitting the robots.txt file and then leaving without spidering. Can anyone think of an answer?
Might just be waiting before they do a big crawl... They don't always do the whole lot in one go. If the robots file was the problem, it could take a few hours/days before they start crawling the whole site again.
Thanks SE, the robots.txt file was removed. This leads me to believe that the bots will look for the robots.txt file and, when they don't find it, will see in the meta tags that they may spider everything, so they should resume. We'll see. No action yet today.
1. Slurp is weird. Even if it does spider your site, it won't necessarily make a lot of difference in terms of showing up in Yahoo.
2. Is your site updated/modified regularly? If not, it may be that Slurp comes in, checks for robots.txt exclusions, and then gets the headers looking for last modified date. If it hasn't changed, it will go away (this isn't specific to Slurp, by the way).
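The last-modified check in point 2 can be pictured roughly like this. It is only a sketch of the idea, assuming the spider compares the page's `Last-Modified` header against the time of its previous visit; the function name `unchanged_since` and the dates are invented for illustration, and how Slurp actually decides is not known here.

```python
from email.utils import parsedate_to_datetime


def unchanged_since(last_modified: str, last_crawl: str) -> bool:
    """True if the page's Last-Modified date is no later than the previous crawl.

    Both arguments are HTTP-style dates, e.g. "Mon, 03 Jan 2005 10:00:00 GMT".
    """
    return parsedate_to_datetime(last_modified) <= parsedate_to_datetime(last_crawl)


# Page last changed a week before the previous crawl: nothing new to fetch.
print(unchanged_since("Mon, 03 Jan 2005 10:00:00 GMT", "Mon, 10 Jan 2005 10:00:00 GMT"))

# Page changed after the previous crawl: worth fetching again.
print(unchanged_since("Mon, 10 Jan 2005 10:00:00 GMT", "Mon, 03 Jan 2005 10:00:00 GMT"))
```

A site whose pages never change would keep hitting the first case, which would look in the logs like a bot that arrives, reads robots.txt and a few headers, and leaves.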
Hi all, my site is http://www.adps-domain.com and the pages below are not indexed by Google:

http://adps-domain.com/osComm/domain_name_registration.php
http://www.adps-domain.com/osComm/sitemap.php

Google is indexing my top page only; it omits the deeper pages. Is there anything special I need to do to get the deeper pages indexed by Google? Please help. Thanks in advance.