View Full Version : robots.txt pointing to index page


NewComputer
Sep 26th 2004, 4:08 pm
Would this be enough to get you banned for spam or for another reason? Say your site was www. mydomain .com and you had www. mydomain. com/robots.txt pointing to your index page so that if someone typed it in they would see your home page. Is this bad?

Smyrl
Sep 26th 2004, 4:18 pm
Who knows about robots.txt files other than webmasters or search engines? What would be the advantage of such?

Shannon

NewComputer
Sep 26th 2004, 4:20 pm
I am thinking it was done by mistake, but when you enter the domain plus the robots.txt you see the index page.

NewComputer
Sep 26th 2004, 5:32 pm
No one has any input? I thought someone would have an idea here for sure....

dazzlindonna
Sep 26th 2004, 6:31 pm
It may be confusing the bots, and in that case, then yes, it would be bad.

NewComputer
Sep 26th 2004, 7:02 pm
Thanks Donna, worst case scenario is we change it and it does not change anything.

minstrel
Sep 26th 2004, 11:25 pm
What would be the purpose?
That is indeed the question...

I did participate in a forum investigation a few months back (another forum) of a website which "vanished" from Google for no obvious reason other than that the robots.txt file was a copy of the index page (or redirected to the index page). I don't know that the "funny" robots.txt file was the problem but it was the only obviously strange thing about the site.

So go back to Smyrl's question: I don't see what you have to gain by doing this, and it may have a disadvantage.

So: no real advantage -- uncertain but possible disadvantage.

Why take the chance?

NewComputer
Sep 27th 2004, 4:14 am
What happened was they removed the robots.txt file, so if you type the url in you are redirected. What I am now wondering is if this would in fact cause spiders to skip off. Do they come in looking for www. mydomain.com/robots.txt, or would they already be on the server and just look for the robots.txt file and, when they don't find one, continue to crawl?

minstrel
Sep 27th 2004, 5:57 am
I don't know -- spiders aren't very bright -- really, all they know how to do is follow links.

On the other hand, look at that: "I don't know". And again ask, "do I want to take the risk?".

If you no longer want a robots.txt file, for whatever reason, you can delete it or have a file that just says

User-agent: *
Disallow:

That way, there is no risk.

Also, when you say the robots.txt file "redirects" to the home page, how exactly is that done? As an htaccess redirect? or from the robots.txt file itself? Is there still a file titled "robots.txt" in the root directory?

NewComputer
Sep 27th 2004, 6:04 am
Thanks Minstrel,

There is not a robots.txt file in the directory. When you type www. mydomain.com/robots.txt you are sent to what is a copy of the homepage. This is done by a mod rewrite I believe. I am not exactly sure how it is done, but you can type anything after the .com and it will take you to that page so I am assuming mod rewrite. I am not the webmaster, just trying to help out. I have let them know that they need to put the robots.txt back, but they said they removed it because it was causing one of the big three bots to hit the .txt file and then move on. They said there was nothing in there banning anything and here is what I believe it was:

#User-agent: lycra
#Disallow: /

#User-agent: *
#Disallow: /tmp
#Disallow: /logs

User-agent: *
Disallow:

So, not too sure why they would turn away from that, maybe someone here can help.

Mel
Sep 27th 2004, 6:12 am
Sounds to me like you may have a custom 404 page which redirects to the homepage, so when the bots come by and attempt to read the robots file instead they are served your home page, which at best is going to be confusing to them, and they will request the robots file every time they come to your site.

There is no reason why a generic robots.txt file which allows spidering of everything would cause any spider to leave, or for that matter a blank robots.txt file.

But since it is not a mindboggling exercise to put in a good robots.txt file which may well save you from having to entertain all the email harvesters who come by, why not set things up right?
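
One way to tell whether a server is handing out the home page in place of robots.txt is to look at the status code and content type that come back for /robots.txt. Here is a minimal sketch in Python, assuming you have already fetched the response yourself. The function name and the category labels are hypothetical, and the HTML check is only a rough heuristic:

```python
def diagnose_robots_response(status, content_type, body):
    """Roughly classify what a server returned for /robots.txt.

    The labels returned here are made up for this sketch.
    """
    if status == 404:
        # No robots.txt file at all.
        return "missing"
    if status in (301, 302):
        # The request was bounced to another URL.
        return "redirected"
    if status == 200:
        # A real robots.txt is plain text; a home page is HTML.
        looks_like_html = "html" in content_type.lower() or "<html" in body.lower()
        return "html-instead-of-robots" if looks_like_html else "plain-text"
    return "other"


print(diagnose_robots_response(200, "text/html", "<html><body>Welcome</body></html>"))
print(diagnose_robots_response(200, "text/plain", "User-agent: *\nDisallow:"))
```

A result of "html-instead-of-robots" would match the situation described in this thread, where a rewrite or custom error page answers every unknown URL with a copy of the home page.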

minstrel
Sep 27th 2004, 6:29 am
#User-agent: lycra
#Disallow: /

#User-agent: *
#Disallow: /tmp
#Disallow: /logs

User-agent: *
Disallow:
is not in the recommended order... should have been

#User-agent: *
#Disallow: /tmp
#Disallow: /logs

#User-agent: lycra
#Disallow: /

That, by the way, is telling "lycra" to go away and not have access to anything -- was that what they wanted?

Honestly, I would tell them to delete the redirect and either correct the robots.txt file to do what they want it to do or just delete it -- the only ones getting 404 errors would be spiders and they don't care -- if they don't find an instruction to "not spider" they will spider.

NewComputer
Sep 27th 2004, 8:56 am
Isn't the # put there to indicate do not read?

minstrel
Sep 27th 2004, 9:11 am
Doh! Sharp eyes, there, NewComputer -- I didn't even see it.

Yep. The "#" is a comment, so all of that should be ignored, except for

User-agent: *
Disallow:

which is fine because it says "spider everything".
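
To see what is left once the comments are dropped, here is a small Python sketch. It assumes the simple rule that everything from a "#" to the end of the line is a comment; the helper name active_lines is made up for illustration, and real crawlers may differ in the details:

```python
ROBOTS_TXT = """\
#User-agent: lycra
#Disallow: /

#User-agent: *
#Disallow: /tmp
#Disallow: /logs

User-agent: *
Disallow:
"""


def active_lines(text):
    """Return the lines that remain after comments and blank lines are dropped."""
    kept = []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # cut at the first '#'
        if line:
            kept.append(line)
    return kept


print(active_lines(ROBOTS_TXT))  # ['User-agent: *', 'Disallow:']
```

Only the last two lines of the file survive, so the "lycra" block never takes effect.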

NewComputer
Sep 27th 2004, 9:38 am
Yeah, but this still does not explain why Yahoo! (Inktomi/Slurp) is hitting the robots.txt file and then leaving and not spidering. Can anyone think of an answer?

SEbasic
Sep 27th 2004, 9:40 am
Might just be waiting before they do a big crawl... They don't always do the whole lot in one go.

If the robots file was the problem, it could take a few hours/days before they start crawling the whole site again.

NewComputer
Sep 27th 2004, 9:42 am
Thanks SE, the robots.txt file was removed. This leads me to believe that the bots look for the robots.txt file and, when they don't find one, they will see in the meta tags to spider everything and should resume. We'll see. No action yet today.

SEbasic
Sep 27th 2004, 9:46 am
Well, good luck on it ;)

minstrel
Sep 27th 2004, 10:02 am
1. Slurp is weird. Even if it does spider your site, it won't necessarily make a lot of difference in terms of showing up in Yahoo.

2. Is your site updated/modified regularly? If not, it may be that Slurp comes in, checks for robots.txt exclusions, and then gets the headers looking for last modified date. If it hasn't changed, it will go away (this isn't specific to Slurp, by the way).

gullam18
Oct 29th 2004, 4:56 am
:) hi all,

my site is http://www.adps-domain.com

and

the below pages are not indexed by google:

http://adps-domain.com/osComm/domain_name_registration.php

http://www.adps-domain.com/osComm/sitemap.php


google is indexing my top page only :confused: , it omits the deeper pages.

Is there anything special I need to do to get the deeper pages indexed in Google?

plz need help?
Thanks in advance.

minstrel
Oct 29th 2004, 6:35 am
The main page was cached October 28, 2004 -- how long has this site been up?

I could find only 3 backlinks, 2 from forums.

If it is a new site, Google has found it and will eventually be back to try to spider the remaining pages. In the meantime, you need to submit to directories and see what backlinks you can arrange.

leeds1
Oct 29th 2004, 7:13 am
Doh! Sharp eyes, there, NewComputer -- I didn't even see it.

Yep. The "#" is a comment, so all of that should be ignored, except for

User-agent: *
Disallow:

which is fine because it says "spider everything".

Does that code above say to every user agent (i.e., every spider) "please do not come to my site" rather than "spider everything"?

I think that's why you are not showing - the code is incorrect

<edit> My mistake - I have just seen this on the robots site:

To exclude all robots from the entire server

User-agent: *
Disallow: /

To allow all robots complete access

User-agent: *
Disallow:

Or create an empty "/robots.txt" file.

To exclude all robots from part of the server

User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /private/

To exclude a single robot

User-agent: BadBot
Disallow: /

To allow a single robot

User-agent: WebCrawler
Disallow:

User-agent: *
Disallow: /

minstrel
Oct 29th 2004, 7:17 am
No, leeds1.... the line "Disallow: " with nothing after the ":" says "Disallow nothing", i.e., "spider everything".
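
The difference between an empty "Disallow:" and "Disallow: /" can be checked with Python's standard urllib.robotparser module. This is only a sketch of how that one parser reads the examples quoted above; the bot name "OtherBot", the test paths, and the helper name can_fetch are placeholders, and an individual search engine's spider may behave differently:

```python
from urllib.robotparser import RobotFileParser

EXCLUDE_ALL = "User-agent: *\nDisallow: /"
ALLOW_ALL = "User-agent: *\nDisallow:"
EXCLUDE_PART = "User-agent: *\nDisallow: /cgi-bin/\nDisallow: /tmp/\nDisallow: /private/"
ALLOW_ONE = "User-agent: WebCrawler\nDisallow:\n\nUser-agent: *\nDisallow: /"


def can_fetch(robots_txt, agent, path):
    """Parse robots.txt text and ask whether `agent` may request `path`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, path)


print(can_fetch(EXCLUDE_ALL, "OtherBot", "/index.html"))    # False
print(can_fetch(ALLOW_ALL, "OtherBot", "/index.html"))      # True
print(can_fetch(EXCLUDE_PART, "OtherBot", "/tmp/a.html"))   # False
print(can_fetch(EXCLUDE_PART, "OtherBot", "/index.html"))   # True
print(can_fetch(ALLOW_ONE, "WebCrawler", "/index.html"))    # True
print(can_fetch(ALLOW_ONE, "OtherBot", "/index.html"))      # False
```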

leeds1
Oct 29th 2004, 7:34 am
yep - saw that hence the follow up edit thingy

minstrel
Oct 29th 2004, 7:41 am
Ah, sorry, leeds1 -- I must have been posting while you were editing... when I replied that wasn't there... spooky :eek:

mim
Apr 19th 2005, 2:24 am
I have recently completed a makeover to a site which had excellent Google rating. I changed some of the page names to optimize them for keywords and used the .htaccess file to force the redirection of old pages to their new counterparts. Will this method of redirection have any effect on the ranking?

minstrel
Apr 19th 2005, 6:33 am
No. As long as it is clear that this is a permanent redirect (301), that is exactly what Google recommends:

Google Information for Webmasters (http://www.google.com/intl/en/webmasters/4.html#A1)
3. I'm changing my URL. How can I maintain my rank?
Regrettably, we cannot manually change your listed address at the same time you move to your new site.

That said, there are steps you can take to make sure your transition is a smooth one. Google listings are based in part on our ability to find you from links on other sites. To preserve your rank, you will want to inform others who link to you of your change of address. One way to find out who is linking to you is to try a link search. Enter "link:[your full URL]" into the Google search box. You may not find every page that links to you with this method, but it should help you begin redirecting the links leading to your site. (Please note: we do not serve link queries for all of the sites in our index, so this may not produce any results for your site.) Once your new site is live, you may wish to place a permanent redirect (using a "301" code in HTTP headers) on your old site to inform visitors and search engines that your site has moved.

Finally, if your site goes unlisted for a time, this does not mean you were dropped from our index. Sometimes, in these transitions, we will fail to find a site at its new address. Just be sure that others are linking to you and we should pick you up on our next web crawl.

mim
Apr 19th 2005, 3:03 pm
Thanks for your reply, but I am unclear on where the 301 code should go and the syntax required. Should it be in the header of the .htaccess file? The pages from the old site had different names, so they don't exist any longer. I first made a 404 error page which allowed them access to the new navigation, but this would be bypassed now with the redirect from the .htaccess.

noppid
Apr 19th 2005, 3:40 pm
After the rewrite rule you put [R=301] or [R=301,L] most likely.

minstrel
Apr 19th 2005, 6:47 pm
Place one of the following lines in your .htaccess file for each file that you've renamed:

Redirect 301 /oldpage.html http://www.yoursite.com/newpage.html
or

Redirect permanent /oldpage.html http://www.yoursite.com/newpage.html
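
If many pages were renamed, the lines can be generated rather than typed by hand. Here is a rough Python sketch that assumes you keep a mapping of old paths to new paths; the function name redirect_lines and the example paths are hypothetical, and www.yoursite.com is the same placeholder used above:

```python
def redirect_lines(renamed, site="http://www.yoursite.com"):
    """Build one 'Redirect 301' line per renamed page.

    `renamed` maps each old path to its new path on the same site.
    """
    lines = []
    for old_path, new_path in renamed.items():
        lines.append(f"Redirect 301 {old_path} {site}{new_path}")
    return lines


pages = {"/oldpage.html": "/newpage.html"}
print("\n".join(redirect_lines(pages)))
# Redirect 301 /oldpage.html http://www.yoursite.com/newpage.html
```

The output can then be pasted into the .htaccess file.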

Blogmaster
Apr 19th 2005, 7:11 pm
could that be the reason http://www.imgascot.com is indexed with the url as the title instead of the title tags?

minstrel
Apr 19th 2005, 7:23 pm
Are you using a redirect on that site, sitetutor?

I see this in Google: http://www.google.com/search?sourceid=navclient&ie=UTF-8&rls=GGLD,GGLD:2005-12,GGLD:en&q=site%3Awww%2Eimgascot%2Ecom

Results 1 - 2 of 2 from www.imgascot.com for . (0.21 seconds)

SBC webhosting.com | imgascot.com
Welcome to the Future Web site of ... imgascot.com. We're pleased you've
chosen SBC's webhosting.com for your Web hosting needs! ...
www.imgascot.com/ - Similar pages

www.imgascot.com/contact_us/
Similar pages


while the meta tags say this:

<title>IMG Ascot - Indianapolis Mortgage and Loan Specialists Serving Indiana</title>
<meta name="keywords" content="Indianapolis Mortgages, Indianapolis Indiana Mortgages, Indianapolis IN Mortgages, Indianapolis Mortgages and Loans, Indianapolis Indiana Mortgages and Loans, Indianapolis IN Mortgages and Loans, Indianapolis Home Loans, Indianapolis Indiana Home Loans, Indianapolis IN Home Loans, Indianapolis Loans, Indianapolis Indiana Loans, Indianapolis IN Loans">
<meta name="description" content="Indiana mortgage, Indianapolis Mortgage Specialists Serving the State on Indiana, ">
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" />

So the "title" Google uses, if it IS reflecting the fact that it's a redirect, is interesting -- I've never seen that before.

Blogmaster
Apr 19th 2005, 7:24 pm
Minstrel, sorry:
http://search.msn.com/results.aspx?FORM=MSNH&srch_type=0&q=indianapolis+mortgages
on msn

minstrel
Apr 19th 2005, 7:24 pm
Wait... the snippet is from an old version of the page... not the current version.

Did the page EVER have that as a title?

minstrel
Apr 19th 2005, 7:26 pm
Oh. I'm not sure what that means for MSN search but Google does that for newly found but not yet fully spidered sites.

How new IS this site?

DOES it use a redirect?

I'm still interested in that Google result...

noppid
Apr 19th 2005, 7:27 pm
That looks like a webhost's temp page when you activate an account. Your site seems to have been indexed before it was set up, with the content Minstrel shows.

Blogmaster
Apr 19th 2005, 7:29 pm
no, we took the site over as a client but it has been indexed for a few months ... it seems like there is a robots.txt file in the way

noppid
Apr 19th 2005, 7:31 pm
no, we took the site over as a client but it has been indexed for a few months ... it seems like there is a robots.txt file in the way

It is ... http://www.imgascot.com/robots.txt

# robots, nothing to see here. keep on move n...

User-agent: *

disallow /

and it's written wrong too.

minstrel
Apr 19th 2005, 7:45 pm
Hey, good find there, noppid! That robots.txt file is telling Google (and all other spiders) not to index anything on the site... assuming the bots can figure it out since, as noppid noted, the syntax is incorrect.

Change this:

# robots, nothing to see here. keep on move n...

User-agent: *

disallow /

to this

User-agent: *
Disallow:

or just delete the file.
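
A quick syntax check would have caught the missing colon. Here is a minimal Python sketch, assuming every non-comment line should have the form "Field: value" and checking only the two field names used in this thread; the function name lint_robots and its messages are made up for illustration:

```python
KNOWN_FIELDS = {"user-agent", "disallow"}


def lint_robots(text):
    """Return (line_number, message) pairs for lines that look malformed."""
    problems = []
    for number, raw in enumerate(text.splitlines(), start=1):
        line = raw.split("#", 1)[0].strip()  # ignore comments
        if not line:
            continue
        field, colon, _value = line.partition(":")
        if not colon:
            problems.append((number, "missing ':' after field name"))
        elif field.strip().lower() not in KNOWN_FIELDS:
            problems.append((number, "unrecognised field " + repr(field.strip())))
    return problems


BROKEN = "\n".join([
    "# robots, nothing to see here. keep on move n...",
    "",
    "User-agent: *",
    "",
    "disallow /",
])
FIXED = "User-agent: *\nDisallow:"

print(lint_robots(BROKEN))  # [(5, "missing ':' after field name")]
print(lint_robots(FIXED))   # []
```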

Blogmaster
Apr 19th 2005, 9:13 pm
Thank you guys, deleted that garbage!