Hi, I have just found that the following shouldn't be used:

User-agent: *
Disallow:

It is a robots.txt file that allows everything, but apparently some search engines may misread it as banning every robot. Is this true? I have a robots.txt file like the above in a lot of my directories. I read about it here: http://www.seoconsultants.com/robots-text-file/#not-recommended

So do you think that instead of using a robots.txt file with "User-agent: * / Disallow:" in it, I might as well just not have a robots.txt file at all? Could some robots/search engines think I don't want them to crawl my site because of it? Please advise me on this as I really want to know. Thanks!
Correct. If you aren't going to disallow anything, just use a blank robots.txt (to avoid the 404) or none at all. No need to risk anything, as you say.
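For reference, the allow-all form in question and the block-all form differ by a single character, which is probably why some parsers get them confused; a minimal sketch:

```
# Allows everything (the form being discussed):
User-agent: *
Disallow:

# Blocks everything -- the only difference is the "/":
User-agent: *
Disallow: /
```

An empty (zero-byte) robots.txt is also treated as allow-all and still avoids the 404.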
I have just looked through Yahoo and MSN and it looks as if they have not crawled my site properly, especially Yahoo. A lot of the information in Yahoo is old. So using a blank robots.txt file is OK then? I prefer to have one so that I don't get 404 errors all the time.
Consider using Google Sitemaps and Yahoo Feeds. Robots.txt is primarily to tell them what NOT to do. Sitemaps can guide them to where you want them.
Here I go again! I have a product search engine where my main aim is to sell products for merchants, not to give their sites PR. I use a click.php script so that I can keep track of the clicks and also redirect to the merchant's site.

I have found that this click.php file is being indexed in the search engines for every product, which could be bad: as soon as someone clicks the listing it redirects straight to the merchant's site, which may make it look like a doorway page or something. I am probably also leaking a lot of PR to these merchant sites, as the search engines crawl the links and pass PR to the merchants.

I was thinking that if I use a robots.txt file to stop the search engines from crawling the click.php file, then it will not be listed in the search engines. Will it also mean that I will not lose any PR, since the robots will not follow the links to the merchants' sites?
There's a difference between not being crawled and not showing up in the listings. Blocking something in robots.txt doesn't mean the search engine will deny the existence of that link; it just won't use its content.
Is there any way I can block the existence of that link, then, to stop them gaining PR? I need to stop them gaining PR because I am promoting their products; it is not meant for them to gain PR, but for me to sell their products. What about using rel=nofollow?
Yes, nofollow can block PR, but the links will still show up. What you can do is 'cloak' by referrer. click.php should only ever be accessed from your own site, so in PHP or another scripting language you can check whether the request came from your domain and, if not, show a 404 or 301 to the homepage. That form of cloaking is allowed because you are not discriminating between end users and SE bots, only by referrer.
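A sketch of that referrer check, with example.com standing in for your domain (shown in Python for illustration; in PHP the same test would run on $_SERVER['HTTP_REFERER']):

```python
from urllib.parse import urlparse

def referer_allowed(referer, allowed_host="example.com"):
    """Return True only if the Referer header points at our own domain.

    Bots and direct visits send no referrer, so they never reach the
    redirect and get the 404/301 instead.
    """
    if not referer:
        return False
    host = urlparse(referer).hostname or ""
    return host == allowed_host or host.endswith("." + allowed_host)
```

Note that hostname matching is done on the parsed URL, not with a substring search, so a foreign URL that merely mentions your domain in its query string does not pass.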
Do you think I could get banned from the search engines or penalized in any way for allowing the click.php script to get listed in Google? It is listed there for each product, and when someone clicks on the link it redirects straight to the merchant's site without ever going to mine.
Yes, I notice that a lot of sites no longer use, or don't even know to have, a robots.txt file, which racks up "unfair" 404s against your site. Many bots ignore the file but still register a 404 if they cannot locate one; the same is true for the favicon.ico file. Better to have one than not, IMO.
I doubt it. But the fact is, it's a useless link, so it's in their interest for it to be removed. It might be easier to control the situation if you put click.php in a subfolder and block that folder. But I'd go with the referrer checks. You can also add a token as a parameter with a simple script and only redirect valid tokens that are, say, under 60 seconds old; if a token is not valid, redirect to the homepage. Quite a few options for you.
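The token idea can be sketched like this (Python for illustration; SECRET and the function names are made up for the example): sign the product id together with a timestamp, and only honour signatures less than 60 seconds old.

```python
import hashlib
import hmac
import time

SECRET = b"change-me"  # assumption: a private key known only to your server

def make_token(product_id, now=None):
    """Issue a click token: a timestamp plus an HMAC over id and timestamp."""
    ts = int(now if now is not None else time.time())
    sig = hmac.new(SECRET, f"{product_id}:{ts}".encode(), hashlib.sha256).hexdigest()
    return f"{ts}.{sig}"

def token_valid(product_id, token, max_age=60, now=None):
    """Redirect only if the signature matches and the token is fresh enough."""
    try:
        ts_str, sig = token.split(".", 1)
        ts = int(ts_str)
    except ValueError:
        return False
    expected = hmac.new(SECRET, f"{product_id}:{ts}".encode(), hashlib.sha256).hexdigest()
    age = (now if now is not None else time.time()) - ts
    return hmac.compare_digest(sig, expected) and 0 <= age <= max_age
```

Because the token is signed, a search engine that stores the crawled URL ends up with an expired link that just bounces to your homepage.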
The thing is, it actually ranks higher than some of my main pages, and I have got some sales this way just because Google referred the traffic to the click.php script and it redirected straight away to the merchant's site. I have just read that passing PR to these merchant sites doesn't make my site or pages lose any PR, so I don't have to worry about that. But I am still worried about all these click.php listings being in Google; I don't want to get banned.

So couldn't I just use the robots.txt file and have:

Disallow: /click.php

Wouldn't that stop them listing the click.php page, or does it have to be in a folder of its own with a disallow on that folder/directory?

Also, do you think it is necessary to put your include files into the robots.txt file, or isn't it really needed? The includes I am talking about are the PHP files for connecting to the database and so on.
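For what it's worth, a Disallow line only takes effect underneath a User-agent line, so the complete file would be:

```
User-agent: *
Disallow: /click.php
```

Disallow rules match by path prefix, so this also covers URLs with query strings such as /click.php?id=123.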
In my experience, just blocking that file will not get the pages out of the SERPs. It made my titles and snippets go away, but the links remained.
I have just gone and blocked that file on all my sites now. Let's just see how it goes. I doubt it will disappear, just as you said.
What happens if spiders don't obey the file? And how do you test whether your robots.txt is working properly?
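Only polite robots obey the file at all; misbehaving spiders have to be blocked server-side (by user-agent or IP). To test the rules themselves, one option is the robots.txt parser in Python's standard library; a sketch using the Disallow: /click.php rule discussed in this thread:

```python
from urllib.robotparser import RobotFileParser

# Feed the rules in directly instead of fetching them, so this runs offline.
rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /click.php",
])

# can_fetch(useragent, url) answers: "would a polite bot crawl this URL?"
print(rp.can_fetch("*", "http://example.com/click.php?id=5"))  # False
print(rp.can_fetch("*", "http://example.com/products.html"))   # True
```

To test against the live file, `rp.set_url("http://yoursite.com/robots.txt")` followed by `rp.read()` fetches and parses it the same way.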
This is not true. There is no need to change it: your robots.txt is perfect. It allows all robots and it will be understood by all polite robots. Jean-Luc
I prefer to just leave the robots.txt file blank. If a robot doesn't understand it properly, I could get de-indexed, which would mean less traffic, or none at all, from some search engines.