robots.txt pointing to index page

Discussion in 'robots.txt' started by NewComputer, Sep 26, 2004.

  1. #1
    Would this be enough to get you banned for spam or for another reason? Say your site was www. mydomain .com and you had www. mydomain. com/robots.txt pointing to your index page so that if someone typed it in they would see your home page. Is this bad?
     
    NewComputer, Sep 26, 2004 IP
  2. Smyrl

    Smyrl Tomato Republic Staff

    Messages:
    13,740
    Likes Received:
    1,702
    Best Answers:
    78
    Trophy Points:
    510
    #2
    Who knows about robots.txt files other than webmasters or search engines? What would be the advantage of such?

    Shannon
     
    Smyrl, Sep 26, 2004 IP
  3. NewComputer

    NewComputer Well-Known Member

    Messages:
    2,021
    Likes Received:
    68
    Best Answers:
    0
    Trophy Points:
    188
    #3
    I am thinking it was done by mistake, but when you enter the domain plus the robots.txt you see the index page.
     
    NewComputer, Sep 26, 2004 IP
  4. NewComputer

    NewComputer Well-Known Member

    #4
    No one has any input? I thought someone would have an idea here for sure....
     
    NewComputer, Sep 26, 2004 IP
  5. dazzlindonna

    dazzlindonna Peon

    Messages:
    553
    Likes Received:
    21
    Best Answers:
    0
    Trophy Points:
    0
    #5
    It may be confusing the bots, and in that case, then yes, it would be bad.
     
    dazzlindonna, Sep 26, 2004 IP
  6. NewComputer

    NewComputer Well-Known Member

    #6
    Thanks Donna, worst case scenario is we change it and it does not change anything.
     
    NewComputer, Sep 26, 2004 IP
  7. minstrel

    minstrel Illustrious Member

    Messages:
    15,082
    Likes Received:
    1,243
    Best Answers:
    0
    Trophy Points:
    480
    #7
    That is indeed the question...

    I did participate in a forum investigation a few months back (another forum) of a website which "vanished" from Google for no obvious reason other than that the robots.txt file was a copy of the index page (or redirected to the index page). I don't know that the "funny" robots.txt file was the problem but it was the only obviously strange thing about the site.

    So go back to Smyrl's question: I don't see what you have to gain by doing this, and it may have a disadvantage.

    So: no real advantage -- uncertain but possible disadvantage.

    Why take the chance?
     
    minstrel, Sep 26, 2004 IP
  8. NewComputer

    NewComputer Well-Known Member

    #8
    What happened was they removed the robots.txt file, so if you type the URL in you are redirected. What I am now wondering is whether this would in fact cause spiders to skip off. Do they come in looking for www. mydomain.com/robots.txt, or would they already be on the server and just look for the robots.txt file and, when they don't find one, continue to crawl?
     
    NewComputer, Sep 27, 2004 IP
  9. minstrel

    minstrel Illustrious Member

    #9
    I don't know -- spiders aren't very bright -- really, all they know how to do is follow links.

    On the other hand, look at that: "I don't know". And again ask, "do I want to take the risk?".

    If you no longer want a robots.txt file, for whatever reason, you can delete it or have a file that just says
    User-agent: *
    Disallow:
    That way, there is no risk.
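
    As a quick sanity check, the allow-everything file above can be run through Python's standard urllib.robotparser module. This is only a sketch: the bot name "ExampleBot", the paths, and the helper name is_allowed are made up for illustration, and real crawlers may not parse exactly the way Python's parser does.

```python
from urllib.robotparser import RobotFileParser

# The "allow everything" file from the post: an empty Disallow blocks nothing.
ALLOW_ALL = """\
User-agent: *
Disallow:
"""

# For contrast, a file that blocks the whole site for every bot.
BLOCK_ALL = """\
User-agent: *
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, path: str) -> bool:
    """Return True if user_agent may fetch path under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, path)


# "ExampleBot" and the paths are hypothetical, for illustration only.
print(is_allowed(ALLOW_ALL, "ExampleBot", "/any/page.html"))  # True
print(is_allowed(BLOCK_ALL, "ExampleBot", "/any/page.html"))  # False
```

    The only difference between the two files is the single "/" after Disallow, which is why an empty Disallow line is the safe way to say "no restrictions".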

    Also, when you say the robots.txt file "redirects" to the home page, how exactly is that done? As an htaccess redirect? or from the robots.txt file itself? Is there still a file titled "robots.txt" in the root directory?
     
    minstrel, Sep 27, 2004 IP
  10. NewComputer

    NewComputer Well-Known Member

    #10
    Thanks Minstrel,

    There is no robots.txt file in the directory. When you type www. mydomain.com/robots.txt you are sent to what is a copy of the homepage. This is done by mod_rewrite, I believe. I am not exactly sure how it is done, but you can type anything after the .com and it will take you to that page, so I am assuming mod_rewrite. I am not the webmaster, just trying to help out. I have let them know that they need to put the robots.txt back, but they said they removed it because it was causing one of the big three bots to hit the .txt file and then move on. They said there was nothing in there banning anything, and here is what I believe it was:

    #User-agent: lycra
    #Disallow: /

    #User-agent: *
    #Disallow: /tmp
    #Disallow: /logs

    User-agent: *
    Disallow:

    So, not too sure why they would turn away from that; maybe someone here can help.
     
    NewComputer, Sep 27, 2004 IP
  11. Mel

    Mel Peon

    Messages:
    369
    Likes Received:
    14
    Best Answers:
    0
    Trophy Points:
    0
    #11
    Sounds to me like you may have a custom 404 page which redirects to the homepage, so when the bots come by and attempt to read the robots file instead they are served your home page, which at best is going to be confusing to them, and they will request the robots file every time they come to your site.
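
    One rough way to spot the situation described above is to look at what the server actually returns for /robots.txt. The sketch below is a heuristic only: the function name, the labels it returns, and the sample inputs are all hypothetical, and it works on values you have already fetched (status code, Content-Type header, body) rather than making a request itself.

```python
def diagnose_robots_response(status: int, content_type: str, body: str) -> str:
    """Classify what a server returned for a request to /robots.txt."""
    if status == 404:
        return "missing"  # no robots.txt at all
    if status != 200:
        return "unexpected-status"

    text = body.lstrip().lower()
    looks_like_html = (
        "html" in content_type.lower()
        or text.startswith("<!doctype")
        or text.startswith("<html")
    )
    if looks_like_html:
        # For example, a homepage served by a catch-all rewrite or custom 404.
        return "html-instead-of-robots"
    return "plain-robots"


# Hypothetical sample responses, for illustration only.
print(diagnose_robots_response(200, "text/html", "<html><body>Welcome</body></html>"))
print(diagnose_robots_response(200, "text/plain", "User-agent: *\nDisallow:\n"))
```

    A result of "html-instead-of-robots" would match the symptom in this thread: the URL answers with a 200, but the content is a web page rather than crawl rules.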

    There is no reason why a generic robots.txt file which allows spidering of everything would cause any spider to leave, or for that matter a blank robots.txt file.

    But since it is not a mindboggling exercise to put in a good robots.txt file which may well save you from having to entertain all the email harvesters who come by, why not set things up right?
     
    Mel, Sep 27, 2004 IP
  12. minstrel

    minstrel Illustrious Member

    #12
    #User-agent: lycra
    #Disallow: /
    
    #User-agent: *
    #Disallow: /tmp
    #Disallow: /logs
    
    User-agent: *
    Disallow:
    is not in the recommended order... should have been
    #User-agent: *
    #Disallow: /tmp
    #Disallow: /logs
    
    #User-agent: lycra
    #Disallow: /
    That, by the way, is telling "lycra" to go away and not have access to anything -- was that what they wanted?

    Honestly, I would tell them to delete the redirect and either correct the robots.txt file to do what they want it to do or just delete it -- the only ones getting 404 errors would be spiders and they don't care -- if they don't find an instruction to "not spider" they will spider.
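
    To illustrate the point about 404s: Python's standard urllib.robotparser module, for one, treats a 401 or 403 on robots.txt as "stay out" and any other 4xx (including 404) as "no restrictions". The sketch below mirrors that convention. The function name and the labels it returns are made up for illustration, and other crawlers may behave differently.

```python
def crawl_policy_for_status(status: int) -> str:
    """Map the HTTP status of a /robots.txt request to a crawl decision.

    Mirrors the convention used by Python's urllib.robotparser; this is an
    assumption about crawlers in general, not a rule every bot follows.
    """
    if status in (401, 403):
        return "disallow-all"  # access denied: treat the whole site as off-limits
    if 400 <= status < 500:
        return "allow-all"     # e.g. 404: no robots.txt, so no restrictions
    if 200 <= status < 300:
        return "parse-file"    # a file was returned: read its rules
    return "undefined"         # redirects and server errors are left open here


for code in (200, 403, 404):
    print(code, crawl_policy_for_status(code))
```

    Under this convention a plain 404 is harmless, which is why deleting the file outright is less risky than serving a page of HTML in its place.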
     
    minstrel, Sep 27, 2004 IP
  13. NewComputer

    NewComputer Well-Known Member

    #13
    Isn't the # put there to indicate do not read?
     
    NewComputer, Sep 27, 2004 IP
  14. minstrel

    minstrel Illustrious Member

    #14
    Doh! Sharp eyes, there, NewComputer -- I didn't even see it.

    Yep. The "#" is a comment, so all of that should be ignored, except for
    User-agent: *
    Disallow:
    which is fine because it says "spider everything".
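
    The effect of the "#" can be checked with Python's standard urllib.robotparser module. In this sketch, COMMENTED is the file as posted in the thread, while UNCOMMENTED_LYCRA is a hypothetical variant with the "#" removed from the first group only. The helper name can_fetch and the test paths are also made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# The file as posted: every group except the last is commented out.
COMMENTED = """\
#User-agent: lycra
#Disallow: /

#User-agent: *
#Disallow: /tmp
#Disallow: /logs

User-agent: *
Disallow:
"""

# Hypothetical variant: the "#" is removed from the first group only.
UNCOMMENTED_LYCRA = """\
User-agent: lycra
Disallow: /

User-agent: *
Disallow:
"""


def can_fetch(robots_txt: str, user_agent: str, path: str) -> bool:
    """Parse robots.txt text and ask whether user_agent may fetch path."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, path)


print(can_fetch(COMMENTED, "lycra", "/tmp/file.html"))          # True
print(can_fetch(UNCOMMENTED_LYCRA, "lycra", "/tmp/file.html"))  # False
print(can_fetch(UNCOMMENTED_LYCRA, "Slurp", "/tmp/file.html"))  # True
```

    With the comments in place, the "lycra" and "/tmp" rules have no effect at all, so the posted file is equivalent to the two-line allow-everything version.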
     
    minstrel, Sep 27, 2004 IP
  15. NewComputer

    NewComputer Well-Known Member

    #15
    Yeah, but this still does not explain why Yahoo! (Inktomi/Slurp) is hitting the robots.txt file and then leaving without spidering. Can anyone think of an answer?
     
    NewComputer, Sep 27, 2004 IP
  16. SEbasic

    SEbasic Peon

    Messages:
    6,317
    Likes Received:
    318
    Best Answers:
    0
    Trophy Points:
    0
    #16
    Might just be waiting before they do a big crawl... They don't always do the whole lot in one go.

    If the robots file was the problem, it could take a few hours/days before they start crawling the whole site again.
     
    SEbasic, Sep 27, 2004 IP
  17. NewComputer

    NewComputer Well-Known Member

    #17
    Thanks SE, the robots.txt file was removed. This leads me to believe that the bots will look for the robots.txt file and, when they don't find it, will see in the meta tags that they can spider everything, so they should resume. We'll see. No action yet today.
     
    NewComputer, Sep 27, 2004 IP
  18. SEbasic

    SEbasic Peon

    #18
    Well, good luck on it ;)
     
    SEbasic, Sep 27, 2004 IP
  19. minstrel

    minstrel Illustrious Member

    #19
    1. Slurp is weird. Even if it does spider your site, it won't necessarily make a lot of difference in terms of showing up in Yahoo.

    2. Is your site updated/modified regularly? If not, it may be that Slurp comes in, checks for robots.txt exclusions, and then gets the headers looking for last modified date. If it hasn't changed, it will go away (this isn't specific to Slurp, by the way).
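
    Whether Slurp really works this way is a guess, but the general idea of the check in point 2 can be sketched as a comparison between the page's Last-Modified header and the time of the previous visit. The function name and the sample dates below are hypothetical, chosen only for illustration.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def changed_since(last_modified_header: str, last_crawl: datetime) -> bool:
    """Return True if the Last-Modified header is later than the previous crawl.

    last_crawl must be a timezone-aware datetime.
    """
    modified = parsedate_to_datetime(last_modified_header)
    if modified.tzinfo is None:
        # Some headers carry no usable zone; assume UTC in that case.
        modified = modified.replace(tzinfo=timezone.utc)
    return modified > last_crawl


# Hypothetical values: the page was last changed on Sep 27, 2004.
header = "Mon, 27 Sep 2004 10:00:00 GMT"
print(changed_since(header, datetime(2004, 9, 26, tzinfo=timezone.utc)))  # True
print(changed_since(header, datetime(2004, 9, 28, tzinfo=timezone.utc)))  # False
```

    A bot that gets False here has no new content to fetch, which would look in the logs like a visit that stops right after the robots.txt and header requests.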
     
    minstrel, Sep 27, 2004 IP
  20. gullam18

    gullam18 Peon

    Messages:
    4
    Likes Received:
    0
    Best Answers:
    0
    Trophy Points:
    0
    #20
    Hi all,

    My site is http://www.adps-domain.com

    and the pages below are not indexed by Google:

    http://adps-domain.com/osComm/domain_name_registration.php

    http://www.adps-domain.com/osComm/sitemap.php

    Google is indexing only my top page; it omits the deeper pages.

    Is there anything special I need to do to get the deeper pages indexed in Google?

    Please help.
    Thanks in advance.
     
    gullam18, Oct 29, 2004 IP