Robots.txt Question

Discussion in 'Google' started by webmindz24, Aug 25, 2008.

  1. #1
    Hey All,

    I would like to know how I can block soft 404 pages on a site from being crawled by Google through robots.txt (or any other way, if one exists).

    As you may know, soft error pages can distort a web crawler's results. For example, instead of receiving a header that indicates a problem, the crawler receives a soft error page with return code 200, which indicates the successful download of a valid HTML page.

    To handle this situation, we can specify options for handling soft error pages when configuring the web crawler. For each website that returns soft error pages, the crawler needs to know the affected URL pattern.

    The syntax can be like this:
    http://www.examplesite.com/hr/*
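    If it helps, I imagine the robots.txt rule matching that pattern would look something like the below (assuming /hr/ is the directory in question; note that wildcard matching in robots.txt is a Google extension, not part of the original standard):

    ```
    User-agent: Googlebot
    Disallow: /hr/
    ```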


    Kindly explain the exact solution, with accurate robots.txt syntax.

    Thanks

     
    webmindz24, Aug 25, 2008 IP
  2. sweetfunny

    sweetfunny Banned

    Messages:
    5,743
    Likes Received:
    467
    Best Answers:
    0
    Trophy Points:
    0
    #2
    Robots.txt is for excluding pages; what you're asking about (soft/hard 404 pages) really has nothing to do with robots.txt.

    You do not want soft 404s; you only want hard 404s. That means having your webserver send a 404 response code when a page isn't found, instead of just doing a 301 redirect to the homepage.

    To see if your server/site is handling it right, put an invalid URL into a header response tool; if it comes back 404, you are set.
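    If you'd rather not use a tool, here's a quick self-contained sketch (assuming Python 3) that shows the difference: it spins up a tiny local server that sends a real 404 status for missing pages, then checks the status codes the way a crawler would. The handler class and `status_of` helper are just illustrative names.

    ```python
    import threading
    import urllib.error
    import urllib.request
    from http.server import BaseHTTPRequestHandler, HTTPServer

    class HardFourOhFourHandler(BaseHTTPRequestHandler):
        """Serves a homepage at / and a *hard* 404 for everything else."""

        def do_GET(self):
            if self.path == "/":
                body = b"homepage"
                self.send_response(200)
            else:
                body = b"Page Not Found"
                self.send_response(404)  # real 404 status, not a soft 200
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

        def log_message(self, *args):
            pass  # silence per-request logging

    def status_of(url):
        """Return the HTTP status code a crawler would see for this URL."""
        try:
            with urllib.request.urlopen(url) as resp:
                return resp.status
        except urllib.error.HTTPError as err:
            return err.code  # urlopen raises on 4xx/5xx; the code is on the error

    # Start the demo server on a free local port.
    server = HTTPServer(("127.0.0.1", 0), HardFourOhFourHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    base = f"http://127.0.0.1:{server.server_address[1]}"

    status_home = status_of(base + "/")
    status_missing = status_of(base + "/no-such-page")
    print(status_home, status_missing)  # prints "200 404"
    server.shutdown()
    ```

    A soft-404 site would print 200 for both URLs, which is exactly the problem being discussed.
    
    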
     
    sweetfunny, Aug 25, 2008 IP
  3. webmindz24

    webmindz24 Peon

    Messages:
    311
    Likes Received:
    3
    Best Answers:
    0
    Trophy Points:
    0
    #3

    I don't think you know enough about "soft 404s". A soft 404 is when, instead of returning a 404 response code for a non-existent URL, a website serves a 200 response code. The content of that 200 response is often the homepage of the site.

    So it's confusing for the Google crawler and for users too; the crawler can read it as duplicate content.
    How do I solve the problem?
     
    webmindz24, Aug 25, 2008 IP
  4. sweetfunny

    sweetfunny Banned

    #4
    Heck, I've been dealing with webservers and building sites since the '90s; I'd be stupid not to know about server response codes. And I'm not the one thinking it's related to robots.txt, remember.

    Put this in your .htaccess

    ErrorDocument 404 /404.html

    Make a page called 404.html telling people the page was not found, with some useful links, a search box, or whatever you want. Easy: invalid URLs will return the 404 response code.
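    If you want a starting point, a bare-bones 404.html could be as simple as this (the /sitemap.html link is just a placeholder; swap in your own links and search box):

    ```
    <html>
    <head><title>Page Not Found</title></head>
    <body>
    <h1>Page Not Found</h1>
    <p>Sorry, that page doesn't exist. Try the
    <a href="/">homepage</a> or the <a href="/sitemap.html">sitemap</a>.</p>
    </body>
    </html>
    ```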

    Happy?
     
    sweetfunny, Aug 25, 2008 IP
  5. webmindz24

    webmindz24 Peon
