
Sitemap revealed that host was blocking Googlebot

Discussion in 'Google Sitemaps' started by jennyj, Jul 22, 2006.

  1. #1
    I have been using Google Sitemaps for a few weeks, and consistently got this error:
    "Potential Indexing Problems
    We can't currently index your site because of an unreachable error"

    I had a look through my raw log files, only to find that Googlebot, msnbot and Yahoo! Slurp all get a 403 Forbidden error.
    The site works correctly for ordinary users, and validates as strict XHTML.
    http://www.mini-organic.co.uk
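
    For anyone who wants to check their own raw logs the same way, here is a minimal sketch (the filename access.log and the Apache combined log format are assumptions; adjust for your host's log layout):

        # Count HTTP status codes per search-engine bot in a combined-format log.
        import re
        from collections import Counter

        BOTS = ("Googlebot", "msnbot", "Slurp")
        hits = Counter()

        with open("access.log") as log:
            for line in log:
                status = re.search(r'" (\d{3}) ', line)  # status follows the quoted request
                if status is None:
                    continue
                for bot in BOTS:
                    if bot in line:
                        hits[(bot, status.group(1))] += 1

        for (bot, code), count in sorted(hits.items()):
            print(f"{bot}: HTTP {code} x {count}")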


    It would appear that Fasthosts (who claim to be the No. 1 hosting company in the UK) are blocking all the spiders. Their 'customer support' have claimed via email that they do not block any spiders, but looking through my raw log files, I can see that when Googlebot, msnbot or Yahoo! Slurp come to visit, they get a 403 Forbidden error.
    Amongst the comments from 'customer support':

    "if the site works correctly for users there is no problem"

    (Missing the point entirely: how will users find the site if the search engines can't index it?)

    "The 403 errors are generated if there is no default document in a given folder. If you have disabled friendly http error messages in IE this will show as 'directory listing denied'."

    (Again, missing the point. First, the file that is returning a 403 error to the spiders is index.asp, not a folder; second, it is not a user browsing the site, but a search engine spider; and third, I'm pretty sure that search engine spiders do not browse with IE.)

    "We do not block spiders from visiting our sites - most people want their sites to be picked up by spiders and we even sell a Traffic Driver service to help promote sites in search enginers so this something we would not be blocking."

    (Maybe they are just trying to push their customers into using this 'Traffic Driver' service. It is expensive, and it claims to submit sites to '400 search engines a month', which doesn't sound like good SEO practice to me: the major search engines tell you to submit once, and multiple submissions can get you blacklisted.)


    I did some testing of my own using the tool at http://www.smart-it-consulting.com/internet/google/googlebot-spoofer/index.htm , which shows what each search engine bot 'sees' for a given site.

    It returned a 403 error for every bot I tested.

    I then uploaded the same site to a subdirectory of a site hosted elsewhere, ran the Googlebot spoofer again, and found that each of the bots could now access the site correctly.
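
    The same check can be scripted without the online tool. Here is a minimal sketch using Python's standard library (using the home page as the test URL is an assumption, and the user-agent strings are the typical bot identifiers of the time):

        # Request one URL with several User-Agent headers and compare the
        # HTTP status code each one receives.
        import urllib.error
        import urllib.request

        URL = "http://www.mini-organic.co.uk/"
        USER_AGENTS = {
            "browser": "Mozilla/5.0 (Windows; U; Windows NT 5.1)",
            "Googlebot": "Googlebot/2.1 (+http://www.google.com/bot.html)",
            "msnbot": "msnbot/1.0 (+http://search.msn.com/msnbot.htm)",
            "Slurp": "Mozilla/5.0 (compatible; Yahoo! Slurp)",
        }

        for name, agent in USER_AGENTS.items():
            request = urllib.request.Request(URL, headers={"User-Agent": agent})
            try:
                status = urllib.request.urlopen(request).getcode()
            except urllib.error.HTTPError as error:
                status = error.code  # 4xx/5xx responses raise HTTPError
            print(f"{name}: HTTP {status}")

    A healthy server returns the same 200 for every entry; a server behaving like this one returns 200 for the browser string and 403 for the bots.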

    I don't really want to move hosts, as I have had problems finding a host with good uptime at a reasonable cost, but if I can't get this resolved, I may have to.

    Any suggestions?

    I'm just glad that Google Sitemaps alerted me to the problem, but now I need to fix it!

    JennyJ
     
    jennyj, Jul 22, 2006 IP
  2. MaxPowers

    #2
    403 is a server-generated error: something on the server is blocking the spiders. If it isn't one of your .htaccess files specifically blocking the spiders, then it may be in the httpd.conf file that controls Apache startup. If you have never had access to that file (and on shared hosting you almost certainly haven't), then it's something set up by the server admins.
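
    For illustration, a user-agent block of roughly this shape in .htaccess or httpd.conf would produce exactly these symptoms. This is a hypothetical sketch assuming an Apache server as described above (the .asp pages suggest the host may actually run IIS, where the equivalent would live in ISAPI filter or URLScan rules), not Fasthosts' real configuration:

        # Hypothetical: tag requests whose User-Agent matches a crawler,
        # then deny them, so bots get 403 while browsers get 200.
        BrowserMatchNoCase "Googlebot|msnbot|Slurp" blocked_bot
        Order Allow,Deny
        Allow from all
        Deny from env=blocked_bot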

    If it isn't something they did (and I don't know that it is), then perhaps you have a scripting error throwing the 403. But since the same script worked on another server, I doubt that's what it is....

    PM me your URL and I can run it through a header-sniffer I made to see the results from various user-agents.
     
    MaxPowers, Jul 23, 2006 IP
  3. arcon5

    #3
    Could it be something in your robots.txt file?

    What script are you using?
     
    arcon5, Jul 24, 2006 IP
  4. Jean-Luc

    #4
    Hi,

    Is this a new problem? I can see a few pages in the Google cache dated July 14, 16, 19, ...

    Your indexed pages are all very similar. This is not good in the eyes of Google.

    Another issue is that your system includes pages for customers who want to place an order. When robots try to access these pages, your system probably detects them and refuses access (403 error).
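
    If that theory is right, the shop software itself would contain logic of roughly this shape (a hypothetical sketch in Python for illustration; the actual site runs classic ASP):

        # Hypothetical app-level bot blocking: any request whose User-Agent
        # looks like a crawler is refused with 403, matching the log files.
        BOT_SIGNATURES = ("googlebot", "msnbot", "slurp")

        def handle_request(user_agent, serve_page):
            if any(sig in user_agent.lower() for sig in BOT_SIGNATURES):
                return 403, "Forbidden"
            return 200, serve_page()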

    Jean-Luc
     
    Jean-Luc, Jul 24, 2006 IP