Google, phpBB and SID

Discussion in 'Search Engine Optimization' started by serjio28, Mar 14, 2007.

  1. #1
    Hi All!

    I launched my first website a few weeks back and now I'm facing one issue. When I first added a forum (phpBB, as in the subject) to the site, I didn't take care to remove the SID from the links shown to crawlers.

    When I noticed that Google was coming to my site using links like /ntopic6.html&sid=b253743ba36e2d65666f35195bf4c506, I updated my forum's code so that it doesn't hand out a SID to crawlers. That was about a week ago.

    But I can see that the Google bots are still coming to my site using links with a SID.

    Is there any way to prevent Googlebot from using these links to access my site?

    Thanks,
     
    serjio28, Mar 14, 2007 IP
  2. skweb

    skweb Peon

    Messages:
    105
    Likes Received:
    5
    Best Answers:
    0
    Trophy Points:
    0
    #2
    I am not much of a techie but I also read about SIDs and then made some changes on my forum as advised on a webmaster site. Google, however, still continues to index many pages with session IDs.

    I have not seen any penalty or problem in ranking or traffic. So my advice is to simply focus on running your forum and if you have good content, you should be OK.
     
    skweb, Mar 14, 2007 IP
  3. fouadz

    fouadz Peon

    Messages:
    132
    Likes Received:
    2
    Best Answers:
    0
    Trophy Points:
    0
    #3
    fouadz, Mar 14, 2007 IP
  4. serjio28

    serjio28 Peon

    Messages:
    37
    Likes Received:
    1
    Best Answers:
    0
    Trophy Points:
    0
    #4
    There is one thing I am worried about: I'm afraid these requests could overwhelm my outbound traffic. I am using a cheap hosting plan with limited monthly traffic, and when I look over my Apache server's log file I can see that Googlebot has requested the same pages many times. Please look at this example from the log file:

    ftopic7.html&sid=22ac3c53a9f39d46eb1038667476c769
    ftopic7.html&sid=fc1f299ec40f2db9134ab44b444478ed
    ftopic7.html&sid=8065230200354f953719df245ac32995

    The same page was requested three times, but with different SIDs. The page is about 40K in size, and since Googlebot accesses each page many times, I'm afraid it could push me over my traffic limit.

    There is another little issue. I found that Googlebot has terribly inflated the view counters for each of my forum threads, so my forum looks like it has a fantastic number of visitors :D
     
    serjio28, Mar 14, 2007 IP
  5. serjio28

    serjio28 Peon

    Messages:
    37
    Likes Received:
    1
    Best Answers:
    0
    Trophy Points:
    0
    #5
    Thank you for the link, but I have always used mod_rewrite on my forum, and all of the pages which should be indexed by Google look like static pages.

    Now I am looking for a way to stop Google from coming to my site with SID links, and to force it to use my forum's new links, which don't have a SID at the end.
     
    serjio28, Mar 14, 2007 IP
  6. gravis

    gravis Peon

    Messages:
    145
    Likes Received:
    5
    Best Answers:
    0
    Trophy Points:
    0
    #6
    You could probably give robots.txt a try to prevent further crawling by Google. They will continue to crawl unless the URLs are specifically blocked using robots.txt or the pages return 404 headers. Maybe there is a solution on the phpBB forums somewhere...
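
    If you go the robots.txt route, the pattern only needs to match the "sid=" part of the URL. A minimal sketch is below, assuming Googlebot (which understands the * wildcard as an extension to the basic robots.txt standard); note that this only stops future crawling and won't remove the URLs Google has already collected.

    # Rough sketch of a robots.txt entry: block any URL containing "sid=" for Googlebot
    User-agent: Googlebot
    Disallow: /*sid=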
     
    gravis, Mar 14, 2007 IP
  7. selectsplat

    selectsplat Well-Known Member

    Messages:
    2,559
    Likes Received:
    121
    Best Answers:
    0
    Trophy Points:
    190
    #7
    If you have made sure that no NEW pages will have the SID added to the URL, then eventually the problem will go away.

    See, Google crawls your site in two steps. The first step is what I call 'harvesting', and the second step is 'parsing'.

    During the first step, a Google bot is sent to your site, and it tries to 'harvest' every single page it can find. It will follow every URL it comes across and put all of the URLs it finds into a long list for the 'parser' to inspect.

    Then the parser comes along and reads through the content of your site, doing whatever it is that Google does. It goes through the compiled list of URLs the harvester collected one by one, reading through all of the pages found previously. Once the parser is done, it sends in the harvester again to make sure it didn't miss any URLs.

    The problem occurs because Google doesn't accept any cookies, so when your website attaches a SID, Google thinks it has found a new URL to parse. Therefore you end up with the same page in its list of URLs to parse more than once. For example, its list might look like...

    www.yoursite.com/index.php
    www.yoursite.com/index.php?sid=1111
    www.yoursite.com/about.php
    www.yoursite.com/about.php?sid=1111

    And every time the parser visits those pages, the bot is assigned a NEW session ID, so when it sends the harvester back in afterwards, the harvester thinks it has found NEW URLs. So now the list looks like...

    www.yoursite.com/index.php
    www.yoursite.com/index.php?sid=1111
    www.yoursite.com/index.php?sid=2222
    www.yoursite.com/about.php
    www.yoursite.com/about.php?sid=1111
    www.yoursite.com/about.php?sid=2222

    This creates an endless loop, and eventually can cause you to exceed your bandwidth limitations.

    If you have altered your site so that it no longer gives Google (or any other bot, for that matter) an SID, then it will not add any new URLs to that list. However, it still has the old URLs with SIDs in its list of URLs to parse. After a while (probably at least one Google dance, i.e. approximately 3 months), Google will realize that many of these pages are redundant and will drop the ones with the SIDs. For the time being, though, you will see the parser bot visiting your site with those old SIDs. There's not much that can be done about it, unless you have a compiled list of all the old sessions and create a bunch of 301 permanent redirects.

    The important thing, if you want your site to get back to normal, is to verify that you are no longer giving Google an SID. The easiest way to check this is with the 'change user agent' functionality in the Firefox or Opera browsers: tell your site that you are 'google' or 'googlebot' and see if it gives you an SID. Make sure to check all of the pages the bot should be able to reach.
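
    If you'd rather script that check than switch user agents in a browser, a small PHP sketch like the one below does the same thing (the URL is only a placeholder for one of your own forum pages, and it assumes allow_url_fopen is enabled):

    <?php
    // Fetch a page while pretending to be Googlebot, then look for a SID in the returned HTML.
    $context = stream_context_create(array(
        'http' => array(
            'header' => "User-Agent: Googlebot/2.1 (+http://www.google.com/bot.html)\r\n"
        )
    ));
    $html = file_get_contents('http://www.yoursite.com/index.php', false, $context);

    if ($html === false) {
        die("Could not fetch the page\n");
    }

    echo (strpos($html, 'sid=') === false)
        ? "OK: no SID handed out to the bot\n"
        : "Problem: a SID is still being appended\n";
    ?>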

    Hope this helps.

    Here are about 119 different threads on another forum where I discussed SIDs and Google in an older ecommerce product:
    http://www.google.com/search?hl=en&...rs+sid+site:forums.oscommerce.com&btnG=Search
     
    selectsplat, Mar 14, 2007 IP
    lazyleo likes this.
  8. serjio28

    serjio28 Peon

    Messages:
    37
    Likes Received:
    1
    Best Answers:
    0
    Trophy Points:
    0
    #8
    Not sure that I can wait for three months :eek: What if I drop these requests with a 404 error?
    I can do it using the .htaccess & mod_rewrite method, so when Googlebot tries to access a page like ftopic7.html&sid=22ac3c53a9f39d46eb1038667476c769 it will get a 404 error, but pages like ftopic7.html will still be reachable without problems.

    e.g.:
    RewriteRule ^.*sid=.*$ /error404.html [L,R=404]

    Thank you for the hint. I have patched phpBB to use a new function for building URIs. Now, if the user-agent field mentions Googlebot or another bot, the function doesn't add a SID to the end of the URL. I also modified the function to log 'user agent - created URL'. Looking over this log, I can see that when Googlebot arrives at my forum it receives SID-less URLs only. But your way is good too. :)
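
    For reference, the logging itself can be as simple as something like the line below, dropped into that URL-building function just before it returns (a sketch only; the log file path is just an example):

    // Append "user agent - generated URL" to a debug log (message type 3 appends to the given file)
    error_log($_SERVER['HTTP_USER_AGENT'] . ' - ' . $url . "\n", 3, '/tmp/sid_debug.log');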
     
    serjio28, Mar 14, 2007 IP
  9. selectsplat

    selectsplat Well-Known Member

    Messages:
    2,559
    Likes Received:
    121
    Best Answers:
    0
    Trophy Points:
    190
    #9
    I'm not sure this will help. If you are saying that bandwidth is the problem, then depending on your 404 page, this will still consume bandwidth, will it not?

    Even though I'm an experienced programmer, I haven't done a lot with mod_rewrite. Can you replace the sid=.* with nothing so it just strips the sid parameter out of the query string? That would be the most beneficial.

    And I'm sure you have realized that you have to make sure you are only rewriting the SID when the user agent is Google, otherwise you'll end up removing the sid for everyone.

    No matter what you do, Google is still going to visit all of the URLs it has previously harvested. You can redirect it, or mod_rewrite it, or whatever you want, but unless you're going to block Google altogether, they are still going to visit those URLs. I'd let it run its course and get back to normal of its own accord.
     
    selectsplat, Mar 14, 2007 IP
  10. serjio28

    serjio28 Peon

    Messages:
    37
    Likes Received:
    1
    Best Answers:
    0
    Trophy Points:
    0
    #10
    Let's look at an example. Right now Googlebot retrieves one 35K page 10 times, so it burns 350K of my outbound traffic on useless data. If I return a 404 response for those requests, it will get back only about 1K each time, so for the example above it would cost only about 10K. We can't do anything about inbound traffic, but outbound data is still in our hands.

    Yes, it is a really good hint. After I changed the function, I spent a few days testing my forum with various browsers and user agents, and I actually fixed a lot of bugs that would have interfered with using my forum.
     
    serjio28, Mar 14, 2007 IP
  11. selectsplat

    selectsplat Well-Known Member

    Messages:
    2,559
    Likes Received:
    121
    Best Answers:
    0
    Trophy Points:
    190
    #11
    I think this may save you some bandwidth short term, but I'd be a little nervous that it would affect the Google listing of the underlying page (the page without the SID).

    With bandwidth as cheap as it is these days, I'm not sure risking the placement of those pages in Google is worth the few extra dollars you'd have to pay to get a 100GB transfer account with your host.
     
    selectsplat, Mar 15, 2007 IP
  12. brokensoft

    brokensoft Active Member

    Messages:
    214
    Likes Received:
    1
    Best Answers:
    0
    Trophy Points:
    55
    #12
    Use an .htaccess rule to kill the session ID.
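
    For example, a rule along these lines (a rough sketch only; adjust the pattern to your own URL scheme, and note the user-agent condition so that normal visitors keep their sessions) 301-redirects the SID-carrying URLs back to their clean form:

    # Rough sketch: 301-redirect path-style "&sid=..." URLs (like ftopic7.html&sid=...)
    # back to the clean URL, but only for search engine bots so normal visitors keep their sessions.
    RewriteEngine On
    RewriteCond %{HTTP_USER_AGENT} (googlebot|msnbot|slurp) [NC]
    RewriteRule ^(.*)&sid=[0-9a-f]+$ /$1 [R=301,L]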
     
    brokensoft, Mar 17, 2007 IP
  13. mhmdkhamis

    mhmdkhamis Well-Known Member

    Messages:
    1,097
    Likes Received:
    12
    Best Answers:
    0
    Trophy Points:
    145
    #13
    mhmdkhamis, Mar 17, 2007 IP
  14. kapengbarako

    kapengbarako Peon

    Messages:
    914
    Likes Received:
    28
    Best Answers:
    0
    Trophy Points:
    0
    #14
    kapengbarako, Mar 18, 2007 IP
  15. serjio28

    serjio28 Peon

    Messages:
    37
    Likes Received:
    1
    Best Answers:
    0
    Trophy Points:
    0
    #15
    Thanks for the hint, but users who have disabled cookies will also be unable to log in to my forum, so it isn't a clean solution for me.
     
    serjio28, Mar 19, 2007 IP
  16. seolion

    seolion Active Member

    Messages:
    1,495
    Likes Received:
    97
    Best Answers:
    0
    Trophy Points:
    90
    #16
    The Guest Sessions mod is a great one.

    I am using a similar one on my forums.
     
    seolion, Mar 19, 2007 IP
  17. serjio28

    serjio28 Peon

    Messages:
    37
    Likes Received:
    1
    Best Answers:
    0
    Trophy Points:
    0
    #17
    As a solution, I have applied a method that prevents the SID from being handed out to search crawlers.
    Please look at the code below; it works fine for me.

    function append_sid($url, $non_html_amp = false)
    {
            // $_SERVER is a superglobal, so only $SID needs to be declared global
            global $SID;

            if ( !empty($SID) &&
                 !preg_match('#sid=#', $url) &&
                 !strstr($_SERVER['HTTP_USER_AGENT'], 'Googlebot') &&
                 !strstr($_SERVER['HTTP_USER_AGENT'], 'msnbot') &&
                 !strstr($_SERVER['HTTP_USER_AGENT'], 'spider') )
            {
                    // Append the SID with '?', '&' or '&amp;' depending on the URL and context
                    $url .= ( ( strpos($url, '?') !== false ) ? ( ( $non_html_amp ) ? '&' : '&amp;' ) : '?' ) . $SID;
            }

            return $url;
    }
    
    PHP:
    So preventing the SID from being handed to crawlers isn't an issue. I created this thread only to find out how to stop Google from using the old forum links that were indexed with the sid suffix.
     
    serjio28, Mar 20, 2007 IP
  18. selectsplat

    selectsplat Well-Known Member

    Messages:
    2,559
    Likes Received:
    121
    Best Answers:
    0
    Trophy Points:
    190
    #18
    Yeah, the main problem with that code is that there are about 100 more user agents you'll want to do the same thing for. Here's some old osCommerce code that I helped write a few years ago that might help. It builds a huge list of user agents and puts them all into an array.


    
    
    // Add more Spiders as you find them.  MAKE SURE THEY ARE LOWER CASE!
    
    $spiders = array("almaden.ibm.com", "abachobot", "aesop_com_spiderman", "appie 1.1", "ah-ha.com", "acme.spider", "ahoy", "iron33", "ia_archiver", "acoon", "spider.batsch.com", "crawler", "atomz", "antibot", "wget", "roach.smo.av.com-1.0", "altavista-intranet", "asterias2.0", "augurfind", "fluffy", "zyborg", "wire", "wscbot", "yandex", "yellopet-spider", "libwww-perl", "speedfind", "supersnooper", "webwombat", "marvin/infoseek", "whizbang", "nazilla", "uk searcher spider", "esismartspider", "surfnomore ", "kototoi", "scrubby", "baiduspider", "bannana_bot", "bdcindexer", "docomo", "fast-webcrawler", "frooglebot", "geobot", "googlebot", "googlebot/2.1", "henrythemiragorobot", "rabot", "pjspider", "architextspider", "henrythemiragorobot", "gulliver", "deepindex", "dittospyder", "jack", "infoseek", "sidewinder", "lachesis", "moget/1.0", "nationaldirectory-webspider", "picosearch", "naverrobot", "ncsa beta", "moget/2.0", "aranha", "netresearchserver", "ng/1.0", "osis-project", "polybot", "xift", "nationaldirectory", "piranha", "shark", "psbot", "pinpoint", "alkalinebot", "openbot", "pompos", "teomaagent", "zyborg", "gulliver", "architext", "fast-webcrawler", "seventwentyfour", "toutatis", "iltrovatore-setaccio", "sidewinder", "incywincy", "hubater", "slurp/si", "slurp", "partnersite", "diibot", "nttdirectory_robot", "griffon", "geckobot", "kit-fireball", "gencrawler", "ezresult", "mantraagent", "t-rex", "mp3bot", "ip3000", "lnspiderguy", "architectspider", "steeler/1.3", "szukacz", "teoma", "maxbot.com", "bradley", "infobee", "teoma_agent1", "turnitinbot", "vagabondo", "w3c_validator", "zao/0", "zyborg/1.0", "netresearchserver", "slurp", "ask jeeves", "ia_archiver", "scooter", "mercator", "crawler@fast", "crawler", "infoseek sidewinder", "lycos_spider", "fluffy the spider", "ultraseek", "anthill", "walhello appie", "arachnophilia", "arale", "araneo ", "aretha", "arks", "aspider", "atn worldwide", "atomz", "backrub", "big brother", "bjaaland", "blackwidow", "die blinde kuh", "bloodhound", "borg-bot", "bright.net caching robot", "bspider", "cactvs chemistry spider", "calif", "cassandra", "digimarc marcspider/cgi", "checkbot", "christcrawler.com", "churl", "cienciaficcion.net", "cmc/0.01", "collective", "combine system", "conceptbot", "coolbot", "web core / roots", "xyleme robot", "internet cruiser robot", "cusco", "cyberspyder link test", "deweb(c) katalog/index", "dienstspider", "digger", "digital integrity robot", "direct hit grabber", "dnabot", "download express", "dragonbot", "dwcp (dridus' web cataloging project)", "e-collector", "ebiness", "eit link verifier robot", "elfinbot", "emacs-w3 search engine", "ananzi", "esther", "evliya celebi", "nzexplorer", "fastcrawler", "fluid dynamics search engine robot", "felix ide", "wild ferret", "web hopper", "fetchrover", "fido", "hamahakki", "kit-fireball", "fish search", "fouineur", "robot francoroute", "freecrawl", "funnelweb", "gammaspider", "focusedcrawler", "gazz", "gcreep", "getbot", "geturl", "golem", "googlebot", "grapnel/0.01 experiment", "griffon", "gromit", "northern light gulliver", "gulper bot", "hambot", "harvest", "havindex", "html index", "hometown spider pro", "wired digital", "dig", "htmlgobble", "hyper-decontextualizer", "iajabot", "ibm_planetwide", "popular iconoclast", "ingrid", "imagelock", "informant", "infoseek robot 1.0", "infoseek sidewinder", "infospiders", "inspector web", "intelliagent", "i, robot", "israeli-search", "javabee", "jbot java web robot", "jcrawler", "jobo java web robot", "jobot", "joebot", "jumpstation", 
"image.kapsi.net", "katipo", "kdd-explorer", "kilroy", "ko_yappo_robot", "labelgrabber", "larbin", "legs", "link validator", "linkscan", "linkwalker", "lockon", "logo.gif crawler", "lycos", "mac wwwworm", "magpie", "marvin/infoseek", "mattie", "mediafox", "merzscope", "nec-meshexplorer", "mindcrawler", "mnogosearch", "momspider", "monster", "motor", "muncher", "ferret", "mwd.search", "internet shinchakubin", "netcarta webmap engine", "netmechanic", "netscoop", "newscan-online", "nhse web forager", "nomad", "the northstar robot", "occam", "hku www octopus", "openfind data gatherer", "orb search", "pack rat", "pageboy", "parasite", "patric", "pegasus", "the peregrinator", "perlcrawler 1.0", "phantom", "phpdig", "piltdownman", "pimptrain.com", "pioneer", "html_analyzer", "portal juice spider", "pgp key agent", "plumtreewebaccessor", "poppi", "portalb spider", "psbot", "getterroboplus puu", "the python robot", "raven search", "rbse spider", "resume robot", "roadhouse", "road runner", "robbie", "computingsite robi/1.0", "robocrawl spider", "robofox", "robozilla", "roverbot", "rules", "safetynet robot", "search.aus-au.com", "sleek", "searchprocess", "senrigan", "sg-scout", "shagseeker", "shai'hulud", "sift", "simmany", "site valet", "open text", "sitetech-rover", "skymob.com", "slcrawler", "smart spider", "snooper", "solbot", "speedy spider", "spider_monkey", "spiderbot", "spiderline", "spiderman", "spiderview", "spry wizard robot", "site searcher", "suke", "suntek", "sven", "tach black widow", "tarantula", "tarspider", "tcl w3", "techbot", "templeton", "teomatechnologies", "titin", "titan", "tkwww", "tlspider", "ucsd", "udmsearch", "url check", "url spider pro", "valkyrie", "verticrawl", "victoria", "vision-search", "voyager", "vwbot", "the nwi robot", "w3m2", "wallpaper", "the world wide web wanderer", "w@pspider", "webbandit", "webcatcher", "webcopy", "webfoot", "weblayers", "weblinker", "webmirror", "moose", "webquest", "digimarc marcspider", "webreaper", "webs", "websnarf", "webspider", "webvac", "webwalk", "webwalker", "webwatch", "wget", "whatuseek winona", "whowhere", "weblog monitor", "w3mir", "webstolperer", "web wombat", "the world wide web worm", "wwwc", "webzinger", "xget", "nederland.zoek", "mantraagent", "moget", "t-h-u-n-d-e-r-s-t-o-n-e", "muscatferret", "voilabot", "sleek spider", "kit_fireball", "semanticdiscovery/0.1", "inktomisearch.com ", "webcrawler");
    
    
    
    // Get the user agent and force it to lower case just once
    $useragent = strtolower(getenv("HTTP_USER_AGENT"));

    foreach ($spiders as $Val) {
        // The user agent string is the haystack and the spider name is the needle
        if (strpos($useragent, $Val) !== false) {
            // Found a spider -- kill the sid/sess.
            // Edit out one of these as necessary depending upon your version of html_output.php
            // $sess = NULL;
            $sid = NULL;
            break;
        }
    }

    // End spider stopper code
    
    PHP:
     
    selectsplat, Mar 20, 2007 IP
  19. selectsplat

    selectsplat Well-Known Member

    Messages:
    2,559
    Likes Received:
    121
    Best Answers:
    0
    Trophy Points:
    190
    #19
    As for the 404 idea: you can use the .htaccess file to remove the SID from the URL once Google has re-visited, and that may help Google realize that the URL is a duplicate, but you can't stop them from visiting the old URLs with the SIDs attached that they have already collected. And it's the actual visit that is costing you bandwidth.

    So there's really not much you can do except make sure it doesn't happen again, and wait it out.
     
    selectsplat, Mar 20, 2007 IP