Google, phpBB and SID

Discussion in 'Search Engine Optimization' started by serjio28, Mar 14, 2007.

  1. #1
    Hi All!

    I launched my first website a few weeks back and now I'm facing one issue. When I first added a forum (phpBB, as in the subject) to the site, I didn't take care to remove the SID from the links shown to crawlers.

    When I noticed that Google was coming to my site using links like /ntopic6.html&sid=b253743ba36e2d65666f35195bf4c506, I updated my forum's code so that it doesn't hand out a SID to crawlers. That was about a week ago.

    But I can see that the Google bots are still coming to my site using links with a SID.

    Is there any way to prevent Googlebot from using these links to access my site?

    Thanks,
     
    serjio28, Mar 14, 2007 IP
  2. skweb

    skweb Peon

    Messages:
    105
    Likes Received:
    5
    Best Answers:
    0
    Trophy Points:
    0
    #2
    I am not much of a techie but I also read about SIDs and then made some changes on my forum as advised on a webmaster site. Google, however, still continues to index many pages with session IDs.

    I have not seen any penalty or problem in ranking or traffic. So my advice is to simply focus on running your forum and if you have good content, you should be OK.
     
    skweb, Mar 14, 2007 IP
  3. fouadz

    fouadz Peon

    Messages:
    132
    Likes Received:
    2
    Best Answers:
    0
    Trophy Points:
    0
    #3
    fouadz, Mar 14, 2007 IP
  4. serjio28

    serjio28 Peon

    Messages:
    37
    Likes Received:
    1
    Best Answers:
    0
    Trophy Points:
    0
    #4
    There is one thing I am worried about: I'm afraid these requests could overwhelm my outbound traffic. I am using a cheap hosting plan with limited monthly traffic, and when I look over my Apache server's log file I can see that Googlebot has requested the same pages many times. Please look at this example from the log file:

    ftopic7.html&sid=22ac3c53a9f39d46eb1038667476c769
    ftopic7.html&sid=fc1f299ec40f2db9134ab44b444478ed
    ftopic7.html&sid=8065230200354f953719df245ac32995

    The same page was requested three times, but with different SIDs. The page is about 40K in size, and since Googlebot accesses each page many times, I'm afraid it could push me over my traffic limit.

    There is another little issue. I found that Googlebot has terribly inflated the view counters for each of my forum threads, so my forum looks like it has a fantastic number of visitors :D
     
    serjio28, Mar 14, 2007 IP
  5. serjio28

    serjio28 Peon

    Messages:
    37
    Likes Received:
    1
    Best Answers:
    0
    Trophy Points:
    0
    #5
    Thank you for the link, but I have always used mod_rewrite on my forum, and all of the pages which should be indexed by Google look like static pages.

    Now I am looking for a way to stop Google from coming to my site with SID links, and to force it to use my forum's new links, which don't have a SID at the end.
     
    serjio28, Mar 14, 2007 IP
  6. gravis

    gravis Peon

    Messages:
    145
    Likes Received:
    5
    Best Answers:
    0
    Trophy Points:
    0
    #6
    You could probably give robots.txt a try to prevent further crawling by Google. They will continue to crawl unless the URLs are specifically blocked using robots.txt or the pages return 404 headers. Maybe there is a solution on the phpBB forums somewhere...
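
    If you go the robots.txt route, the pattern only needs to match the "sid=" part of the URL. A minimal sketch is below, assuming Googlebot (which understands the * wildcard as an extension to the basic robots.txt standard); note that this only stops future crawling and won't remove the URLs Google has already collected.

    # Rough sketch of a robots.txt entry: block any URL containing "sid=" for Googlebot
    User-agent: Googlebot
    Disallow: /*sid=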
     
    gravis, Mar 14, 2007 IP
  7. selectsplat

    selectsplat Well-Known Member

    Messages:
    2,559
    Likes Received:
    121
    Best Answers:
    0
    Trophy Points:
    190
    #7
    If you have made sure that no NEW pages will have the SID added to the URL, then eventually the problem will go away.

    See, Google crawls your site in two steps. The first step is what I call 'harvesting', and the second step is 'parsing'.

    During the first step, a Google bot is sent to your site, and it tries to 'harvest' every single page it can find. It will follow every URL it comes across and put all of the URLs it finds into a long list for the 'parser' to inspect.

    Then the parser comes along and reads through the content of your site, doing whatever it is that Google does. It goes through the compiled list of URLs the harvester collected one by one, reading through all of the pages found previously. Once the parser is done, it sends in the harvester again to make sure it didn't miss any URLs.

    The problem occurs because Google doesn't accept any cookies, so when your website attaches a SID, Google thinks it has found a new URL to parse. Therefore you end up with the same page in its list of URLs to parse more than once. For example, its list might look like...

    www.yoursite.com/index.php
    www.yoursite.com/index.php?sid=1111
    www.yoursite.com/about.php
    www.yoursite.com/about.php?sid=1111

    And every time the parser visits those pages, the bot is assigned a NEW session ID, so when it sends the harvester back in afterwards, the harvester thinks it has found NEW URLs. So now the list looks like...

    www.yoursite.com/index.php
    www.yoursite.com/index.php?sid=1111
    www.yoursite.com/index.php?sid=2222
    www.yoursite.com/about.php
    www.yoursite.com/about.php?sid=1111
    www.yoursite.com/about.php?sid=2222

    This creates an endless loop, and eventually can cause you to exceed your bandwidth limitations.

    If you have altered your site so that it no longer gives Google (or any other bot, for that matter) an SID, then it will not add any new URLs to that list. However, it still has the old URLs with SIDs in its list of URLs to parse. After a while (probably at least one Google dance, i.e. approximately 3 months), Google will realize that many of these pages are redundant and will drop the ones with the SIDs. For the time being, though, you will see the parser bot visiting your site with those old SIDs. There's not much that can be done about it, unless you have a compiled list of all the old sessions and create a bunch of 301 permanent redirects.

    The important thing, if you want your site to get back to normal, is to verify that you are no longer giving Google an SID. The easiest way to check this is with the 'change user agent' functionality in the Firefox or Opera browsers: tell your site that you are 'google' or 'googlebot' and see if it gives you an SID. Make sure to check all of the pages the bot should be able to reach.
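
    If you'd rather script that check than switch user agents in a browser, a small PHP sketch like the one below does the same thing (the URL is only a placeholder for one of your own forum pages, and it assumes allow_url_fopen is enabled):

    <?php
    // Fetch a page while pretending to be Googlebot, then look for a SID in the returned HTML.
    $context = stream_context_create(array(
        'http' => array(
            'header' => "User-Agent: Googlebot/2.1 (+http://www.google.com/bot.html)\r\n"
        )
    ));
    $html = file_get_contents('http://www.yoursite.com/index.php', false, $context);

    if ($html === false) {
        die("Could not fetch the page\n");
    }

    echo (strpos($html, 'sid=') === false)
        ? "OK: no SID handed out to the bot\n"
        : "Problem: a SID is still being appended\n";
    ?>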

    Hope this helps.

    Here are about 119 different threads on another forum where I discussed SIDs and Google in an older ecommerce product:
    http://www.google.com/search?hl=en&...rs+sid+site:forums.oscommerce.com&btnG=Search
     
    selectsplat, Mar 14, 2007 IP
    lazyleo likes this.
  8. serjio28

    serjio28 Peon

    Messages:
    37
    Likes Received:
    1
    Best Answers:
    0
    Trophy Points:
    0
    #8
    Not sure that I can wait for three months :eek: What if I drop these requests with a 404 error?
    I can do it using the .htaccess & mod_rewrite method, so when Googlebot tries to access a page like ftopic7.html&sid=22ac3c53a9f39d46eb1038667476c769 it will get a 404 error, but pages like ftopic7.html will still be reachable without problems.

    e.g.:
    RewriteRule ^.*sid=.*$ /error404.html [L,R=404]

    Thank you for the hint. I have patched phpBB to use a new function for building URIs. Now, if the user-agent field mentions Googlebot or another bot, the function doesn't add a SID to the end of the URL. I also modified the function to log 'user agent - created URL'. Looking over this log, I can see that when Googlebot arrives at my forum it receives SID-less URLs only. But your way is good too. :)
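
    For reference, the logging itself can be as simple as something like the line below, dropped into that URL-building function just before it returns (a sketch only; the log file path is just an example):

    // Append "user agent - generated URL" to a debug log (message type 3 appends to the given file)
    error_log($_SERVER['HTTP_USER_AGENT'] . ' - ' . $url . "\n", 3, '/tmp/sid_debug.log');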
     
    serjio28, Mar 14, 2007 IP
  9. selectsplat

    selectsplat Well-Known Member

    Messages:
    2,559
    Likes Received:
    121
    Best Answers:
    0
    Trophy Points:
    190
    #9
    I'm not sure this will help. If you are saying that bandwidth is the problem, then depending on your 404 page, this will still consume bandwidth, will it not?

    Even though I'm an experienced programmer, I haven't done a lot with mod_rewrite. Can you replace the sid=.* with nothing so it just strips the sid parameter out of the query string? That would be the most beneficial.

    And I'm sure you have realized that you have to make sure you are only rewriting the SID when the user agent is Google, otherwise you'll end up removing the sid for everyone.

    No matter what you do, Google is still going to visit all of the URLs it has previously harvested. You can redirect it, or mod_rewrite it, or whatever you want, but unless you're going to block Google altogether, they are still going to visit those URLs. I'd let it run its course and get back to normal of its own accord.
     
    selectsplat, Mar 14, 2007 IP
  10. serjio28

    serjio28 Peon

    Messages:
    37
    Likes Received:
    1
    Best Answers:
    0
    Trophy Points:
    0
    #10
    Let's look at an example. Right now Googlebot retrieves one 35K page 10 times, so it burns 350K of my outbound traffic on useless data. If I return a 404 response for those requests, it will get back only about 1K each time, so for the example above it would cost only about 10K. We can't do anything about inbound traffic, but outbound data is still in our hands.

    Yes, it is a really good hint. After I changed the function, I spent a few days testing my forum with various browsers and user agents, and I actually fixed a lot of bugs that would have interfered with using my forum.
     
    serjio28, Mar 14, 2007 IP
  11. selectsplat

    selectsplat Well-Known Member

    Messages:
    2,559
    Likes Received:
    121
    Best Answers:
    0
    Trophy Points:
    190
    #11
    I think this may save you some bandwidth short term, but I'd be a little nervous that it would affect the Google listing of the underlying page (the page without the SID).

    With bandwidth as cheap as it is these days, I'm not sure risking the placement of those pages in Google is worth the few extra dollars you'd have to pay to get a 100GB transfer account with your host.
     
    selectsplat, Mar 15, 2007 IP
  12. brokensoft

    brokensoft Active Member

    Messages:
    214
    Likes Received:
    1
    Best Answers:
    0
    Trophy Points:
    55
    #12
    Use an .htaccess rule to kill the session ID.
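
    For example, a rule along these lines (a rough sketch only; adjust the pattern to your own URL scheme, and note the user-agent condition so that normal visitors keep their sessions) 301-redirects the SID-carrying URLs back to their clean form:

    # Rough sketch: 301-redirect path-style "&sid=..." URLs (like ftopic7.html&sid=...)
    # back to the clean URL, but only for search engine bots so normal visitors keep their sessions.
    RewriteEngine On
    RewriteCond %{HTTP_USER_AGENT} (googlebot|msnbot|slurp) [NC]
    RewriteRule ^(.*)&sid=[0-9a-f]+$ /$1 [R=301,L]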
     
    brokensoft, Mar 17, 2007 IP
  13. mhmdkhamis

    mhmdkhamis Well-Known Member

    Messages:
    1,097
    Likes Received:
    12
    Best Answers:
    0
    Trophy Points:
    145
    #13
    mhmdkhamis, Mar 17, 2007 IP
  14. kapengbarako

    kapengbarako Peon

    Messages:
    914
    Likes Received:
    28
    Best Answers:
    0
    Trophy Points:
    0
    #14
    kapengbarako, Mar 18, 2007 IP
  15. serjio28

    serjio28 Peon

    Messages:
    37
    Likes Received:
    1
    Best Answers:
    0
    Trophy Points:
    0
    #15
    Thanks for the hint, but users who have disabled cookies will also be unable to log in to my forum, so it isn't a clean solution for me.
     
    serjio28, Mar 19, 2007 IP
  16. seolion

    seolion Active Member

    Messages:
    1,495
    Likes Received:
    97
    Best Answers:
    0
    Trophy Points:
    90
    #16
    The Guest Sessions mod is a great one.

    I am using a similar one on my forums.
     
    seolion, Mar 19, 2007 IP
  17. serjio28

    serjio28 Peon

    Messages:
    37
    Likes Received:
    1
    Best Answers:
    0
    Trophy Points:
    0
    #17
    As a solution, I have applied a method that prevents the SID from being handed out to search crawlers.
    Please look at the code below; it works fine for me.

    function append_sid($url, $non_html_amp = false)
    {
            // $_SERVER is a superglobal, so only $SID needs to be declared global
            global $SID;

            if ( !empty($SID) &&
                 !preg_match('#sid=#', $url) &&
                 !strstr($_SERVER['HTTP_USER_AGENT'], 'Googlebot') &&
                 !strstr($_SERVER['HTTP_USER_AGENT'], 'msnbot') &&
                 !strstr($_SERVER['HTTP_USER_AGENT'], 'spider') )
            {
                    // Append the SID with '?', '&' or '&amp;' depending on the URL and context
                    $url .= ( ( strpos($url, '?') !== false ) ? ( ( $non_html_amp ) ? '&' : '&amp;' ) : '?' ) . $SID;
            }

            return $url;
    }
    
    PHP:
    So preventing the SID from being handed to crawlers isn't an issue. I created this thread only to find out how to stop Google from using the old forum links that were indexed with the sid suffix.
     
    serjio28, Mar 20, 2007 IP
  18. selectsplat

    selectsplat Well-Known Member

    Messages:
    2,559
    Likes Received:
    121
    Best Answers:
    0
    Trophy Points:
    190
    #18
    Yeah, the main problem with that code is that there are about 100 more user agents you'll want to do the same thing for. Here's some old osCommerce code that I helped write a few years ago that might help. It builds a huge list of user agents and puts them all into an array.


    
    
    // Add more Spiders as you find them.  MAKE SURE THEY ARE LOWER CASE!
    
    $spiders = array("almaden.ibm.com", "abachobot", "aesop_com_spiderman", "appie 1.1", "ah-ha.com", "acme.spider", "ahoy", "iron33", "ia_archiver", "acoon", "spider.batsch.com", "crawler", "atomz", "antibot", "wget", "roach.smo.av.com-1.0", "altavista-intranet", "asterias2.0", "augurfind", "fluffy", "zyborg", "wire", "wscbot", "yandex", "yellopet-spider", "libwww-perl", "speedfind", "supersnooper", "webwombat", "marvin/infoseek", "whizbang", "nazilla", "uk searcher spider", "esismartspider", "surfnomore ", "kototoi", "scrubby", "baiduspider", "bannana_bot", "bdcindexer", "docomo", "fast-webcrawler", "frooglebot", "geobot", "googlebot", "googlebot/2.1", "henrythemiragorobot", "rabot", "pjspider", "architextspider", "henrythemiragorobot", "gulliver", "deepindex", "dittospyder", "jack", "infoseek", "sidewinder", "lachesis", "moget/1.0", "nationaldirectory-webspider", "picosearch", "naverrobot", "ncsa beta", "moget/2.0", "aranha", "netresearchserver", "ng/1.0", "osis-project", "polybot", "xift", "nationaldirectory", "piranha", "shark", "psbot", "pinpoint", "alkalinebot", "openbot", "pompos", "teomaagent", "zyborg", "gulliver", "architext", "fast-webcrawler", "seventwentyfour", "toutatis", "iltrovatore-setaccio", "sidewinder", "incywincy", "hubater", "slurp/si", "slurp", "partnersite", "diibot", "nttdirectory_robot", "griffon", "geckobot", "kit-fireball", "gencrawler", "ezresult", "mantraagent", "t-rex", "mp3bot", "ip3000", "lnspiderguy", "architectspider", "steeler/1.3", "szukacz", "teoma", "maxbot.com", "bradley", "infobee", "teoma_agent1", "turnitinbot", "vagabondo", "w3c_validator", "zao/0", "zyborg/1.0", "netresearchserver", "slurp", "ask jeeves", "ia_archiver", "scooter", "mercator", "crawler@fast", "crawler", "infoseek sidewinder", "lycos_spider", "fluffy the spider", "ultraseek", "anthill", "walhello appie", "arachnophilia", "arale", "araneo ", "aretha", "arks", "aspider", "atn worldwide", "atomz", "backrub", "big brother", "bjaaland", "blackwidow", "die blinde kuh", "bloodhound", "borg-bot", "bright.net caching robot", "bspider", "cactvs chemistry spider", "calif", "cassandra", "digimarc marcspider/cgi", "checkbot", "christcrawler.com", "churl", "cienciaficcion.net", "cmc/0.01", "collective", "combine system", "conceptbot", "coolbot", "web core / roots", "xyleme robot", "internet cruiser robot", "cusco", "cyberspyder link test", "deweb(c) katalog/index", "dienstspider", "digger", "digital integrity robot", "direct hit grabber", "dnabot", "download express", "dragonbot", "dwcp (dridus' web cataloging project)", "e-collector", "ebiness", "eit link verifier robot", "elfinbot", "emacs-w3 search engine", "ananzi", "esther", "evliya celebi", "nzexplorer", "fastcrawler", "fluid dynamics search engine robot", "felix ide", "wild ferret", "web hopper", "fetchrover", "fido", "hamahakki", "kit-fireball", "fish search", "fouineur", "robot francoroute", "freecrawl", "funnelweb", "gammaspider", "focusedcrawler", "gazz", "gcreep", "getbot", "geturl", "golem", "googlebot", "grapnel/0.01 experiment", "griffon", "gromit", "northern light gulliver", "gulper bot", "hambot", "harvest", "havindex", "html index", "hometown spider pro", "wired digital", "dig", "htmlgobble", "hyper-decontextualizer", "iajabot", "ibm_planetwide", "popular iconoclast", "ingrid", "imagelock", "informant", "infoseek robot 1.0", "infoseek sidewinder", "infospiders", "inspector web", "intelliagent", "i, robot", "israeli-search", "javabee", "jbot java web robot", "jcrawler", "jobo java web robot", "jobot", "joebot", "jumpstation", 
"image.kapsi.net", "katipo", "kdd-explorer", "kilroy", "ko_yappo_robot", "labelgrabber", "larbin", "legs", "link validator", "linkscan", "linkwalker", "lockon", "logo.gif crawler", "lycos", "mac wwwworm", "magpie", "marvin/infoseek", "mattie", "mediafox", "merzscope", "nec-meshexplorer", "mindcrawler", "mnogosearch", "momspider", "monster", "motor", "muncher", "ferret", "mwd.search", "internet shinchakubin", "netcarta webmap engine", "netmechanic", "netscoop", "newscan-online", "nhse web forager", "nomad", "the northstar robot", "occam", "hku www octopus", "openfind data gatherer", "orb search", "pack rat", "pageboy", "parasite", "patric", "pegasus", "the peregrinator", "perlcrawler 1.0", "phantom", "phpdig", "piltdownman", "pimptrain.com", "pioneer", "html_analyzer", "portal juice spider", "pgp key agent", "plumtreewebaccessor", "poppi", "portalb spider", "psbot", "getterroboplus puu", "the python robot", "raven search", "rbse spider", "resume robot", "roadhouse", "road runner", "robbie", "computingsite robi/1.0", "robocrawl spider", "robofox", "robozilla", "roverbot", "rules", "safetynet robot", "search.aus-au.com", "sleek", "searchprocess", "senrigan", "sg-scout", "shagseeker", "shai'hulud", "sift", "simmany", "site valet", "open text", "sitetech-rover", "skymob.com", "slcrawler", "smart spider", "snooper", "solbot", "speedy spider", "spider_monkey", "spiderbot", "spiderline", "spiderman", "spiderview", "spry wizard robot", "site searcher", "suke", "suntek", "sven", "tach black widow", "tarantula", "tarspider", "tcl w3", "techbot", "templeton", "teomatechnologies", "titin", "titan", "tkwww", "tlspider", "ucsd", "udmsearch", "url check", "url spider pro", "valkyrie", "verticrawl", "victoria", "vision-search", "voyager", "vwbot", "the nwi robot", "w3m2", "wallpaper", "the world wide web wanderer", "w@pspider", "webbandit", "webcatcher", "webcopy", "webfoot", "weblayers", "weblinker", "webmirror", "moose", "webquest", "digimarc marcspider", "webreaper", "webs", "websnarf", "webspider", "webvac", "webwalk", "webwalker", "webwatch", "wget", "whatuseek winona", "whowhere", "weblog monitor", "w3mir", "webstolperer", "web wombat", "the world wide web worm", "wwwc", "webzinger", "xget", "nederland.zoek", "mantraagent", "moget", "t-h-u-n-d-e-r-s-t-o-n-e", "muscatferret", "voilabot", "sleek spider", "kit_fireball", "semanticdiscovery/0.1", "inktomisearch.com ", "webcrawler");
    
    
    
    // Get the user agent and force it to lower case just once
    $useragent = strtolower(getenv("HTTP_USER_AGENT"));

    foreach ($spiders as $Val) {
        // The user agent string is the haystack and the spider name is the needle
        if (strpos($useragent, $Val) !== false) {
            // Found a spider -- kill the sid/sess.
            // Edit out one of these as necessary depending upon your version of html_output.php
            // $sess = NULL;
            $sid = NULL;
            break;
        }
    }

    // End spider stopper code
    
    PHP:
     
    selectsplat, Mar 20, 2007 IP
  19. selectsplat

    selectsplat Well-Known Member

    Messages:
    2,559
    Likes Received:
    121
    Best Answers:
    0
    Trophy Points:
    190
    #19
    As for the 404 idea: you can use the .htaccess file to remove the SID from the URL once Google has re-visited, and that may help Google realize that the URL is a duplicate, but you can't stop them from visiting the old URLs with the SIDs attached that they have already collected. And it's the actual visit that is costing you bandwidth.

    So there's really not much you can do except make sure it doesn't happen again, and wait it out.
     
    selectsplat, Mar 20, 2007 IP