Hi. We're www.learnoutloud.com and we've done a mod_rewrite on just about all of our URLs. For a long time Google seemed to spider all our pages as the rewritten pages, but just recently it has taken to spidering our site as dynamic pages. We have Google Free Site Search on our site, so you can check it out by searching "7 Habits" in our search box; you'll see all dynamic results (and this is happening with most other searches). The only way you can really get dynamic pages on our site is by blocking cookies, which displays URLs with a PHP session ID appended: http://www.learnoutloud.com/productpage.php?cat=1&catid=&level=2&subcatid=129&id=15594&nav=B&PHPSESSID=764f891b9a8a3e01dfba6da28668bfe7 Some of the dynamic URLs in Google's results have this session ID, but some are just our normal dynamic links. I'm wondering how and why Google is spidering our site this way when we have done a mod_rewrite; we would like Google to index our rewritten pages, not the dynamic ones. We're making a sitemap to submit to Google, so hopefully that will help. I'm just worried that the reason not all of our site is being indexed is that Google is spidering it as dynamic. Any ideas as to why this might be happening?
Hi, Google spiders all URLs that it knows of. If Google finds a URL like /productpage.php?cat=1&catid=&level=2&...&PHPSESSID=764f891b9a8a3e01dfba6da28668bfe7 somewhere (on another site, in a forum, ...), it will visit it, and from there it will discover other URLs carrying the same session ID. A sitemap will not help with this. One way to avoid the problem is to disallow access to these pages in a robots.txt file:

User-agent: *
Disallow: /productpage.php?

This robots.txt disallows access to all URLs starting with /productpage.php?. Jean-Luc
If Google was indexing fine as dynamic, why did you change to static? I don't think it will have any SERP effect; it's all about whether the pages get indexed or not.
And for any of you with a similar problem, where Googlebot likes to spider your dynamic WordPress URLs instead of your mod_rewritten URLs:

User-agent: *
Disallow: /?p
Blocking Google from accessing the dynamic pages is not the solution. You should use .htaccess to redirect your dynamic URLs to the static URLs, so Google ends up at the right place. Problem solved.
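A minimal sketch of that kind of redirect, assuming Apache with mod_rewrite enabled. The static target pattern /product/<id>.html is hypothetical; adjust it to whatever your rewrite rules actually produce:

```apache
# Sketch: 301-redirect old dynamic product URLs to their static equivalents.
# Assumes mod_rewrite is available; the /product/<id>.html target is hypothetical.
RewriteEngine On

# Capture the numeric product id out of the query string ...
RewriteCond %{QUERY_STRING} (^|&)id=([0-9]+)
# ... and permanently redirect; the trailing "?" drops the old query string.
RewriteRule ^productpage\.php$ /product/%2.html? [R=301,L]
```

With a 301, Google transfers the old dynamic URLs to the new ones instead of dropping them, which is the point being made above.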
This is the wrong way to do it. If you block Googlebot from viewing the pages, it will never find out they have moved. The end result is a bunch of old, established pages removed from the index, and Google then has to re-spider the site to find the new pages.
Thanks for the replies. So we should not block Googlebot from these URLs the way Jean-Luc has suggested? We don't really want to redirect the dynamic URLs to rewritten ones unless we could redirect just the spiders. Currently, for people who block cookies, the site switches over to dynamic URLs so that we can put their PHP session ID in the URL; that way they can maintain their session, shop, and do other session-dependent activity on our site. I received another suggestion for keeping Google and other search engines off the dynamic URLs: "I would also recommend using a session killer for spiders, to prevent sessions being set for visiting spiders. You can do this by creating an array of all the known spider names, or parts of them, and running a check on the user agent. If the user agent is a match, then allow them to bypass the session, and of course show your rewritten URLs if that's how you want Google to index your web site." Do you think this would work?
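For what it's worth, the "session killer" idea sketched in that quote would look something like this in PHP. The spider list here is illustrative only, not exhaustive, and the matching is a simple substring check on the User-Agent header:

```php
<?php
// Sketch of the "session killer" suggestion: skip session_start() for
// known spiders so they never receive a PHPSESSID in any URL.
// The names below are examples; maintain your own list.
$spiders = array('googlebot', 'slurp', 'msnbot');

$agent = isset($_SERVER['HTTP_USER_AGENT'])
    ? strtolower($_SERVER['HTTP_USER_AGENT'])
    : '';

$is_spider = false;
foreach ($spiders as $name) {
    if (strpos($agent, $name) !== false) {
        $is_spider = true;
        break;
    }
}

if (!$is_spider) {
    // Only real visitors get a session (and, if they block cookies,
    // a session ID in the URL).
    session_start();
}
```

One caveat: serving spiders different behavior than users based on User-Agent is a form of cloaking if overdone, so keep the difference limited to the session handling.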
Why do you need to pass the session ID in the URL anyway? PHP allows session management and tracking without putting the session ID in the URL, and it makes it a whole lot simpler and safer to show the same content to your users and to the search engines. If Google decides to spider your site anonymously, it might index two copies of each page. The best way to do this is not to use session IDs in the URLs, and to redirect your old dynamic URLs to the new URLs using either .htaccess or PHP.
The only time we switch over to dynamic URLs and tack on the PHP session ID is when the user is browsing the site with cookies blocked. Otherwise they only get rewritten URLs with no PHP session ID, since the session is tracked through their session cookie. Is there a way to maintain the session of a user who has cookies blocked without putting it in the URL? We want users who block cookies to be able to shop our site and keep their shopping cart session.
PHP sessions can be tracked without client-side cookies and without putting the session ID in the URL. http://uk.php.net/session
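On the session-ID-in-URL point from earlier in the thread: PHP's session configuration controls this directly. A sketch of the relevant settings (they can equally be set in php.ini or a per-directory .htaccess), with the caveat that turning off trans_sid means cookie-blocking visitors will lose their session unless you track it some other way:

```php
<?php
// Sketch: stop PHP from ever appending PHPSESSID to URLs.
ini_set('session.use_trans_sid', 0);    // never rewrite links with the session ID
ini_set('session.use_only_cookies', 1); // ignore session IDs arriving in URLs
session_start();
```

With these set, a spider (or a user with cookies blocked) can never pick up or propagate a PHPSESSID URL, which removes the duplicate-URL problem at the source.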