Can we get some advice on how we are attempting to do our sitemaps?

Our mod_rewrite rules in .htaccess:

RewriteEngine on
#RewriteRule ^~(.*)$ /profile.php?profile=$1 [L]
RewriteRule ^([a-zA-Z0-9]*)\.html$ detail.php?siteid=$1

Our site works like this: http://www.swapshop.co.nz/classifieds2/detail.php?siteid=1295. We want to remove the ?siteid=1295 and make it 1295.html.

Our script creates four files:

Rewritten:
http://www.swapshop.co.nz/sitemap.txt
http://www.swapshop.co.nz/sitemap.xml

Original:
http://www.swapshop.co.nz/sitemap2.txt
http://www.swapshop.co.nz/sitemap2.xml

So, if mod_rewrite is working, then http://www.swapshop.co.nz/classifieds2/detail.php?siteid=1295 should become http://www.swapshop.co.nz/classifieds2/1295.html — which it does. Example here: www.swapshop.co.nz/classifieds2/1295.html

At the moment we have the following sitemaps:

sitemap.txt - rewritten txt file
sitemap.xml - rewritten XML file
sitemap2.txt - original txt file
sitemap2.xml - original XML file

We currently submit sitemap.xml, the rewritten XML file. Can someone check this to make sure we have it right? I.e., is the format OK? Google was complaining about a blank line at the start, which has now been removed. sitemap.xml is the complete list of current adverts, as this is a classifieds site. We had an issue where the host rewrote our .htaccess to a standard one and we lost a lot of hits from Google; Google Sitemaps reported this issue.

Question: is it better to submit a txt or an XML file?

The idea of the mod_rewrite from detail.php?siteid=666 to 666.html is to remove the query string. We will get it to retrieve the advert category instead of the advert number once we work this issue out.

Our issue: checking our web stats, I see a huge number of direct hits to detail.php without the query string. See here: http://www.swapshop.co.nz/classifieds2/detail.php. That page just tells us no query was listed. It only had contact email details, so if an engine hit this URL it would get lost.
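For reference, a minimal sitemap sketch (using the current sitemaps.org 0.9 namespace; the URL is one from the post, and real sitemaps would list every advert). The key point behind Google's "blank line" complaint is that the XML declaration must be the very first thing in the file, with nothing before it:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://www.swapshop.co.nz/classifieds2/1295.html</loc>
  </url>
</urlset>
```

Also make sure the file is saved without a byte-order mark, since a BOM before `<?xml` triggers the same kind of parse complaint as a blank line.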
We have now put a link to the site there to allow the engines to continue. Is it possible to put the sitemap there, whether sitemap.xml or an HTML map?

Other things we do:

RewriteRule ([a-zA-Z0-9]*)\.htm$ http://www.swapshop.co.nz/classifieds2/detail.php?siteid=$1
RewriteCond %{HTTP_HOST} ^swapshop\.co\.nz
RewriteRule ^(.*)$ http://www.swapshop.co.nz/$1 [R=301,L]

robots.txt blocks Google, and only Google, from detail.php; Google gets the info from the submitted sitemap instead.

detail.php and index.php strip sessions out:

<?php
// Start a session only if the user agent is *not* a search engine.
$searchengines = array("Google", "Fast", "Slurp", "Ink", "Atomz",
                       "Scooter", "Crawler", "MSNbot", "Poodle", "Genius");
$is_search_engine = 0;
foreach ($searchengines as $val) {
    //if (strstr("$HTTP_USER_AGENT", $val)) {
    if (strstr($_SERVER['HTTP_USER_AGENT'], $val)) {
        $is_search_engine++;
    }
}
if ($is_search_engine == 0) {
    // Visitor is not a search engine - start the session.
    ini_set("session.save_handler", "files");
    session_start();
    // Anything else that needs to be hidden from search engines goes here.
} else {
    // Visitor is a search engine - put anything you want
    // only a search engine to see in here.
}
?>

Thanks in advance
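Reading the robots.txt description literally (Googlebot, and only Googlebot, is kept off detail.php because it gets those pages via the sitemap), a minimal sketch might look like this — the path is the one from the thread, and the exact intent is our assumption:

```
User-agent: Googlebot
Disallow: /classifieds2/detail.php

User-agent: *
Disallow:
```

Note that a more specific `User-agent` group replaces the `*` group for that bot rather than adding to it, so any rules other crawlers should also obey must be repeated in the `Googlebot` group.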
This is a confusing post to follow, but the one thing I see is that you are rewriting your URLs to hide the ugly version and replace it with a prettier one... In your mod_rewrite rules you keep using [L], which forces the client (browser, Google) to change the URL... This works the way you intend when switching your domain from the non-www version to the www version, but it's nearly pointless to 'hide' ugly URLs and then add [L] to the rewrite rules... it sets the URL back to the ugly version. This may explain why Google has visited the ugly-versioned URL so often.

Perhaps a better way to get around your session IDs is to send them a sitemap without the session IDs in the URLs.

It looks as if your script at the end is set up to show different content to SEs than to human visitors, which may get you blacklisted. Also, your script checks against 'Google', but some variations of 'googlebot' use a lowercase g, and your script will miss them.

AutoMapIt lets you set up session IDs as an 'ignore' parameter... The URL will be listed in your sitemap, but the session ID will be removed from your query string.

Forgive me if I misunderstood what you are asking, but I was confused reading this at times.
I've done something similar on my web site. The trick is to make all the old URLs completely disappear from the site. Do this by redirecting any requests for detail.php to the rewritten URLs. Note that RewriteRule never matches the query string, so you need a RewriteCond to capture the siteid:

RewriteCond %{QUERY_STRING} ^siteid=([a-zA-Z0-9]+)$
RewriteRule ^detail\.php$ http://www_domain_com/path/%1.html? [R=301,L]

(The trailing ? drops the old query string from the redirect target.)

The 301 redirect is important to tell the bots that the page has moved permanently. They will gradually get the hint. You could return a 404 error for requests for detail.php without a siteid. I would dump your old sitemaps as well as your txt ones. Hope this helps... Cryo.
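Following up on the 404 suggestion, a minimal sketch assuming Apache 2.x, where the R flag accepts status codes outside the redirect range (the path matches the one in the thread):

```
# In /classifieds2/.htaccess: if detail.php is requested with no
# query string at all, answer 404 instead of serving the fallback page.
RewriteEngine on
RewriteCond %{QUERY_STRING} ^$
RewriteRule ^detail\.php$ - [R=404,L]
```

The `-` substitution means "leave the URL unchanged", and R=404 makes Apache serve its normal 404 response (including any ErrorDocument you have configured) instead of the page.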
I was wondering where this thread went... never mind my last comments, I was completely wrong about [L] versus R=301. That was a drunken post that has ached in the back of my head since I left it. Sure wish the edit button on that post still worked...