Overview of Site Indexing issues

Hello there! I would appreciate any help you could give me on the following issues with a client site: www.kansassampler.com

Background: dynamic site, over 8,000 URLs. Unfortunately, until a few weeks ago, their system uploaded a new set of URLs each day. In other words, for the majority of the site’s product and category/subcategory pages, the URL query string would change every day. Why this was done, I have no clue. As of last week, they have resolved the issue and fixed the URLs so they don’t change.

Example of the problem they were/are having: when you go into the Google index and find one of their pages, one of two things will happen when you click on the listing: (1) you get an error page saying ‘id #1234 doesn’t exist’, or (2) you are taken to a page that has nothing to do with the cached page.

For example, run the query site:www.kansassampler.com and view the listing that has the title “Beverages” (on the first page, listing #9 as of today, 8/10/2007). If you view the URL, you can see right away there is an issue: the URL is for some category having to do with gifts, books and CDs (http://www.kansassampler.com/shopdisplayproducts.asp?id=205&cat=Gifts+%2FBooks+%26+CDs). If you click on the ‘cached’ link for this listing, Google indicates the following: “This is Google's cache of http://www.kansassampler.com/shopdisplayproducts.asp?id=205&cat=Gifts+%2FBooks+%26+CDs as retrieved on Jul 25, 2007 01:01:04 GMT.” But the cached content looks correct, showing various beverages. Other URLs seem hit or miss. Some look okay, others don’t.

The client is asking how to get rid of the bad URLs in the index. I told them the following options: (1) return a 404 error page for any URLs that don’t work any more, or a 301 redirect to the new URL. Their problem with that is that they don’t have any record of what the URLs were before they fixed them!
Right now, if you check the headers when you navigate to the following old/dead URL, you will see that they are doing a 302: www.kansassampler.com/shopexd.asp?id=6044

How we have tried to fix the problem so far:

#1 – fixed the URLs so they don’t change!
#2 – created an XML site map file and submitted it to the Google and Yahoo webmaster accounts (having an issue with category XML file validation, see below please)
#3 – will wait to see if all are working, then start on SEO of the pages (which are a mess, to be sure)

Google site map XML validation issues with category URLs

I was able to quickly create an XML site map for the product URLs using Google’s Python tool: over 6,000 product URLs in one XML file. The product URL structure was fairly simple: http://www.kansassampler.com/shopexd.asp?id=142219. However, the category URLs have multiple query parameters and even some bizarre ASCII characters, which I believe keep causing the Python script, and any XML validation I run on the script’s configuration XML file, to throw an error, something to the effect of:

Error: Expected ; after entity name, but got = in unnamed entity at line 61 char 78.

I have attached the config file to this posting, but for those who are interested, line 61 contains the one and only category URL I keep testing:

<url href="http://www.kansassampler.com/shopdisplaycategories.asp?id=2&cat=KU+Collection" />

When I remove the ‘+Collection’ part of the URL, I still get the same error:

<url href="http://www.kansassampler.com/shopdisplaycategories.asp?id=2&cat=KU" />

When I truncate the URL all the way down to simply ?id=2, everything works. I can’t believe I can’t generate valid XML with a complex URL???

What I would love your help with:

Am I taking the right steps to fix the URL indexing problem? I fixed the URLs, then submitted XML site maps. Any other thoughts?
How do I remove the old URLs that are still in the index? Do we just wait?
Is my idea of generating 404 errors correct?
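A note on the XML error quoted above: the message "Expected ; after entity name, but got =" is what an XML parser reports when it meets a raw & and tries to read what follows (here, cat) as an entity name. In XML, a literal & inside an attribute value has to be written as &amp;, which would also explain why the ?id=2 version, with no &, validates. Below is a minimal sketch in Python of escaping the category URL before it goes into the config file. The helper name url_element is made up for illustration; only the standard library is used, and I have not run this against the sitemap tool itself.

```python
from xml.sax.saxutils import quoteattr
import xml.etree.ElementTree as ET


def url_element(url):
    """Build a <url href="..." /> line with the URL escaped for XML.

    url_element is a hypothetical helper, not part of Google's tool.
    quoteattr() escapes &, < and > and wraps the value in quotes.
    """
    return "<url href=%s />" % quoteattr(url)


# The category URL from line 61 of the config file.
raw = "http://www.kansassampler.com/shopdisplaycategories.asp?id=2&cat=KU+Collection"

element = url_element(raw)
print(element)

# The escaped element parses, and the parser hands back the original URL.
parsed = ET.fromstring(element)
print(parsed.get("href") == raw)

# The unescaped version is rejected by the parser.
try:
    ET.fromstring('<url href="%s" />' % raw)
except ET.ParseError as err:
    print("unescaped URL fails to parse:", err)
```

The same escaping applies to every URL written into the generated site map, so any category URL containing & would need &amp; there as well.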
Thanks in advance for your help!!! jay from out of bounds