I am working on cleaning up the URLs at my company. We have literally thousands of URLs out there. Is there a tool that will show all of them? Yahoo Site Explorer only shows 1k but I would like to have a gigantic list.
I don't know of any tools that do this directly (although I'm sure they exist), but you might want to get your hands on a "link verifier" that crawls your pages and checks the integrity of your links. I used one years ago that spit out every URL it crawled along the way. Good luck!
Have you tried typing site:yourdomain.com into Google? This will show you which pages they have in their index.
It will not show all of them, not for a site of this size. The site: operator is hugely buggy and unreliable, so forget it for large sites. Plus, there will be pages that Google has not indexed. What you need is a tool that will start from your home page, follow all the links to all the pages, and report the URLs. Luckily, there is such a tool, and it's free: Xenu's Link Sleuth. You can download it here: http://home.snafu.de/tilman/xenulink.html I don't know if it will find orphaned pages (ones that are not linked from the site). Since you are cleaning up URLs, I suggest you don't simply delete any of them. Use 301 redirects to other pages (to save your link juice and PR). Don't use any other redirects: only 301s are SEO-friendly. Also remember that URLs with www. and without www. are different URLs. I hope this helps!
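To illustrate the "start from your home page and follow all the links" idea, here is a minimal Python sketch of a single crawl step: given the HTML of one page, it lists the same-host URLs that page links to. This is not how Xenu works internally; the names LinkCollector and extract_site_links and the example.com URLs are made up for illustration. It compares host names exactly, so www. and non-www. links count as different hosts, in line with the point above.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag, urlsplit


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag in a page (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def extract_site_links(html, page_url):
    """Return the sorted, de-duplicated same-host URLs linked from one page."""
    collector = LinkCollector()
    collector.feed(html)
    host = urlsplit(page_url).netloc.lower()
    found = set()
    for href in collector.hrefs:
        # Resolve relative links against the page URL and drop any #fragment.
        absolute, _fragment = urldefrag(urljoin(page_url, href))
        parts = urlsplit(absolute)
        # Keep only http(s) links on exactly the same host as the page.
        if parts.scheme in ("http", "https") and parts.netloc.lower() == host:
            found.add(absolute)
    return sorted(found)
```

A full crawler would fetch each returned URL, run the same function on it, and keep a set of already-visited URLs so that it stops once no new ones turn up.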
Thanks very much. I do plan on 301-redirecting a bunch of the junk URLs to the more top-level domains. This is a mass cleanup of sorts.
You will want to use several sources and consolidate the data into a master list. Download Xenu's Link Sleuth for free. Enter your home page URL under Check URL and watch it run. Not only will it crawl your site and itemize all URLs, it will include all image URLs, JavaScript URLs, CSS URLs, etc. It will show you the value of the <title> of every page. It will show you the return status (200, 404, 301, 302, etc.), so it's very useful for detecting your redirects and broken links. Once it's finished you can export all of the data to a tab-separated file that you can load into Excel to work with. Use the SITE: command at all of the major engines and dump the results into Excel. If you have a web analytics package like Omniture Discover, then query your analytics to get a list of all unique URLs requested on your server(s) over the last X months (however far back you have data). Get a copy of your entire folder structure from your web server (assuming you're not using a CMS) and copy it to your hard drive. Go through it folder by folder to look for pages that can still be requested but might not be linked to from your existing site. I did this for a huge commercial PR7 site when we were doing a redesign and actually found 5 other very old versions of the site that were still on their servers, still accessible, and still indexed with the search engines because they still had inbound links. I redirected them all to the pages on the new site which most closely resembled the corresponding pages on the old versions of the site. Get all of this data into Excel, sort by URL, and write a little macro to eliminate duplicates. PS: If your URLs sometimes contain query string parameters then you'll want to know about those as well. Treat each version of a URL with different query string parameter combinations as a different URL/page because that is how the search engines will see them (and oftentimes they render different content on a site).
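If you would rather script the consolidation step than write an Excel macro, something like the following Python sketch could do the merge and de-duplication. It is only one possible approach, and the function names normalize and consolidate are made up for illustration. It assumes each source (crawler export, analytics dump, and so on) has already been read into a list of URL strings. It lowercases the scheme and host, treats an empty path as "/", and drops #fragments, but it deliberately keeps query strings and the www./non-www. distinction, so those still count as separate URLs.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize(url):
    """Lowercase scheme and host, drop the fragment, keep path and query."""
    parts = urlsplit(url.strip())
    # An empty path is treated as "/" so the bare host and host + "/" merge.
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(),
                       parts.path or "/", parts.query, ""))


def consolidate(*sources):
    """Merge several URL lists into one sorted list without duplicates."""
    merged = set()
    for source in sources:
        for url in source:
            if url.strip():  # skip blank lines from exports
                merged.add(normalize(url))
    return sorted(merged)
```

For example, consolidate(crawl_urls, analytics_urls) would return a single sorted master list in which /a.html?id=1 and /a.html?id=2 remain two separate entries.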