I'd like to be able to plug in my site and have a tool find errors, like 404s, but most importantly tell me HOW it got to the link — which page the bad link is on — so I can track these errors down.
You can always set up a custom 404 page. I do this and it saves a lot of visitors. Here's a guide: http://www.thesitewizard.com/archive/custom404.shtml
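If your site runs on Apache, the usual way to wire up a custom 404 page is an ErrorDocument directive in your .htaccess (or server config). A minimal sketch — the page path /errors/404.html is just an example, use whatever your custom page is called:

```apache
# Serve a friendly custom page instead of the default Apache 404.
# /errors/404.html is an illustrative path; point it at your own page.
ErrorDocument 404 /errors/404.html
```

Use a site-root-relative path (starting with /) rather than a full URL, so the visitor still gets a real 404 status code instead of a redirect.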
To do a quick check by hand, go to the Start button and choose the Run command, then type cmd. When the command window pops up, type ping followed by the site name (e.g. ping example.com) and you'll see whether the name resolves and the server answers, or whether you've entered the name of the website wrongly.
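The same first step that ping performs — checking whether the name resolves at all — can be scripted if you have a list of hostnames to check. A quick sketch in Python (the hostnames in the demo are placeholders):

```python
import socket

def resolves(hostname: str) -> bool:
    """Return True if DNS (or the hosts file) can resolve the hostname."""
    try:
        socket.gethostbyname(hostname)
        return True
    except socket.gaierror:
        # Name lookup failed: typo, dead domain, or no DNS available.
        return False

if __name__ == "__main__":
    # "localhost" should always resolve; the .invalid TLD is reserved
    # (RFC 2606) and is guaranteed never to resolve.
    for name in ("localhost", "no-such-host.invalid"):
        print(name, "->", resolves(name))
```

Note this only tells you the name exists, not that any particular page on the site does — for that you still need a spider or the server logs.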
I'm trying to flush out the errors on my site. I'm getting 404 errors on my site map and Google is crawling some of them, but I can't figure out where the crawler GOES to reach those URLs — i.e. which page the broken link is sitting on.
Any GET that results in an error should leave a line in the error log, e.g. /var/log/apache2/error.log. It will look something like this:

```
[Sat Dec 13 23:04:48 2008] [error] [client 192.168.1.47] File does not exist: /home/gt/public_html/some.html, referer: http://koko/~gt/test.html
```

From there you can extract the bad link address and the page it is on.

You could also spider your site. Use the utility wget — see wget for Windows, or use your Linux package manager. From the command line, enter:

```
$ wget --spider -r http://mysite.com/
```

I made a local test file for demo purposes. There are two links, one good, one not:

```
gt@aretha:~$ wget --spider -r http://koko/~gt/test.html
Spider mode enabled. Check if remote file exists.
--2008-12-13 23:34:47--  http://koko/~gt/test.html
Resolving koko... 192.168.1.10
Connecting to koko|192.168.1.10|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 722 [text/html]
Remote file exists and could contain links to other resources -- retrieving.

--2008-12-13 23:34:47--  http://koko/~gt/test.html
Reusing existing connection to koko:80.
HTTP request sent, awaiting response... 200 OK
Length: 722 [text/html]
Saving to: `koko/~gt/test.html'

100%[======================================>] 722         --.-K/s   in 0s

2008-12-13 23:34:47 (93.9 MB/s) - `koko/~gt/test.html' saved [722/722]

Loading robots.txt; please ignore errors.
--2008-12-13 23:34:47--  http://koko/robots.txt
Reusing existing connection to koko:80.
HTTP request sent, awaiting response... 404 Not Found
2008-12-13 23:34:47 ERROR 404: Not Found.

Removing koko/~gt/test.html.

Spider mode enabled. Check if remote file exists.
--2008-12-13 23:34:47--  http://koko/~gt/some.html
Reusing existing connection to koko:80.
HTTP request sent, awaiting response... 404 Not Found
Remote file does not exist -- broken link!!!

Spider mode enabled. Check if remote file exists.
--2008-12-13 23:34:47--  http://koko/~gt/new.html
Connecting to koko|192.168.1.10|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 463 [text/html]
Remote file exists and could contain links to other resources -- retrieving.

--2008-12-13 23:34:47--  http://koko/~gt/new.html
Reusing existing connection to koko:80.
HTTP request sent, awaiting response... 200 OK
Length: 463 [text/html]
Saving to: `koko/~gt/new.html'

100%[======================================>] 463         --.-K/s   in 0s

2008-12-13 23:34:47 (131 MB/s) - `koko/~gt/new.html' saved [463/463]

Removing koko/~gt/new.html.

Found 1 broken link.

http://koko/~gt/some.html

FINISHED --2008-12-13 23:34:47--
Downloaded: 2 files, 1.2K in 0s (105 MB/s)
gt@aretha:~$
```

I don't know how Google uses the sitemap.xml. Assuming your sitemap looks something like this:

```
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://gtwebdev.com/</loc>
    <priority>1.0</priority>
  </url>
  ...
</urlset>
```

make a working copy of the unzipped XML file, then run a couple of find/replace operations so that each <loc> line,

```
<loc>http://gtwebdev.com/</loc>
```

looks like this:

```
<a href="http://gtwebdev.com/">xxx</a>
```

Then run wget again with different options, pointing -i at the working copy you just edited (sitemap-links.html here, or whatever you named it) rather than at the original XML — --force-html tells wget to treat the input file as HTML and follow the href links in it:

```
$ wget --spider --force-html -i sitemap-links.html
```

cheers,

gary
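If the error log is large, grepping out the broken-link/referer pairs by eye gets tedious. A short script can do it — a sketch in Python, assuming the Apache 2.2-style "File does not exist: ..., referer: ..." lines shown above:

```python
import re

# Matches Apache error-log lines like:
# [Sat Dec 13 23:04:48 2008] [error] [client 192.168.1.47] File does not
#   exist: /home/gt/public_html/some.html, referer: http://koko/~gt/test.html
PATTERN = re.compile(r"File does not exist: (?P<path>[^,]+), referer: (?P<referer>\S+)")

def broken_links(log_lines):
    """Yield (missing_path, referring_page) pairs from error-log lines."""
    for line in log_lines:
        m = PATTERN.search(line)
        if m:
            yield m.group("path"), m.group("referer")

if __name__ == "__main__":
    # Inline sample for demo; in practice: open("/var/log/apache2/error.log")
    sample = [
        "[Sat Dec 13 23:04:48 2008] [error] [client 192.168.1.47] "
        "File does not exist: /home/gt/public_html/some.html, "
        "referer: http://koko/~gt/test.html",
    ]
    for path, referer in broken_links(sample):
        print(f"{path} is linked from {referer}")
```

Each pair tells you exactly what the OP asked for: the missing file, and the page that links to it.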
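The loc-to-href rewrite can also be done with a tiny script instead of hand-run find/replace — a sketch in Python using the standard library's XML parser (file names and output layout are just examples):

```python
import xml.etree.ElementTree as ET

# Sitemaps live in this XML namespace, so element lookups must include it.
NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def sitemap_to_html(xml_text: str) -> str:
    """Turn a sitemap's <loc> URLs into a plain HTML page of links,
    suitable for crawling with: wget --spider --force-html -i <file>"""
    root = ET.fromstring(xml_text)
    links = [
        f'<a href="{loc.text.strip()}">link</a>'
        for loc in root.iter(NS + "loc")
    ]
    return "<html><body>\n" + "\n".join(links) + "\n</body></html>"

if __name__ == "__main__":
    # Inline demo; in practice read sitemap.xml and write the result
    # out to a working copy for wget to crawl.
    demo = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://gtwebdev.com/</loc><priority>1.0</priority></url>
</urlset>"""
    print(sitemap_to_html(demo))
```

Parsing the XML properly sidesteps the edge cases a raw find/replace can trip over (extra whitespace inside <loc>, the other child tags like <priority>).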