anyone know a good 404 finder for your site?

Discussion in 'HTML & Website Design' started by dimmakherbs, Dec 11, 2008.

  1. #1
    I'd like to be able to plug in my site and have it find errors, like 404s, but most importantly show HOW it got to each link, i.e. which page the bad link is on, so I can fix these errors.
     
    dimmakherbs, Dec 11, 2008 IP
  2. busman3000

    busman3000 Peon

    #2
    busman3000, Dec 12, 2008 IP
  3. kk5st

    kk5st Prominent Member

    #3
    kk5st, Dec 12, 2008 IP
  4. 2advance

    2advance Well-Known Member

    #4
    To do this, go to the Start button and choose Run, then type cmd. When the command window pops up, type ping followed by the website's domain name; that tells you whether the name resolves and the server responds, or whether you have mistyped the site's name.
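    For example (with example.com standing in for your own domain):
    ping example.com
    Code (markup):
    Bear in mind that ping only checks that the host name resolves and the machine answers; it won't tell you which individual pages return 404s.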
     
    2advance, Dec 13, 2008 IP
  5. dimmakherbs

    dimmakherbs Active Member

    #5
    I'm trying to flush out the errors in my site. I am getting 404 errors in my sitemap, and Google is crawling some of them, but I can't figure out where crawlers GO to get to those URLs, i.e. which pages the bad links sit on.
     
    dimmakherbs, Dec 13, 2008 IP
  6. kk5st

    kk5st Prominent Member

    #6
    Any GET that results in an error should leave a line in the server's error log, e.g. /var/log/apache2/error.log. It will look something like this:
    
    [Sat Dec 13 23:04:48 2008] [error] [client 192.168.1.47] File does not exist: /home/gt/public_html/some.html, referer: http://koko/~gt/test.html
    Code (markup):
    From there, you can extract the bad link address, and the page it is on.
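    For example, a quick way to pull those out from the command line (assuming the Apache log path above; adjust for your server):
    # list every missing-file hit together with its referer
    grep "File does not exist" /var/log/apache2/error.log

    # or count the most frequent offenders
    grep "File does not exist" /var/log/apache2/error.log \
      | awk -F'File does not exist: ' '{print $2}' | sort | uniq -c | sort -rn
    Code (markup):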

    You could also spider your site. Use the utility wget. See wget for Windows, or use your Linux package manager.

    From the command line, enter
    $ wget --spider -r http://mysite.com/
    Code (markup):
    I made a local test file for demo purposes. There are two links, one good, one not.
    gt@aretha:~$ wget --spider -r http://koko/~gt/test.html
    Spider mode enabled. Check if remote file exists.
    --2008-12-13 23:34:47--  http://koko/~gt/test.html
    Resolving koko... 192.168.1.10
    Connecting to koko|192.168.1.10|:80... connected.
    HTTP request sent, awaiting response... 200 OK
    Length: 722 [text/html]
    Remote file exists and could contain links to other resources -- retrieving.
    
    --2008-12-13 23:34:47--  http://koko/~gt/test.html
    Reusing existing connection to koko:80.
    HTTP request sent, awaiting response... 200 OK
    Length: 722 [text/html]
    Saving to: `koko/~gt/test.html'
    
    100%[======================================>] 722         --.-K/s   in 0s      
    
    2008-12-13 23:34:47 (93.9 MB/s) - `koko/~gt/test.html' saved [722/722]
    
    Loading robots.txt; please ignore errors.
    --2008-12-13 23:34:47--  http://koko/robots.txt
    Reusing existing connection to koko:80.
    HTTP request sent, awaiting response... 404 Not Found
    2008-12-13 23:34:47 ERROR 404: Not Found.
    
    Removing koko/~gt/test.html.
    
    Spider mode enabled. Check if remote file exists.
    --2008-12-13 23:34:47--  http://koko/~gt/some.html
    Reusing existing connection to koko:80.
    HTTP request sent, awaiting response... 404 Not Found
    Remote file does not exist -- broken link!!!
    
    Spider mode enabled. Check if remote file exists.
    --2008-12-13 23:34:47--  http://koko/~gt/new.html
    Connecting to koko|192.168.1.10|:80... connected.
    HTTP request sent, awaiting response... 200 OK
    Length: 463 [text/html]
    Remote file exists and could contain links to other resources -- retrieving.
    
    --2008-12-13 23:34:47--  http://koko/~gt/new.html
    Reusing existing connection to koko:80.
    HTTP request sent, awaiting response... 200 OK
    Length: 463 [text/html]
    Saving to: `koko/~gt/new.html'
    
    100%[======================================>] 463         --.-K/s   in 0s      
    
    2008-12-13 23:34:47 (131 MB/s) - `koko/~gt/new.html' saved [463/463]
    
    Removing koko/~gt/new.html.
    
    Found 1 broken link.
    
    http://koko/~gt/some.html
    
    FINISHED --2008-12-13 23:34:47--
    Downloaded: 2 files, 1.2K in 0s (105 MB/s)
    gt@aretha:~$ 
    Code (markup):
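    On a real site that output gets long. One way to skim it is to send wget's log to a file and grep out the failures (the exact wording of the messages can vary between wget versions):
    wget --spider -r -o wget.log http://mysite.com/
    grep -B3 "broken link" wget.log
    Code (markup):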
    I don't know how Google uses the sitemap.xml. Assuming your sitemap looks something like this:
    <?xml version="1.0" encoding="UTF-8"?>
    <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
      <url>
        <loc>http://gtwebdev.com/</loc>
        <priority>1.0</priority>
      </url>  
      
      ...
    
    </urlset>
    Code (markup):
    Make a working copy of the unzipped XML file. Run a couple of find/replace operations so that the <loc> lines,
     <loc>http://gtwebdev.com/</loc>
    Code (markup):
    look like this:
     <a href="http://gtwebdev.com/">xxx</a>
    Code (markup):
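    For instance, with GNU sed the whole find/replace can be done in one pass on the working copy (assuming it is named sitemap.xml):
    sed -i -e 's|<loc>|<a href="|' -e 's|</loc>|">xxx</a>|' sitemap.xml
    Code (markup):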
    Then run wget again with different options.
    wget --spider --force-html -i sitemap.xml
    Code (markup):
    cheers,

    gary
     
    kk5st, Dec 13, 2008 IP