1. CPU troubleshooting

Many of us in the Linux world have been bitten by nasty hardware problems. How many of us have set up a Linux box, installed our favorite distribution, compiled and installed some additional apps, and gotten everything working perfectly only to find that our new system has an (argh!) fatal hardware bug? Whether the symptoms are random segmentation faults, data corruption, hard locks, or lost data is irrelevant -- the hardware glitch effectively makes our normally reliable Linux operating system barely able to stay afloat. In this article, we'll take an in-depth look at how to detect flaky CPUs and RAM -- allowing you to replace the defective parts before they do some serious damage.

If you're experiencing instability problems and suspect they are hardware related, I encourage you to test both your CPU and memory to ensure that they're working OK. However, even if you haven't experienced these problems, it's still a good idea to perform these CPU and memory tests. In doing so, you may detect a hardware problem that could have bitten you at an inopportune time, something that could have caused data loss or hours of frustration in a frantic search for the source of the problem. The proper, proactive application of these techniques can help you to avoid a lot of headaches, and if your system passes the tests, you'll have the peace of mind that your system is up to spec.

CPU issues

If you have a horribly defective CPU, your machine may be unable to boot Linux or may only run for a few minutes before locking up. CPUs in this ragged state are easy to diagnose as defective because the symptoms are so obvious. But there are more subtle CPU defects that aren't so easy to detect; generally, the less obvious errors are the ones that cause machines to either lock up every now and then for no apparent reason, or cause certain processes to die unexpectedly.
Most CPU instabilities can be triggered by "exercising" the CPU -- giving it a bunch of work to do, causing it to heat up and possibly flake out. Let's look at some ways to stress-test the CPU. You may be surprised to hear that one of the best tests of CPU stability is built in to Linux -- the kernel compile. The gcc compiler is a great tool for testing general CPU stability, and a kernel build uses gcc a whole lot. By creating and running the following script from your /usr/src/linux directory, you can give your machine an industrial-strength kernel compile stress test:

Code Listing 1.1: The cpubuild script

    #!/bin/bash
    # Generate the kernel dependency information once, up front.
    make dep
    # Rebuild the kernel over and over until a build fails.
    while true
    do
        make clean
        # Adjust -j to the number of CPUs plus one.
        if ! make -j2 bzImage
        then
            echo OUCH OUCH OUCH OUCH
            exit 1
        fi
    done

You'll notice that this script repeatedly compiles the kernel. The reason for this is simple -- some CPUs have intermittent glitches, allowing them to compile the kernel perfectly 95% of the time, but causing the kernel compile to bomb out every now and then. Often, this is because it may take five or more kernel compiles before the processor heats up to the point where it becomes unstable.

In the above script, make sure to adjust the -j option so that the number following it is one greater than the number of CPUs in your system; in other words, use "2" for uniprocessors, "3" for dual-processors, etc. The -j option tells make to build the kernel in parallel, so that there's always at least one gcc process on deck after each source file is compiled, which keeps the stress on your CPU as high as possible. If your Linux box is going to be unused for the afternoon, go ahead and run this script, and let the machine recompile the kernel for a few hours.

Possible CPU problems

If the script runs perfectly for several hours, congratulations! Your CPU has passed the first test. However, it's possible that the above script dies unexpectedly. How do you know you're having a CPU problem as opposed to something else?
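As a side note on the -j rule described above (one more than the number of CPUs), the value can be computed instead of hard-coded. The following is only a minimal sketch: the make_jobs helper is a hypothetical name that is not part of the cpubuild script, and it assumes that either nproc (GNU coreutils) or getconf is available on your system.

```shell
#!/bin/bash
# Hypothetical helper, not part of the original cpubuild script:
# print the -j value suggested in the text (number of CPUs plus one).
make_jobs() {
    local cpus
    # Assumption: nproc or getconf is installed; otherwise fall back to 1 CPU.
    cpus=$(nproc 2>/dev/null || getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1)
    echo $((cpus + 1))
}

# Show the make invocation that the loop in cpubuild could use.
echo "make -j$(make_jobs) bzImage"
```

On a uniprocessor machine this prints the same "make -j2 bzImage" line used in the script; on larger machines the number scales with the CPU count.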
Well, if gcc spat out an error like this, then there's a very good possibility that your CPU is defective: