Server Recovery Secrets of the Pros
Boy, when things go wrong, they go wrong fast. When a server
starts to come unraveled, it falls apart pretty quickly. One
minute the server’s humming along, and the next you’re up to
your elbows trying to get it to even respond to a
Ctrl-Alt-Del.
Most server recovery is focused on hard drive recovery. That’s
fair enough, since a hard drive crash is the most common cause of
server failure. But with all the attention paid to hard drives,
most people ignore failures of RAM, CPUs and motherboards, the
so-called green board failures.
Green board failures are typically caused by either power
problems or excess heat. Power problems are the silent killer.
Inadequate UPSs and failed power supplies never seem to give
warnings. A single overvoltage from the utility service, and the
damage is done without any indication that it has happened.
Excess heat is usually caused by a failed fan or a failed data
room cooling system. Occasionally a failing fan will make
high-pitched noises before it fails completely, but just as often
the fan will simply stop moving air. A failure of data room
cooling is much more serious because it means possible damage to
all servers in the room.
One thing all green board failures have in common is that they
are difficult to troubleshoot. With hard drive failures, it’s
usually pretty straightforward to troubleshoot the issue and
identify a course of action. With green board failures, it’s
always a murky mess.
Enter Memtest86
One tool that makes green board troubleshooting simpler is
Memtest86. Memtest86 was designed to test banks of memory. But
because of the tight link between memory, motherboards and CPUs,
Memtest86 ends up being an effective test of all three.
Memtest86 is a standalone memory test for x86 architecture
servers. It was originally designed to address the shortcomings
of BIOS-based memory tests. BIOS tests are largely superficial
and rarely identify anything other than catastrophic memory
failure.
Memtest86 testing is based on some pretty simple concepts. Memory
devices are composed of lots of memory cells packed tightly
together, so a marginal cell can be disturbed by activity in the
cells next to it. Finding subtle or intermittent errors means
writing information to one area of memory, then checking the
areas around it to see if they change. If nearby areas change,
then memory is failing.
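To make the idea concrete, here is a minimal Python sketch of a neighbor check. It is not Memtest86's actual code: a script running under an operating system cannot address physical RAM directly, so a small simulated memory stands in for the real thing. The `FaultyMemory` class, the `find_disturbed_cells` function and the aggressor and victim addresses are all hypothetical names made up for this illustration.

```python
class FaultyMemory:
    """Simulated memory (an assumption of this sketch, not real RAM).

    Writing to the `aggressor` address flips one bit in the `victim`
    address, which models a cell that disturbs its neighbor.
    """

    def __init__(self, size, aggressor=None, victim=None):
        self.cells = bytearray(size)
        self.aggressor = aggressor
        self.victim = victim

    def write(self, addr, value):
        self.cells[addr] = value
        # Model the fault: a write to the aggressor corrupts the victim.
        if addr == self.aggressor and self.victim is not None:
            self.cells[self.victim] ^= 0x01

    def read(self, addr):
        return self.cells[addr]


def find_disturbed_cells(mem, size, background=0x00, pattern=0xFF):
    """Write `pattern` to each cell in turn and confirm that the cells on
    either side still hold `background`. Returns addresses that changed."""
    bad = set()
    for addr in range(size):
        mem.write(addr, background)
    for addr in range(size):
        mem.write(addr, pattern)
        for neighbor in (addr - 1, addr + 1):
            if 0 <= neighbor < size and mem.read(neighbor) != background:
                bad.add(neighbor)
        mem.write(addr, background)  # restore before moving on
    return sorted(bad)


mem = FaultyMemory(16, aggressor=5, victim=6)
print(find_disturbed_cells(mem, 16))  # prints [6]
```

A healthy simulated memory, `FaultyMemory(16)`, returns an empty list. Real testers vary the patterns and the order of writes across several tests, which this sketch does not attempt.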
Memtest86 has nine built-in tests, each designed to check
different attributes of memory. The simplest use of the program
is to run it and watch for errors. Errors are shown in flashing
red and clearly indicate green board failure.
Of course, knowing what to do with the error report can be
somewhat difficult. It’s complicated by the fact that
motherboard vendors often don’t make it easy to identify which
memory addresses correspond to which memory banks.
In general, there are three things you can do when an error is
reported: 1) remove banks of memory, 2) rotate banks of memory,
and 3) replace banks of memory. Usually, simple trial and error
will help you isolate which bank is the one causing you trouble.
One thing to keep in mind: it’s not uncommon for memory to be
incompatible with certain systems. Just because a particular
bank doesn’t work in one system doesn’t mean the bank is bad. You
might want to follow up and test the bank in another system as a
tiebreaker.
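The trial-and-error approach of removing banks can be sketched as a simple loop. This is only an illustration under stated assumptions: `passes_test` is a hypothetical callback that stands in for the manual step of booting Memtest86 with a given set of banks installed and watching for errors, and the bank names are made up.

```python
def find_suspect_banks(banks, passes_test):
    """Remove one bank at a time and retest with the rest installed.

    A bank is a suspect if the errors disappear when it, and only it,
    is pulled from the system.
    """
    if passes_test(banks):
        return []  # no errors with everything installed, nothing to isolate
    suspects = []
    for bank in banks:
        remaining = [b for b in banks if b != bank]
        if remaining and passes_test(remaining):
            suspects.append(bank)
    return suspects


# Simulated run: pretend the (hypothetical) bank "DIMM_B2" is the bad one.
installed = ["DIMM_A1", "DIMM_A2", "DIMM_B1", "DIMM_B2"]
print(find_suspect_banks(installed, lambda banks: "DIMM_B2" not in banks))
# prints ['DIMM_B2']
```

If every combination still fails, the list comes back empty. That points to more than one bad bank, or to a fault elsewhere on the motherboard or CPU. A suspect bank is still worth retesting in a second system before it is thrown away.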
Once started, Memtest86 will run until stopped. It automatically
runs tests 1 through 8 in order, then returns to the first. The
one test that requires manual selection is test 9, the so-called
‘Bit fade test’.
Bit Fade Test
The Bit fade test is an attempt to determine if memory will hold
its value. The test is quite simple: write something to an area
of memory, wait 90 minutes, then return and confirm the value is
still there. The test is run twice, and therefore takes 3 hours
from start to finish.
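The write, wait and verify cycle is easy to sketch in Python. As before, this is an illustration rather than Memtest86's implementation: it checks an ordinary buffer inside a running process, not physical RAM. The function names are hypothetical, and the choice of an all-zeros pass followed by an all-ones pass is an assumption of the sketch. With the default hold time of 90 minutes, two passes add up to 2 x 90 = 180 minutes, the 3 hours mentioned above.

```python
import time


def bit_fade_pass(memory, pattern, hold_seconds, sleep=time.sleep):
    """Fill `memory` with `pattern`, wait, then list the offsets that changed."""
    for i in range(len(memory)):
        memory[i] = pattern
    sleep(hold_seconds)
    return [i for i, value in enumerate(memory) if value != pattern]


def bit_fade_test(memory, hold_seconds=90 * 60, patterns=(0x00, 0xFF),
                  sleep=time.sleep):
    """Run one pass per pattern and return every offset that lost its value."""
    faded = set()
    for pattern in patterns:
        faded.update(bit_fade_pass(memory, pattern, hold_seconds, sleep))
    return sorted(faded)


# Short demonstration with a tiny hold time instead of the full 90 minutes.
buffer = bytearray(1024)
print(bit_fade_test(buffer, hold_seconds=0.01))
```

The `sleep` parameter is there so the wait can be swapped out, for example to simulate a leaking bit without actually waiting.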
The Bit fade test is quite effective at finding memory that is
starting to go bad. If you are experiencing unexplained crashes
and feel you’ve eliminated most other causes, you might consider
taking a server down over the weekend and running this test.
It’s incredibly effective.
Carroll-Net Server Recovery Kit includes Memtest86+
Every copy of the Carroll-Net Server Recovery Kit includes
Memtest86+. During boot, you can activate the memory test by
typing ‘memtest’ at the boot prompt. The program will launch and
begin testing memory in less than 2 seconds.

- The top right corner displays the current status.
- Any errors detected are displayed in red in the center of the
  screen.
- You can stop testing anytime by pressing ESC, or by just
  turning the server off (it’s completely safe to turn off the
  server during testing).
You can download your free copy of the Carroll-Net Server
Recovery Kit with Memtest86+ at http://www.kleobackup.net