Server Hardware Testing and Burn-in – Detailed Stress Testing and Fault Detection on New Hardware

Go on, admit it, you’ve thought about it yourself. Wouldn’t it be satisfying to set your computer alight? Sadly, that is not what this article is about. Burn-in is the term for testing new managed server hardware for faults before it is put to use in a live environment. This is done by running stress-testing software for an extended period of time.

Whenever we get new server hardware, we always do a complete burn-in to ensure that it is up to our high standards. If the hardware fails at any point, we send it back to the supplier. The actual process is easy, although setting it up isn’t.

Memory

First, when the new server is turned on, we boot it from the network, which allows us to boot multiple machines at once without needing 20+ bootable disks. The first test we run is the well-known Memtest (you’ll find it through Google). It thoroughly checks the computer’s memory and runs for about one day.
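To give a feel for what a memory test does, here is a minimal sketch in Python. It is not how Memtest itself works: Memtest runs without an operating system and addresses physical memory directly, whereas this only exercises an ordinary buffer inside a running process. The function name `pattern_test` and the four bit patterns are our own choices for illustration.

```python
def pattern_test(size_bytes, patterns=(0x00, 0xFF, 0x55, 0xAA)):
    """Fill a buffer with each pattern, read it back, and return mismatching offsets."""
    buf = bytearray(size_bytes)
    bad_offsets = []
    for pattern in patterns:
        expected = bytes([pattern]) * size_bytes
        # Write phase: overwrite the whole buffer with the pattern byte.
        buf[:] = expected
        # Verify phase: a healthy buffer reads back exactly what was written.
        if buf != expected:
            bad_offsets.extend(i for i, b in enumerate(buf) if b != pattern)
    return bad_offsets


if __name__ == "__main__":
    faults = pattern_test(16 * 1024 * 1024)  # 16 MiB, an arbitrary size
    print("mismatching offsets:", len(faults))
```

The alternating patterns (0x55 and 0xAA) flip every bit between passes, which is the basic idea behind catching a bit that is stuck at one value.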

If the computer passes the Memtest, it is restarted and booted into a custom Red Hat kickstart install, which sets up a bare Red Hat environment and the Cerberus Test Control System, special software that runs numerous tests on all the hardware in the system.

CPU

Cerberus performs several tasks to test the CPU. It compiles the Linux kernel over and over again, solves complicated mathematical problems (how long does it take you to work out if 3214235409234472020393848453 is prime?), and runs some code specifically written to run the CPU at its hottest.
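As an illustration of the kind of mathematical workload involved, the sketch below keeps a CPU core busy with primality tests for a fixed time. This is not Cerberus code. The names `is_probable_prime` and `cpu_burn`, the fixed set of bases, and the starting number are all assumptions made for the example. It uses the Miller-Rabin test, which gives a proof only for smaller numbers; for a number as large as the one above, a "prime" answer is strong evidence rather than a guarantee.

```python
import time

# Small primes used both for trial division and as Miller-Rabin bases.
_BASES = (2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37)


def is_probable_prime(n):
    """Miller-Rabin test with fixed bases; False means definitely composite."""
    if n < 2:
        return False
    for p in _BASES:
        if n % p == 0:
            return n == p
    # Write n - 1 as d * 2**r with d odd.
    d, r = n - 1, 0
    while d % 2 == 0:
        d //= 2
        r += 1
    for a in _BASES:
        x = pow(a, d, n)
        if x in (1, n - 1):
            continue
        for _ in range(r - 1):
            x = pow(x, 2, n)
            if x == n - 1:
                break
        else:
            return False  # base a is a witness that n is composite
    return True


def cpu_burn(seconds, start=10**12 + 1):
    """Test consecutive odd numbers until time runs out; return how many were tested."""
    deadline = time.monotonic() + seconds
    n, tested = start, 0
    while time.monotonic() < deadline:
        is_probable_prime(n)
        n += 2
        tested += 1
    return tested


if __name__ == "__main__":
    print("numbers tested in 5 seconds:", cpu_burn(5.0))
```

A single Python process only loads one core, so a real stress run would start one such worker per core and leave them running for far longer than a few seconds.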

Hard Drive

Cerberus writes large volumes of data to the hard drives over and over again to ensure that the drive platters are functional. It also deletes and moves files, and checks the disks for errors.
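The write, move, check and delete cycle can be sketched roughly as follows. Again, this is our own illustration and not the Cerberus implementation: the function name `write_verify_cycle`, the file names, and the default sizes are hypothetical. Note also that the read-back here will often be served from the operating system’s cache and not from the platters, so a real disk test has to write far more data than the machine has memory, or bypass the cache.

```python
import hashlib
import os
import tempfile


def write_verify_cycle(directory, size_bytes=1 << 20, cycles=3):
    """Write random data, move it, read it back, compare checksums.

    Returns the number of cycles whose data did not read back intact.
    """
    failures = 0
    for i in range(cycles):
        data = os.urandom(size_bytes)
        expected = hashlib.sha256(data).hexdigest()
        path = os.path.join(directory, f"burnin_{i}.bin")
        with open(path, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())  # ask the OS to push the data to the device
        moved = path + ".moved"
        os.replace(path, moved)  # exercise a file move as well
        with open(moved, "rb") as f:
            actual = hashlib.sha256(f.read()).hexdigest()
        if actual != expected:
            failures += 1
        os.remove(moved)  # delete, so the next cycle starts clean
    return failures


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as scratch:
        print("failed cycles:", write_verify_cycle(scratch))
```

Comparing checksums, and not just checking that the write call succeeded, is what catches a drive that accepts data without complaint but returns something different later.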

If after a week the server is still running (and not smoking) and hasn’t crashed, it is considered good enough for use as a production machine. If it fails the tests anywhere along the way, it is packed up and returned to be replaced. Web servers that have survived this process will certainly survive anything you can throw at them.

You would normally expect this level of testing to be completed by the hardware manufacturers, so these tests shouldn’t show up any faults. In our experience of testing hundreds of machines, however, we do regularly find faults, and we do send components back.

The reason it is so important to perform this level of testing on computers that will be used as servers is that the uptime demands are so high. The slightest faults will cause outages and downtime. Once a web server is deployed, never again will you have the opportunity to take it offline and perform such detailed testing. Even if it were to crash, there is always a demand that it be put back online as quickly as possible, not left offline whilst thorough diagnostics are completed.


Author: Patrick Kelso