Brian Tao asks:
BTW, could someone at NetApp tell us what exactly is done during their pre-ship burn-in testing?
I asked the folks in manufacturing what exactly we do, and here's what I learned. The actual burn-in testing for each system consists of three phases:
(1) ICT (In-Circuit Test)
Individual boards are tested in isolation -- before systems are assembled -- using special ICT test harnesses.
My understanding is that the ICT harness has special pins that directly touch traces on the board. This lets ICT test individual components to identify failures at a very low level. (This helps us to trace problems back to their root cause. Maybe a given lot of capacitors from a particular vendor is out of spec.)
(2) System Functional Testing (for 1 day)
Next, the systems are assembled per the customer's order and then tested at the system level.
First they go through a 1-day functional test that checks CPU, memory subsystem, I/O, NVRAM, and the storage subsystem. This testing includes our standard system diagnostics.
(3) Stress Testing (for 2 days)
After the functional testing, systems go into 2 days of stress testing. Originally we generated load using UNIX clients, but now we use two filers in a back-to-back configuration to test each other. They each generate a load simulating 40 clients.
All accessories (stand-alone drives, memory, NICs, etc.) are put into a filer and run through this same test process prior to shipment.
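
Just to make the "simulating 40 clients" idea concrete, here is a minimal sketch of the sort of load generator you could point at a mounted filer: a bunch of worker threads, each writing and reading files in a loop. The mount point, client count, block size, and run time are all illustrative assumptions on my part; this is not the actual tool our filers use to drive each other.

    #!/usr/bin/env python
    # Rough sketch of a multi-client load generator.  Everything here
    # (mount point, sizes, durations) is an illustrative assumption.

    import os
    import threading
    import time

    MOUNT_POINT = "/mnt/testfiler"   # hypothetical mount of the unit under test
    NUM_CLIENTS = 40                 # matches the "40 simulated clients" figure
    RUN_SECONDS = 60                 # the real stress test runs for two days
    BLOCK = b"x" * 8192              # 8 KB write blocks

    def client(client_id):
        """One simulated client: write a file, read it back, repeat."""
        path = os.path.join(MOUNT_POINT, "client%02d.dat" % client_id)
        deadline = time.time() + RUN_SECONDS
        while time.time() < deadline:
            with open(path, "wb") as f:
                for _ in range(128):          # ~1 MB written per pass
                    f.write(BLOCK)
            with open(path, "rb") as f:
                while f.read(65536):          # read it all back
                    pass

    if __name__ == "__main__":
        threads = [threading.Thread(target=client, args=(i,))
                   for i in range(NUM_CLIENTS)]
        for t in threads:
            t.start()
        for t in threads:
            t.join()
        print("load run complete")

In the real burn-in, of course, the load runs for the full two days rather than a minute.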
It's important to distinguish between pre-shipment burn-in and the Quality Assurance (QA) that we do as part of each new software or new hardware release.
The goal of QA is to ensure that the DESIGN is correct. The goal of burn-in is to ensure that the hardware is assembled correctly and that all components work. QA tends to focus on software coverage, while burn-in tends to focus on hardware component coverage.
So it is in QA that we try to think up nasty things to do to filers, much as you have been doing with your spare systems.
Our QA setup is actually very cool. We have a system called ANT (Automated Nightly Testing) that automatically builds our software every night, downloads it to a filer, and runs it through a series of functional tests. This gives us very quick feedback during development if something goes wrong. And of course, we're always trying to extend the automated tests, as we find clever new ways of breaking filers.
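
If it helps to picture it, the nightly loop is conceptually something like the sketch below: build, install onto a lab filer, run through the test list, report. The command names, paths, hostnames, and test names are made up for illustration; this is not the real ANT code.

    #!/usr/bin/env python
    # Conceptual sketch of a nightly build-and-test loop in the spirit
    # of ANT.  The commands, paths, and test names are illustrative
    # assumptions, not the actual implementation.

    import datetime
    import subprocess
    import sys

    FILER = "testfiler1"                       # hypothetical lab filer
    TESTS = ["nfs_basic", "cifs_basic", "snapshot_create", "raid_scrub"]

    def run(cmd):
        """Run a shell command, returning True on success."""
        print("%s: running: %s" % (datetime.datetime.now(), cmd))
        return subprocess.call(cmd, shell=True) == 0

    def nightly():
        # 1. Build tonight's software from the latest sources.
        if not run("make -C /build/ontap nightly"):
            return "BUILD FAILED"

        # 2. Download the freshly built image onto the lab filer.
        if not run("install_image --filer %s /build/ontap/image" % FILER):
            return "INSTALL FAILED"

        # 3. Run the automated functional tests against it.
        failures = [t for t in TESTS
                    if not run("run_test --filer %s %s" % (FILER, t))]
        return "PASS" if not failures else "FAILED: " + ", ".join(failures)

    if __name__ == "__main__":
        result = nightly()
        print(result)
        sys.exit(0 if result == "PASS" else 1)

The important part is that the whole cycle happens unattended every night, so a bad change shows up the next morning instead of weeks later.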
In addition, there are lots of manual tests, and longer-term stress tests that each release must go through before being released. Some of the manual testing includes things that are very difficult to automate, like pulling the power and screwing around with drives. Other tests are manual because we haven't yet gotten around to automating them.
Dave