On 11/04/97 07:02:23 you wrote:
We have a bunch of filers sitting idle waiting for some back-ordered components, so I've taken the opportunity to "stress" them and see what breaks. ;-) On a quiet filer (no exports), I started with a two-disk configuration (it doesn't seem to mind having only one parity and one data disk) and hot-plugged four more drives in rapid succession without a `raid swap'.
The usual warnings on the first write to a new disk were logged, but then the filer complained about all three SCSI buses (only 9a was in use at the time). Beyond that there were no other difficulties.
Of course, you may have just been lucky. Under different circumstances or configurations it could very easily hang or crash your filer. Don't try this at home, folks (well, at least not until it's supported). You'll note that it had to reset the adapter and the bus to try to regain a sense of sanity.
Bruce
On Tue, 4 Nov 1997 sirbruce@ix.netcom.com wrote:
Of course, you may have just been lucky. Under different circumstances or configurations it could very easily hang or crash your filer. Don't try this at home, folks (well, at least not until it's supported).
Yes, definitely do not try this on a filer with real data on it. I had the opportunity to play with a handful of units before we pressed them into production, and I sleep much better knowing how the filers react to adverse conditions.
BTW, could someone at Netapp tell us what exactly is done during their pre-ship burn-in testing?
Brian Tao asks:
BTW, could someone at Netapp tell us what exactly is done during their pre-ship burn-in testing?
I asked the folks in manufacturing what exactly we do, and here's what I learned. The actual burn-in testing for each system consists of three phases:
(1) ICT (In-Circuit Test)
Individual boards are tested in isolation -- before systems are assembled -- using special ICT test harnesses.
My understanding is that the ICT harness has special pins that directly touch traces on the board. This lets ICT test individual components to identify failures at a very low level. (This helps us to trace problems back to their root cause. Maybe a given lot of capacitors from a particular vendor is out of spec.)
(2) System Functional Testing (for 1 day)
Next the systems are assembled as per the customer's order and then tested at the system level.
First they go through a 1-day functional test that checks CPU, memory subsystem, I/O, NVRAM, and the storage subsystem. This testing includes our standard system diagnostics.
(3) Stress Testing (for 2 days)
After the functional testing, systems go into 2 days of stress testing. Originally we generated load using UNIX clients, but now we use two filers in a back-to-back configuration to test each other. They each generate a load simulating 40 clients.
All accessories (stand-alone drives, memory, NICs, etc.) are put into a filer and run through this same test process prior to shipment.
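As a rough summary of the three phases described above, here is a small Python sketch. It is only an illustration of the figures given in this message, not anything manufacturing actually runs; the names `Phase`, `PHASES`, `system_level_days`, and `simulated_clients_per_pair` are made up for the example, and ICT has no duration because none was stated.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Phase:
    name: str
    scope: str            # what is under test in this phase
    days: Optional[int]   # None where the message gives no duration


# The three burn-in phases, in the order they are performed.
PHASES: List[Phase] = [
    Phase("ICT (In-Circuit Test)", "individual boards", None),
    Phase("System Functional Testing", "assembled system", 1),
    Phase("Stress Testing", "assembled system", 2),
]

# Figures from the stress-testing description.
CLIENTS_SIMULATED_PER_FILER = 40
FILERS_PER_STRESS_PAIR = 2


def system_level_days(phases: List[Phase]) -> int:
    """Total days for the phases that have a stated duration."""
    return sum(p.days for p in phases if p.days is not None)


def simulated_clients_per_pair() -> int:
    """Simulated clients across one back-to-back pair of filers."""
    return CLIENTS_SIMULATED_PER_FILER * FILERS_PER_STRESS_PAIR


if __name__ == "__main__":
    for p in PHASES:
        length = f"{p.days} day(s)" if p.days is not None else "duration not stated"
        print(f"{p.name}: {p.scope}, {length}")
    print("System-level burn-in days:", system_level_days(PHASES))
    print("Simulated clients per pair:", simulated_clients_per_pair())
```

Going by the numbers above, each assembled system spends three days in system-level testing, and a back-to-back pair carries the simulated load of 80 clients between the two filers.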
It's important to distinguish between pre-shipment burn-in and the Quality Assurance (QA) that we do as part of each new software or new hardware release.
The goal of QA is to ensure that the DESIGN is correct. The goal of burn-in is to ensure that the hardware is assembled correctly and that all components work. QA tends to focus on software coverage, while burn-in tends to focus on hardware component coverage.
So it is in QA that we try to think up nasty things to do to filers, much as you have been doing with your spare systems.
Our QA setup is actually very cool. We have a system called ANT (Automated Nightly Testing) that automatically builds our software every night, downloads it to a filer, and runs it through a series of functional tests. This gives us very quick feedback during development if something goes wrong. And of course, we're always trying to extend the automated tests, as we find clever new ways of breaking filers.
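The nightly cycle described above (build, download to a filer, run functional tests) has the general shape of a stop-on-first-failure pipeline. The Python sketch below shows that shape only. It is not ANT's actual code, and every name in it (`run_nightly`, `StepResult`, and the three stub step functions) is a hypothetical stand-in.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class StepResult:
    name: str
    ok: bool


Step = Tuple[str, Callable[[], bool]]


def run_nightly(steps: List[Step]) -> List[StepResult]:
    """Run the steps in order and stop at the first one that fails."""
    results: List[StepResult] = []
    for name, step in steps:
        ok = step()
        results.append(StepResult(name, ok))
        if not ok:
            break  # later steps are meaningless if an earlier one failed
    return results


# Hypothetical stubs; a real system would invoke the build, push the
# image to a test filer, and drive the functional test suite.
def build_software() -> bool:
    return True


def download_to_filer() -> bool:
    return True


def run_functional_tests() -> bool:
    return False  # pretend last night's build broke something


NIGHTLY_STEPS: List[Step] = [
    ("build", build_software),
    ("download to filer", download_to_filer),
    ("functional tests", run_functional_tests),
]

if __name__ == "__main__":
    for result in run_nightly(NIGHTLY_STEPS):
        print(f"{result.name}: {'ok' if result.ok else 'FAILED'}")
```

Stopping at the first failure keeps the morning report short and points directly at the step that broke, which is the kind of quick feedback mentioned above.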
In addition, there are lots of manual tests, and longer-term stress tests that each release must go through before being released. Some of the manual testing includes things that are very difficult to automate, like pulling the power and screwing around with drives. Other tests are manual because we haven't yet gotten around to automating them.
Dave