A number of concerns have been raised on this list about two issues: Fibre Channel drive stability and the stability of the current Data ONTAP release, 5.2.1. I want to take a moment to address them.
Fibre Channel Drive Stability: We have certainly faced some stability problems, and those are obviously the source of the concern expressed in this exchange. But the quality data we monitor tells me that the stability of FC-AL systems has been good. One indication is the initial field failure rate of FC-AL drives, meaning the share of shipped drives that fail within 30 days. It has peaked at 0.3% for 18G drives and 1.19% for 9G drives. That means that no more than 3 out of 1000 18G drives and no more than 12 out of 1000 9G drives shipped failed within 30 days. In the course of diagnosing and addressing the issues that have come up, we have found a few things that everyone should be aware of. The following are the things to watch out for:
1. Disk driver and firmware upgrade
   We discovered some problems in how the FC-AL drive firmware implemented the FC-AL protocol. Newer firmware corrects them. Data ONTAP 5.1.2 or higher and drive firmware FB37 are *required* for FC-AL drive stability. A lot of people have upgraded to 5.1.2 or 5.2.1 but have not taken the time to upgrade the firmware on their disks.

2. New memory DIMMs on F760 systems with FC-AL
   We discovered that the DIMMs originally shipped with these systems were causing noise on the bus and affecting FC-AL signal integrity. NetApp launched, and has since completed, an effort to proactively contact all F760 customers and upgrade them to new DIMMs that were qualified to address this problem. If you feel that you have not been contacted, or know that you have not upgraded the DIMMs on an F760 FC-AL system, you should contact Support immediately.

3. Environmental temperature and system ventilation
   In some instances we have traced a higher-than-normal storage failure rate to overtemperature conditions. This is not unique to FC-AL drives, but Fibre Channel allows for denser storage pools, so air flow and cooling are even more critical. Inadequate ventilation, where the vents are blocked or shelves are placed so that exhaust is directed back into the shelves, can cause the drives to deteriorate over time. We have proven cases where correcting such situations reduced the failure rates.

4. Disk seating
   In some cases we have found that drives that are not properly seated in the shelves can cause spurious errors. This is a physical seating problem that NetApp is addressing by considering new designs for next-generation shelves. For now, we recommend checking that the drives are fully seated by pressing down firmly on each drive to make sure there is no play.
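As an illustration of the requirement in item 1, here is a minimal Python sketch of the check an administrator might script against an inventory. It is not a NetApp tool: the function names are hypothetical, the release string is assumed to be plain dotted numbers (a patch-release suffix would need extra parsing), and the firmware check is an exact match on FB37 because the note above does not say whether any other revision qualifies.

```python
# Minimums stated in item 1 above: Data ONTAP 5.1.2 or higher, drive firmware FB37.
REQUIRED_ONTAP = (5, 1, 2)
REQUIRED_FIRMWARE = "FB37"


def parse_release(release):
    """Turn a dotted release string such as '5.2.1' into a tuple of ints.

    Assumption: the string contains only digits and dots.
    """
    return tuple(int(part) for part in release.split("."))


def meets_fcal_requirements(ontap_release, drive_firmware):
    """Return True only if both the ONTAP release and the drive firmware qualify."""
    ontap_ok = parse_release(ontap_release) >= REQUIRED_ONTAP
    # Assumption: exact match only; later firmware revisions are not considered here.
    firmware_ok = drive_firmware.strip().upper() == REQUIRED_FIRMWARE
    return ontap_ok and firmware_ok


if __name__ == "__main__":
    # Hypothetical inventory: (ONTAP release, drive firmware) pairs.
    for release, firmware in [("5.2.1", "FB37"), ("5.2.1", "FB35"), ("5.1.1", "FB37")]:
        status = "ok" if meets_fcal_requirements(release, firmware) else "needs upgrade"
        print(release, firmware, status)
```

The point of checking both values together is the one made in item 1: upgrading Data ONTAP alone is not enough if the disks are still on older firmware.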
With these precautions and corrective actions, FC-AL systems should be stable. Of course, there is always the occasional disk that does not operate within acceptable parameters. Such cases require us to diagnose the problem and remove the disk to restore stability. But that is the exception. Overall, the metrics we collect indicate that FC-AL drives have been very stable.
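For anyone who wants to relate the 30-day failure percentages quoted earlier to their own drive counts, the conversion is simple arithmetic. The sketch below is illustrative only; the function names and the 56-drive example are hypothetical, and the only figures taken from this note are the 0.3% and 1.19% maximum rates.

```python
# Maximum 30-day field failure rates quoted in this note, in percent.
MAX_RATE_PERCENT = {"18G": 0.3, "9G": 1.19}


def failures_per_thousand(rate_percent):
    """Convert a percentage failure rate into expected failures per 1000 drives."""
    return rate_percent * 10.0


def expected_failures(rate_percent, drive_count):
    """Expected number of early failures among drive_count drives at the given rate."""
    return rate_percent / 100.0 * drive_count


if __name__ == "__main__":
    for size, rate in MAX_RATE_PERCENT.items():
        per_thousand = failures_per_thousand(rate)
        print(f"{size}: at most about {per_thousand:.1f} failures per 1000 drives")
    # Hypothetical site with 56 9G drives.
    print(f"{expected_failures(MAX_RATE_PERCENT['9G'], 56):.2f} expected early failures")
```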
5.2.1 Stability: The current recommended release is 5.2.1. This release has a lot of bug fixes over previous releases, and the metrics lead me to believe that it is a release we should feel comfortable with. Based on autosupport logs, over 650 systems are running 5.2.1 or a patch release based on it. Patch releases are available for some of the common bugs known to exist in 5.2.1, and Tech Support will recommend one if a customer needs it. In another effort to improve stability, Engineering is launching a process for regularly scheduled Maintenance Releases. These will offer you predictability (scheduled and planned releases) and thorough QA, with bug fixes that address customer crashes. The primary release driver will be bug fixes and increased stability. A Pareto analysis of the problems found in 5.2.1 showed that a couple of bugs have caused the most downtime, and Engineering is working on the first maintenance release for 5.2.1 so that new systems shipped from the factory will be protected against these bugs.
If you have any questions about any of the above, feel free to contact Tech Support.
Diptish Datta
Director, Product Support
Network Appliance