Thanks Chris. Actually, we have seen this on our F840's as well, which have 36 Gb disks, but mostly on the 18's. We have a script that gets the disk count from the MIB's and compares that to what the total should be - if there is a discrepancy, we know what's happened. There won't be a discrepancy if the disk had failed as it should because it will still be counted in the total.
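The count comparison described above might look roughly like the following. This is only a sketch, not Sam's actual script: the function names are hypothetical, and fetching the disk total over SNMP is left out because the thread does not say which MIB object is read. The reported total is assumed to include disks the filer has marked failed, which is why a properly failed disk produces no discrepancy.

```python
def disk_discrepancy(expected_total, reported_total):
    """Return expected minus reported disk count (0 means nothing is missing).

    reported_total is assumed to come from the filer's MIB and to include
    failed disks, so only a disk that has vanished changes the result.
    """
    if expected_total < 0 or reported_total < 0:
        raise ValueError("disk counts cannot be negative")
    return expected_total - reported_total


def check_filer(name, expected_total, reported_total):
    """Return an alert string if the counts differ, otherwise None."""
    diff = disk_discrepancy(expected_total, reported_total)
    if diff == 0:
        return None
    return (f"{name}: disk count mismatch, expected {expected_total}, "
            f"filer reports {reported_total} (difference {diff})")
```

A cron job could call check_filer() once per filer with a hand-maintained expected total and mail any non-None result.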
-----Original Message-----
From: Chris Blackmor
Sent: Tuesday, April 23, 2002 9:47 AM
To: Sam Schorr
Cc: Server Team; toasters
Subject: Re: disappearing disks
Sam,

If these are 18G drives they have a known failure problem where they just spin down. I have lost 6 in the past month (but with 22 filers I am not concerned - eat all you want... they'll make more). Anyway, with FC disks you can run (as root) the following command, and if you see an XXX in any of the disk spots the drive is screwed and needs to be replaced.
remsh m5 fcadmin device_map
Loop Map for channel 0a:
Translated Map: Port Count 1
  7
Shelf mapping:

Loop Map for channel 1:
Translated Map: Port Count 57
  7 0 1 2 3 8 9 10 11 16 17 18 19 24 25 26 27 32 33 34 35 40 41 42 43
  48 49 50 51 56 57 58 59 60 61 62 52 53 54 44 45 46 36 37 38 28 29 30
  20 21 22 12 13 14 4 5 6
Shelf mapping:
  Shelf 0: 6 5 4 3 2 1 0
  Shelf 1: 14 13 12 11 10 9 8
  Shelf 2: 22 21 20 19 18 17 16
  Shelf 3: 30 XXX 28 27 26 25 24
  Shelf 4: 38 37 36 35 34 33 32
  Shelf 5: 46 45 44 43 42 41 40
  Shelf 6: 54 53 52 51 50 49 48
  Shelf 7: 62 61 60 59 58 57 56
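For anyone who wants to automate the XXX check, here is a minimal sketch that scans captured "fcadmin device_map" text for XXX entries in the shelf mapping. It assumes the output looks like the sample above; the function name is hypothetical, and running remsh to collect the text is left to the caller. The position it reports is simply the index of the XXX within the shelf's row as printed, not a verified slot number.

```python
import re

CHANNEL_RE = re.compile(r"Loop Map for channel (\S+):")
# "Shelf 3: 30 XXX 28 ..." matches; the "Shelf mapping:" heading does not.
SHELF_RE = re.compile(r"Shelf (\d+):((?:\s+(?:\d+|XXX))*)")


def find_missing_slots(device_map_text):
    """Return a list of (channel, shelf, position) for every XXX entry."""
    missing = []
    # re.split with one capture group yields
    # [preamble, channel1, body1, channel2, body2, ...]
    parts = CHANNEL_RE.split(device_map_text)
    for channel, body in zip(parts[1::2], parts[2::2]):
        for match in SHELF_RE.finditer(body):
            shelf = int(match.group(1))
            for position, entry in enumerate(match.group(2).split()):
                if entry == "XXX":
                    missing.append((channel, shelf, position))
    return missing
```

Run against the sample output above, this should flag one entry on channel 1, shelf 3, which is the spot where disk 29 would normally be listed.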
On Mon, Apr 22, 2002 at 08:24:49AM -0700, Sam Schorr wrote:
We are running 6.1R1 and we have frequently seen disks suddenly disappear without "failing" in the sense that the filer marks the disk as "failed". What we see is that the disk is noticed as missing and there is a message to that effect in /etc/messages, but the disk does not show as failed in sysconfig -r, nor does it show in sysconfig -d. If the disk is pulled and a new disk inserted, a "failed" message appears first, then the new disk is added as a spare. The original disk that went missing did have its data spared out.
We have had to write our own monitoring scripts to pull disk counts from the MIB's so that this condition can be noticed right away. NetApp support says that this "never happens" but we have the /etc/messages files to show the problem. It may be fixed in 6.2?
-----Original Message-----
From: Geoff Hardin
Sent: Monday, April 22, 2002 6:50 AM
To: toasters
Subject: disappearing disks
Fellow toasters;

I have an F760 cluster running NetApp Release 6.1R1P1. In the past week, we have "lost" two disks on separate shelves. The disks seem to disappear from the filer and do not show up. All the disks are Seagate ST318203FC 18GB drives with firmware NA10.

I've seen this happen before on spare disks, and the first disk we lost this week was a spare. Typically, the spare fails and just stops reporting; if you slip it into a different slot the disk reports as failed. No big deal, just a spare failing and the filer doesn't know what to do with it immediately.

But yesterday, we lost a data disk; it actually didn't show up in the weekly cluster notification log that runs at midnight. Around 2pm we received a disk fail alert for the drive and a disk/shelf miscount error. While checking on this, I noticed a third disk, another spare, had "disappeared"; however, once the volume rebuild completed, this disk "reappeared."

I was wondering if anyone else had seen similar behavior on their filers? Like I said, this is a cluster, and its partner was still able to see the third disk that "disappeared", which leads me to believe I have an FC-AL adapter failing. All three disks have been on separate shelves, which also leads me away from suspecting an LRC (that would be too easy). Before I go tearing into the filer though, I wanted to see if anyone else had experience with this problem.
Geoff Hardin
"A one-question geek test: Seen on a California license plate on a VW Beetle: 'FEATURE'..." - Joshua D. Wachs
+-- "Sam Schorr" once said:
| Thanks Chris. Actually, we have seen this on our F840's as well, which have
| 36 Gb disks, but mostly on the 18's. We have a script that gets the disk
| count from the MIB's and compares that to what the total should be - if there
| is a discrepancy, we know what's happened. There won't be a discrepancy if the
| disk had failed as it should because it will still be counted in the total.
We actually had this happen with a 9GB drive on a 740 when it was powered down and back up once (months ago). I just figured it was a one-time thing (and this thing was running an ancient version of the OS), but hearing this from you all now...well, I'm a little concerned about the 760s and 840s we have in production.
Any chance someone wants to share any scripts they have written to look for this? It'd be helpful as a starting point at least.
This happens all the time. Older disks are more prone to this. I have a script that runs every Sunday and checks for errors in the messages files (any disk with more than 5 errors in a week gets manually failed) and then it checks the "fcadmin device_map" output for "XXX". If that shows up, the script flags it. Nothing special, but it is something to let me know if a disk has spun down. This, for obvious reasons, doesn't work on SCSI shelves (but I only have 8 of those left and I am working on removing them from service now).
You could do what Sam mentioned with the disk counts too. Either way will work.

C-
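As a starting point for the weekly error check described above, here is a rough sketch of the "more than 5 errors in a week" rule. It is not Chris's script. The thread does not show what the relevant /etc/messages lines look like, so both the "error" keyword test and the disk-name pattern (a name like 1.29 following the word "disk") are assumptions that would need adjusting to the real log format. The function name is hypothetical too.

```python
import re
from collections import Counter

# ASSUMPTION: disks appear in log lines as "disk <channel>.<id>", e.g. "disk 1.29".
DISK_RE = re.compile(r"\bdisk\s+(\w+\.\d+)\b", re.IGNORECASE)


def disks_over_threshold(lines, threshold=5, disk_re=DISK_RE):
    """Return sorted disk names with more than `threshold` error lines.

    `lines` is any iterable of log lines covering the period of interest
    (for example, one week of the messages file).
    """
    counts = Counter()
    for line in lines:
        # ASSUMPTION: a relevant line contains the word "error".
        if "error" not in line.lower():
            continue
        match = disk_re.search(line)
        if match:
            counts[match.group(1)] += 1
    return sorted(disk for disk, n in counts.items() if n > threshold)
```

Any disk returned here would be a candidate for manually failing; pairing this with a scan of the "fcadmin device_map" output for XXX covers the spun-down case as well.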
On Tue, Apr 23, 2002 at 03:56:43PM -0400, Ozzie Sabina wrote:
+-- "Sam Schorr" once said:
| Thanks Chris. Actually, we have seen this on our F840's as well, which have
| 36 Gb disks, but mostly on the 18's. We have a script that gets the disk
| count from the MIB's and compares that to what the total should be - if there
| is a discrepancy, we know what's happened. There won't be a discrepancy if the
| disk had failed as it should because it will still be counted in the total.
We actually had this happen with a 9GB drive on a 740 when it was powered down and back up once (months ago). I just figured it was a one-time thing (and this thing was running an ancient version of the OS), but hearing this from you all now...well, I'm a little concerned about the 760s and 840s we have in production.
Any chance someone wants to share any scripts they have written to look for this? It'd be helpful as a starting point at least.