New subject: Possible software cause for total data loss?

4 Nov 1997


      On 11/04/97 10:48:56 you wrote:
...
...
During the pre-production testing of one of our F230 filers, we
discovered a problem with one of them that we were only able to fix by
rebuilding the RAID (and thus losing whatever OS and data was on the
filer).
Be careful with precise words.  "Rebuilding the RAID" is something
that happens when a disk fails, and it doesn't cause you to lose your
data.  If you mean, say, re-initializing the filesystem, then that
makes more sense.
...
Part of the tests consisted of filling up the filesystem via NFS
and NDMP copies from a host Ultrasparc.  Three other F230's of
identical configuration survived the tests, but the remaining F230
experienced the following panic four times:
PANIC: ../common/wafl/nvlog.c: 1088: Assertion failure
I will be running the NVRAM diagnostics later today to see if they
turn up anything.  However, more distressing is the behaviour of the
Netapp upon reboot:
You are correct that it's probably a software bug, although running the
NVRAM diagnostics (and perhaps re-seating the card) is certainly
something you should try.
...
[... other boot messages deleted...]
Loading filesystem.
Recomputing parity in NVRAM
PANIC: ../driver/disk/disk.c:2633: Assertion failure.
version: NetApp Release 4.2a: Fri Sep  5 09:36:36 PDT 1997
cc flags: 3
dumping core: .......... Old core present on disk --- not dumped.
Program terminated
ok
At this point the filer is inaccessible, and I can't find a way to
get it up and running.
Why was it "inaccessible"?  Just reboot it again.
...
Is there a way to flush the NVRAM or ignore an
existing dump... some way to turn NFS back on so the data can be
retrieved.
Yes... just keep rebooting and eventually it will throw away
the NVRAM.  As for "ignore an existing dump", no, I don't think
so, but that's okay... the fact that the filer can't dump core
isn't preventing you from rebooting (although I think it does
prevent the auto-reboot).
...
Booting the kernel off floppy doesn't help because it
tries to replay the WAFL logs too, and another panic occurs.
When NVRAM is corrupt, you have to keep rebooting several times.
The sequence is usually like this.
1. Filer crashes while running - Reboot
2. Filer crashes replaying NVRAM - Reboot
3. Filer crashes again while replaying NVRAM - Reboot
4. Filer realizes it's failed replaying NVRAM twice in a row, so
   it flags it as bad, dumps the NVRAM, and - Reboot
5. Filer comes back up, probably in degraded mode, and is thus
   reconstructing.  If there has been filesystem damage it may
   crash here again, and reboot again.  If you still can't get
   it up (it may say "Filesystem may be scrambled") or you can't
   get it up for any length of time, you should call Netapp
   support and have them help you with the procedure for fixing
   the filesystem (wack) from floppy.
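
The sequence above can be sketched as a toy state machine.  This is
only an illustration of the behaviour described in steps 2 to 5, not
the filer's actual code: the class, the function, and the constant
below are all hypothetical names, and the limit of two consecutive
replay failures is taken from step 4.  Degraded-mode reconstruction
and any filesystem damage are not modelled.

```python
from dataclasses import dataclass, field

# Assumption from step 4: NVRAM is given up on after two replay
# failures in a row.  The real threshold may differ.
REPLAY_FAILURE_LIMIT = 2


@dataclass
class ToyFiler:
    """Hypothetical model of the reboot sequence, for illustration only."""
    nvram_corrupt: bool = True
    consecutive_replay_failures: int = 0
    nvram_discarded: bool = False
    log: list = field(default_factory=list)

    def boot(self) -> bool:
        """Run one boot attempt; return True if the filer comes up."""
        if self.nvram_discarded or not self.nvram_corrupt:
            self.log.append("filer is up")
            return True
        if self.consecutive_replay_failures >= REPLAY_FAILURE_LIMIT:
            # Step 4: replay has failed twice in a row, so flag the
            # NVRAM as bad, throw it away, and reboot once more.
            self.nvram_discarded = True
            self.log.append("NVRAM flagged bad and discarded; reboot")
            return False
        # Steps 2 and 3: panic while replaying the NVRAM log.
        self.consecutive_replay_failures += 1
        self.log.append("panic replaying NVRAM; reboot")
        return False


def boots_until_up(filer: ToyFiler, max_boots: int = 10) -> int:
    """Keep rebooting until the filer comes up; return the boot count."""
    for attempt in range(1, max_boots + 1):
        if filer.boot():
            return attempt
    raise RuntimeError("filer never came up in this toy model")


if __name__ == "__main__":
    filer = ToyFiler(nvram_corrupt=True)
    print(boots_until_up(filer))
    for line in filer.log:
        print(line)
```

In this model a filer with corrupt NVRAM needs four boot attempts
after the original crash: two failed replays, one boot that discards
the NVRAM, and one that finally comes up.  The point is simply that
the loop terminates, so repeated rebooting is not futile.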
The real kicker in this is that you have to "know" that it'll do 2 and
3, and won't just keep rebooting forever.  I think I've seen cases where
it takes more than that for it to jettison NVRAM, but I can't be
positive.  This made sense from a design point of view, to only give
up if you fail to replay the NVRAM twice in a row, but in reality it
seems that with most bugs (not all) if it fails once, it'll fail again.
Furthermore, once it decides NVRAM is corrupt, it tries to dump core
and reboot *AGAIN*.  The design thought here was again a sound one -
get a core dump so we can look at the corrupt NVRAM and figure out
what's wrong.  However, in reality, if you've gotten to this point
you've probably already crashed, and dumped core once, so you'll
never be able to see this faulty NVRAM core... at least not until
Netapp starts supporting multiple cores.
The other bad thing about this sequence is you have several crashes
stemming from the original crash, and possibly even several different
bugs, but you'll never be able to get the cores from anything but the
first one.
I think there is a way to bypass some of this by booting off floppy
and jettisoning the NVRAM manually, but given the time involved you
are probably better off just rebooting the filer again.
...
The only
way around I've found is to wipe out the filesystem and start over
again (obviously not the optimal solution).  Ideas?
The above should help.
Bruce
