On 11/04/97 10:48:56 you wrote:
During pre-production testing of our F230 filers, we
discovered a problem with one of them that we were only able to fix by rebuilding the RAID (and thus losing whatever OS and data was on the filer).
Be careful to use precise wording. "Rebuilding the RAID" is something that happens when a disk fails, and it doesn't cause you to lose your data. If you mean, say, re-initializing the filesystem, then that makes more sense.
Part of the tests consisted of filling up the filesystem via NFS and NDMP copies from a host Ultrasparc. Three other F230s of identical configuration survived the tests, but the remaining F230 experienced the following panic four times:
PANIC: ../common/wafl/nvlog.c: 1088: Assertion failure
I will be running the NVRAM diagnostics later today to see if they turn up anything. However, more distressing is the behaviour of the Netapp upon reboot:
You are correct that it's probably a software bug, although running the NVRAM diagnostics (and perhaps re-seating the card) is certainly something you should try.
[... other boot messages deleted...]
Loading filesystem.
Recomputing parity in NVRAM
PANIC: ../driver/disk/disk.c:2633: Assertion failure.
version: NetApp Release 4.2a: Fri Sep 5 09:36:36 PDT 1997
cc flags: 3
dumping core: ..........
Old core present on disk --- not dumped.
Program terminated ok
At this point the filer is inaccessible, and I can't find a way to get it up and running.
Why was it "inaccessible"? Just reboot it again.
Is there a way to flush the NVRAM or ignore an existing dump... some way to turn NFS back on so the data can be retrieved?
Yes... just keep rebooting and eventually it will throw away the NVRAM. As for "ignore an existing dump", no, I don't think so, but that's okay... the fact that the filer can't dump core isn't preventing you from rebooting (although I think it does prevent the auto-reboot).
Booting the kernel off floppy doesn't help because it tries to replay the WAFL logs too, and another panic occurs.
When NVRAM is corrupt, you have to keep rebooting several times. The sequence is usually like this:
1. Filer crashes while running - Reboot
2. Filer crashes replaying NVRAM - Reboot
3. Filer crashes again while replaying NVRAM - Reboot
4. Filer realizes it's failed replaying NVRAM twice in a row, so it flags it as bad, dumps the NVRAM, and - Reboot
5. Filer comes back up, probably in degraded mode, and is thus reconstructing. If there has been filesystem damage it may crash here again, and reboot again.
If you still can't get it up (it may say "Filesystem may be scrambled") or you can't keep it up for any length of time, you should call Netapp support and have them help you with the procedure for fixing the filesystem (wack) from floppy.
The real kickers in this are you have to "know" that it'll do 2 and 3, and won't just keep rebooting forever. I think I've seen cases where it takes more than that for it to jettison NVRAM, but I can't be positive. This made sense from a design point of view, to only give up if you fail to replay the NVRAM twice in a row, but in reality it seems that with most bugs (not all) if it fails once, it'll fail again. Furthermore, once it decides NVRAM is corrupt, it tries to dump core and reboot *AGAIN*. The design thought here was again a sound one - get a core dump so we can look at the corrupt NVRAM and figure out what's wrong. However, in reality, if you've gotten to this point you've probably already crashed, and dumped core once, so you'll never be able to see this faulty NVRAM core... at least not until Netapp starts supporting multiple cores.
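To make the reboot counting concrete, here is a rough toy model of the sequence as I described it above. It is only a sketch of my understanding, not anything taken from the filer software: the class, the function, and the limit of two consecutive replay failures are all assumptions made for illustration, and it also models the single core-dump slot.

```python
from dataclasses import dataclass, field

# Assumption: the filer gives up after failing replay "twice in a row".
REPLAY_FAILURE_LIMIT = 2


@dataclass
class FilerModel:
    """Hypothetical toy model of the boot sequence; not NetApp code."""
    nvram_corrupt: bool = True
    nvram_discarded: bool = False
    consecutive_replay_failures: int = 0
    core_slot_full: bool = False
    log: list = field(default_factory=list)

    def panic(self, reason):
        # Only one core fits on disk, so every panic after the first
        # reports an old core and dumps nothing.
        if self.core_slot_full:
            self.log.append(f"PANIC ({reason}): old core present, not dumped")
        else:
            self.core_slot_full = True
            self.log.append(f"PANIC ({reason}): core dumped")

    def boot(self):
        """Run one boot attempt; return True if the filer comes up."""
        if self.nvram_discarded or not self.nvram_corrupt:
            self.log.append("boot: filer is up")
            return True
        if self.consecutive_replay_failures >= REPLAY_FAILURE_LIMIT:
            # Step 4: flag NVRAM as bad, throw it away, try to dump core.
            self.nvram_discarded = True
            self.panic("NVRAM flagged bad and discarded")
            return False
        # Steps 2 and 3: replay is attempted and fails.
        self.consecutive_replay_failures += 1
        self.panic("NVRAM replay failed")
        return False


def reboots_until_up(filer, max_reboots=10):
    """Crash once while running (step 1), then reboot until the filer is up."""
    filer.panic("crash while running")
    for attempt in range(1, max_reboots + 1):
        if filer.boot():
            return attempt
    return None
```

In this model a filer with corrupt NVRAM needs four manual reboots before it stays up, and only the very first panic leaves a core behind, which is the problem with the single core slot.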
The other bad thing about this sequence is you have several crashes stemming from the original crash, and possibly even several different bugs, but you'll never be able to get the cores from anything but the first one.
I think there is a way to bypass some of this by booting off floppy and jettisoning the NVRAM manually, but given the time involved you are probably better off just rebooting the filer again.
The only way around I've found is to wipe out the filesystem and start over again (obviously not the optimal solution). Ideas?
The above should help.
Bruce
On Tue, 4 Nov 1997 sirbruce@ix.netcom.com wrote:
You are correct that it's probably a software bug, although running the NVRAM diagnostics (and perhaps re-seating the card) is certainly something you should try.
The diagnostics don't indicate a problem with the NVRAM subsystem or any other component (after running in a loop for a few hours). I don't think it is purely a software bug though, given that I have three other identical filers that have yet to crash during testing, while the problem filer crashed four times. I'll try running the tests again tonight, and if the panic still occurs, I'll swap the NVRAM (board and all) with one of our spares.
[... other boot messages deleted...]
Loading filesystem.
Recomputing parity in NVRAM
PANIC: ../driver/disk/disk.c:2633: Assertion failure.
version: NetApp Release 4.2a: Fri Sep 5 09:36:36 PDT 1997
cc flags: 3
dumping core: ..........
Old core present on disk --- not dumped.
Program terminated ok
At this point the filer is inaccessible, and I can't find a way to get it up and running.
Why was it "inaccessible"? Just reboot it again.
And again it panics. It never gets up to the point where I can log in to it via console or network. Powering off doesn't solve the problem, so I suspect something bogus in the NVRAM is triggering the software fault.
Is there a way to flush the NVRAM or ignore an existing dump... some way to turn NFS back on so the data can be retrieved?
Yes... just keep rebooting and eventually it will throw away the NVRAM.
Hrm, that doesn't sound like a very reliable way of doing it. ;-) Apparently there is a hidden command from floppy boot that lets you zap the NVRAM (you then have to run wacky, but at least you're back up and running).
When NVRAM is corrupt, you have to keep rebooting several times. The sequence is usually like this.
- Filer crashes while running - Reboot
- Filer crashes replaying NVRAM - Reboot
- Filer crashes again while replaying NVRAM - Reboot
- Filer realizes it's failed replaying NVRAM twice in a row, so it flags it as bad, dumps the NVRAM, and - Reboot
- Filer comes back up, probably in degraded mode, and is thus reconstructing. If there has been filesystem damage it may crash here again, and reboot again. If you still can't get it up (it may say "Filesystem may be scrambled") or you can't keep it up for any length of time, you should call Netapp support and have them help you with the procedure for fixing the filesystem (wack) from floppy.
The real kickers in this are you have to "know" that it'll do 2 and 3, and won't just keep rebooting forever.
I'll give this a try if the problem persists. I recall floppy booting the filer twice after the initial crash and failed auto-reboot, but not a third time.
The only way around I've found is to wipe out the filesystem and start over again (obviously not the optimal solution). Ideas?
The above should help.
Excellent, thanks.
On Tue, 4 Nov 1997 sirbruce@ix.netcom.com wrote:
- Filer crashes while running - Reboot
- Filer crashes replaying NVRAM - Reboot
- Filer crashes again while replaying NVRAM - Reboot
- Filer realizes it's failed replaying NVRAM twice in a row, so it flags it as bad, dumps the NVRAM, and - Reboot
After replacing all the disks with new ones, the same crash popped up again on that filer, after two more days of heavy reads and writes. I rebooted it ten times in a row... kernel panic every time on "../driver/disk/disk.c:2633: Assertion failure" right after the "Recomputing parity in NVRAM" message (I assume that means it never gets around to replaying the WAFL logs?).
Called Netapp tech support and was told an on-call engineer would call back... haven't heard anything back yet. *sigh*