On 10/10/99 02:59:52 you wrote:
"alexei" == alexei <alexei@mindspring.net> writes:
this rather distressing. The filer really needs a way to perform a filesystem health check w/o downtime.
alexei> Like running fsck on a mounted filesystem? Some things are
alexei> better done in a quiesced state...
The filer is not a Unix host serving NFS. I expect more out of it. I didn't say it needed to correct errors while serving content, I said it needed to be able to do a health-check. If you go back to my original message, I mentioned that if it required an immutable filesystem to do this, then you should be able to tag a filesystem (not just an export) read only.
The raid scrubbing is indeed meant to be such a check, although it is not filesystem-based.
You can, btw, run fsck on a mounted filesystem. Solaris happily runs 'fsck -n' on a mounted file system (yes, I know, it will also happily run 'rm -rf /', which doesn't mean you should do it - lots of rope and whatnot). Linux will run e2fsck after making some noise. I don't see any reason why this would be dangerous on a filesystem mounted read only.
The problem is that with a changing filesystem, such programs could easily report a problem when in fact there is none. There are some ways around this.
Personally, while I think this should be on Netapp's agenda, there are more important things as well. Wack has been improved and now runs much faster than before. You should expect some downtime to happen when problems occur; having parity inconsistencies is *not* a normal occurrence and should not happen often.
Bruce
"sirbruce" == sirbruce <sirbruce@ix.netcom.com> writes:
sirbruce> The problem is that with a changing filesystem, such
sirbruce> programs could easily report a problem when in fact
sirbruce> there is none. There are some ways around this.
Huh? If the filesystem is made immutable, it isn't a changing filesystem. E.g., on a Unix host, this _should_ be safe:
  unmount filesystem.
  mount filesystem read-only.
  run fsck on filesystem.
  remount filesystem read-write.
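
The four steps above can be sketched as a command plan. This is only an illustration, assuming Linux-style mount options ("-o ro", "-o remount,rw") and using 'fsck -n' for a report-only check; the function name, device and mount point are hypothetical, and the script only prints the commands rather than running them.

```python
def readonly_check_plan(device, mountpoint):
    """Return argv lists for: unmount, mount read-only, check, remount read-write.

    Assumes Linux-style mount options; other Unix variants spell these differently.
    """
    return [
        ["umount", mountpoint],                      # 1. unmount filesystem
        ["mount", "-o", "ro", device, mountpoint],   # 2. mount it read-only
        ["fsck", "-n", device],                      # 3. check only, answer "no" to any repair
        ["mount", "-o", "remount,rw", mountpoint],   # 4. remount read-write
    ]


if __name__ == "__main__":
    # Hypothetical device and mount point, for illustration only.
    for argv in readonly_check_plan("/dev/sdb1", "/export/home"):
        print(" ".join(argv))
```

Printing the plan instead of executing it keeps the sketch safe to run anywhere; the point is the ordering, with the check happening only while nothing can write to the filesystem.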
sirbruce> Personally, while I think this should be on Netapp's
sirbruce> agenda, there are more important things as well. Wack
sirbruce> has been improved and now runs much faster than before.
Agreed. wackz from 5.2.3D1 (what NA had me run) completed on the filer in question in ~ 15 minutes. This is an F740. The filesystem checked is 105GB, composed of three raid-groups (5+1, 5+1, 4+1), 1067656 inodes used of 3651436.
BTW - no errors were found by wackz, in spite of the 212 parity errors corrected a week earlier.
sirbruce> You should expect some downtime to happen when problems
sirbruce> occur; having parity inconsistencies is *not* a normal
sirbruce> occurrence and should not happen often.
Why should I expect downtime? A failed disk is a problem, but it doesn't cause downtime. A failed power-supply is a problem, but it also doesn't cause downtime. A failed head is a problem, but in a cluster, no downtime (well, 60 seconds downtime). NA has designed the filer to stay up in the face of these problems. So if NA has a check list of problems and it is working its way down the check list to keep filers up in the face of these problems, then "file-system health-check and fix" needs to be added to that list. Sure, it isn't a common occurrence, but clearly it happens often enough for NA to have written wack and constantly improved it over the years. I'm arguing that the next improvement is to allow wack to be run on an on-line filer.
j.
On Sun, 10 Oct 1999, Jay Soffian wrote:
BTW - no errors were found by wackz, in spite of the 212 parity errors corrected a week earlier.
This means that your "superblocks" escaped corruption. There could still be corrupted data in the filer.
I always wondered how NA determines that the data is good and the parity is wrong. Obviously if you don't know where the corruption occurred you can't correct it, but the odds of parity being wrong are much smaller than that of the data. How does NA "correct" parity errors?
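
To make the question concrete, here is a small sketch of plain single-parity XOR (the general RAID-4/5 idea, not NA's actual implementation, which isn't described in this thread). The block contents and function names are made up. It shows that a flipped bit in a data block and the same flipped bit in the parity block produce an identical mismatch, so the parity check alone says a stripe is inconsistent but not which block is at fault.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, column by column."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def stripe_is_consistent(data_blocks, parity):
    """A stripe is consistent when the XOR of its data blocks equals the parity."""
    return xor_blocks(data_blocks) == parity


# Three made-up data blocks and their correct parity.
data = [b"\x01\x02", b"\x10\x20", b"\x0f\xf0"]
parity = xor_blocks(data)

# Case A: one bit flipped in the second data block, parity intact.
bad_data = [data[0], b"\x11\x20", data[2]]
# Case B: data intact, the same bit flipped in the parity block.
bad_parity = bytes([parity[0] ^ 0x01, parity[1]])

# XOR of every block in the stripe (data plus parity) is all zeroes when
# consistent; any nonzero result marks the mismatch.
syndrome_a = xor_blocks(bad_data + [parity])
syndrome_b = xor_blocks(data + [bad_parity])

if __name__ == "__main__":
    print(stripe_is_consistent(data, parity))
    print(stripe_is_consistent(bad_data, parity))
    print(stripe_is_consistent(data, bad_parity))
    print(syndrome_a == syndrome_b)
```

Since both cases look the same to the parity check, deciding which side to trust needs information from somewhere else, which is exactly what the question above is asking about.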
Tom