From "Michael J. Tuciarone" on Wed, 22 Mar 2000 16:54:54 PST:
> Cores are not dumped directly to the file system--at the time the filer panics, you can't risk touching the file system lest you corrupt it. So the current state--the core--is not dumped to a single reserved area on one disk; rather, the reserved areas on all disks are filled up one by one with chunks of coredump, and savecore unwinds this after reboot.
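A toy model of the chunked dump described above, for illustration only: the core is written in per-disk chunks into each disk's reserved area, and savecore reassembles them after reboot. The per-disk reserved-area size here is made up, not the actual ONTAP value.

```python
RESERVED_AREA_BYTES = 8 * 1024 * 1024  # hypothetical per-disk reserved size

def dump_core(core: bytes, num_disks: int) -> list:
    """Fill the reserved areas one by one with chunks of the core."""
    chunks = []
    for i in range(num_disks):
        start = i * RESERVED_AREA_BYTES
        if start >= len(core):
            break  # whole core written
        chunks.append(core[start:start + RESERVED_AREA_BYTES])
    else:
        # Every disk's reserved area was used; did the core actually fit?
        if num_disks * RESERVED_AREA_BYTES < len(core):
            raise RuntimeError("ran out of reserved space before "
                               "the whole core was dumped")
    return chunks

def savecore(chunks: list) -> bytes:
    """Unwind the per-disk chunks back into a single core file."""
    return b"".join(chunks)

core = bytes(range(256)) * (80 * 1024)    # a 20 MB synthetic "core"
chunks = dump_core(core, num_disks=4)     # fits in 4 x 8 MB reserved areas
assert savecore(chunks) == core
assert len(chunks) == 3                   # 8 MB + 8 MB + 4 MB
```

With too few disks (say, num_disks=2 for the same 20 MB core), dump_core raises instead of silently truncating, which mirrors the failure mode the next paragraph describes.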
This part I was aware of.
> As filers received more main memory, we began running out of reserved disk areas before the whole core was dumped. The actual
Oh. =(
> ratio of memory to disk depends on memory size and disk model; the larger the ratio, the more likely you'll be unable to dump the whole core. In response, we implemented the compressed core feature. Now, if the filer computes that there isn't enough disk space to save the entire core uncompressed, we compress the core before writing it out.
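The decision described above reduces to a single comparison. A minimal sketch (a model of the logic as the mail explains it, not the actual ONTAP code; the 8 MB per-disk figure is an assumption):

```python
MB = 2**20
GB = 2**30

def should_compress(core_bytes: int, num_disks: int,
                    reserved_per_disk_bytes: int) -> bool:
    """Compress iff the uncompressed core cannot fit in the
    total reserved space across all disks."""
    return core_bytes > num_disks * reserved_per_disk_bytes

# With a hypothetical 8 MB reserved area per disk:
print(should_compress(100 * MB, num_disks=20,
                      reserved_per_disk_bytes=8 * MB))  # False: 100 MB < 160 MB
print(should_compress(1 * GB, num_disks=56,
                      reserved_per_disk_bytes=8 * MB))  # True: 1024 MB > 448 MB
```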
So it doesn't always compress the core during the dump? Even our filers with 56 of the 18GB drives take noticeably longer to dump. Granted, they have 1GB of RAM, but if the reserved areas of 52 4GB disks could hold a core from 512MB of RAM on an F630, why can't 56 18GB drives hold a core from 1GB of RAM on an F760? It seems like five times the amount of disk was added but only twice the amount of RAM.
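One way to work the numbers in the question, under the assumption (and it is only an assumption, not something stated in this thread) that the reserved area is a fixed size per disk rather than proportional to disk capacity--in which case dump space grows with disk count, not total capacity:

```python
GB = 2**30

# F630: 52 x 4 GB drives, 512 MB RAM; F760: 56 x 18 GB drives, 1 GB RAM
old_disks, old_ram = 52, 0.5 * GB
new_disks, new_ram = 56, 1.0 * GB

# Total raw capacity grew roughly 5x, as noted in the question:
capacity_growth = (new_disks * 18) / (old_disks * 4)   # ~4.85x

# But if each disk contributes a FIXED-size reserved area, the
# dump space grows only with the disk count:
reserved_growth = new_disks / old_disks                # ~1.08x
ram_growth = new_ram / old_ram                         # 2x

print(f"capacity x{capacity_growth:.2f}, "
      f"reserved x{reserved_growth:.2f}, RAM x{ram_growth:.0f}")
```

Under that assumption, the reserved dump space barely grew while the core doubled, which would explain why these filers always compress.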
> (The other obvious option is to change the size of the reserved disk area, but we were loath to do that, as we didn't want to make changes to the disk layout. Such changes would deeply affect both reverting to previous releases and the migration of disks to new filer heads during an upgrade. Basically, we concluded that the data layout on the disk is sacrosanct.)
Understandably so. Thanks for not changing the size--reverting would have been horrible.
> Finally, we concluded that panics were sufficiently rare events that we
Uhhh... sufficiently rare? My customer's definition of sufficiently rare downtime is none whatsoever. =)
> were willing to trade off some time during compression to ensure that we got the entire core, without too badly affecting our overall availability. Of course, the very fact that we had to make tradeoffs
In the end, of course, we'll spend the extra few minutes off-line to get the core so you folks can fix our problem(s). Unfortunately, this is the first time we've gotten a complete technical explanation of the problem (the memory-to-reserved-disk ratio). All we had heard before was "Guess what? You don't have to gzip your cores anymore!" which, obviously, didn't sit well. =)
Is there any metric we can use to know whether the filer is going to compress the core or not? All our filers seem to compress all their cores.
Thanks for the in-depth response!
-- Jeff
--
----------------------------------------------------------------------------
Jeff Krueger                          E-Mail: jeff@qualcomm.com
NetApp File Server Lead               Phone:  858-651-6709
IT Engineering and Support            Fax:    858-651-6627
QUALCOMM, Incorporated                Web:    www.qualcomm.com