Dear Jeff:
You said--
Incidentally, have you noticed that it now takes about five times longer to dump a core now that they are .nz compressed? This is a very frustrating "feature". The savecore process works while the filer is ON-LINE, whereas the dump process works while the filer is OFF-LINE.
If the goal is for the filer to be ON-LINE more than it is OFF-LINE, why delay the time it spends dumping with compressing a core when we could all gzip it when the filer is ON-LINE? Or why not put the compression code in the savecore process which can be executed after the filer is back ON-LINE?
The problem lies in the implementation of core dump.
Cores are not dumped directly to the file system--at the time the filer panics, you can't risk touching the file system lest you corrupt it. So the current state--the core--is dumped to a reserved area on the disk. More precisely, the reserved areas on all the disks are filled one by one with chunks of the core dump, and savecore reassembles those chunks into a single core file after reboot.
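To illustrate the idea only (this is a toy sketch, not the filer's actual code), the fill-and-unwind scheme looks roughly like the following. The names `spread_core` and `reassemble` and the list of area sizes are made up for the example; the real on-disk format is not described here.

```python
def spread_core(core: bytes, area_sizes: list[int]) -> list[bytes]:
    """Fill each reserved area in turn with the next chunk of the core.

    area_sizes is a hypothetical list: the reserved bytes on each disk.
    """
    if sum(area_sizes) < len(core):
        # The situation described in this letter: more memory than reserved disk.
        raise ValueError("reserved areas too small for the whole core")
    chunks = []
    offset = 0
    for size in area_sizes:
        chunks.append(core[offset:offset + size])
        offset += size
    return chunks


def reassemble(chunks: list[bytes]) -> bytes:
    """What a savecore-like step does after reboot: join the chunks in order."""
    return b"".join(chunks)
```

With a 10-byte "core" and three 4-byte areas, the first two areas are filled completely and the third holds the last 2 bytes; joining the chunks in order gives back the original core.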
As filers gained more main memory, we began running out of reserved disk area before the whole core was dumped. The ratio of main memory to reserved disk space depends on memory size and disk model; the larger the ratio, the more likely you'll be unable to dump the whole core. In response, we implemented the compressed core feature. Now, if the filer computes that there isn't enough reserved disk space to save the entire core uncompressed, we compress the core before writing it out.
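In rough terms, the decision amounts to the sketch below. It is an illustration under assumptions, not the filer's code: `prepare_dump` is a hypothetical name, and Python's `zlib` stands in for whatever compressor the filer actually uses.

```python
import zlib


def prepare_dump(core: bytes, reserved_bytes: int) -> tuple[bytes, bool]:
    """Return (data to write, whether it was compressed).

    reserved_bytes is the total reserved disk space (hypothetical parameter).
    """
    if len(core) <= reserved_bytes:
        # The whole core fits: write it as-is and skip the compression cost.
        return core, False
    # Not enough room: spend time compressing so the entire core can be kept.
    return zlib.compress(core), True
```

The compression time is paid only when the uncompressed core would not fit, which is the tradeoff discussed in this letter.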
(The other obvious fix would be to enlarge the reserved disk area, but we were loath to do that because we didn't want to change the disk layout. Such a change would deeply affect both reverting to previous releases and migrating disks to new filer heads during an upgrade. Basically, we concluded that the data layout on the disk is sacrosanct.)
Finally, we concluded that panics were sufficiently rare events that we were willing to trade off some time during compression to ensure that we got the entire core, without too badly affecting our overall availability. Of course, the very fact that we had to make tradeoffs means that some customers in some configurations would see some degradation. We are constantly looking for ways to improve our self-diagnostic capability. In fact, the guy in the next office is looking at core dumps right now, and I'll make sure he has read your note.
Yours,
Mike Tuciarone
Platform Software