Following some very nasty experiences with Veritas (vxvm & vxfs) on one of our systems, we are currently in paranoia mode about our ability to restore from backups, and have been doing more than usually extensive tests.
I have come across a serious problem with ONTAP restore, but I am in communication with NetApp about that and it's not the subject of this message. Maybe later on ... :-)
NetApp have said on a number of occasions that it's a design feature of their dump format that it is a compatible extension of good old BSD dump format, so that if all else fails one can feed such dumps to a BSD-type restore program (losing ACLs and suchlike info, of course). Solaris "ufsrestore" is usually explicitly or implicitly mentioned.
So I have been testing that, and have fallen over a problem that I knew about (at least since May 2000, as I see I mentioned it in passing on toasters then) but have been ignoring. Maybe it's time to do something about it! What happens is that ufsrestore says
   write error extracting inode NNNNN, name ./path/name/to/file
   write: Bad address
and gives up. One can see that it has half-written the file involved. It seems to happen only on files that have holes (often many of them) at odd multiples of 4K: in our case they are usually *.pag files that are part of (n)dbm databases.
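For anyone who wants to experiment, here is a rough sketch of one way to construct a file with that sort of hole pattern. The filename and the particular block offsets are made up for illustration (they are not taken from our actual *.pag files); the point is just that the hole boundaries fall on odd multiples of 4K rather than 8K.

```shell
#!/bin/sh
# Sketch: build a sparse file whose holes have boundaries at odd
# multiples of 4K, roughly the shape of an (n)dbm *.pag file.
# "holey.pag" and the block offsets are illustrative, not real data.

f=holey.pag
rm -f "$f"

# Write one 4K block of (zero) data at blocks 0, 3, 4 and 9.  The
# unwritten gaps - blocks 1-2 (bytes 4K-12K) and blocks 5-8 (bytes
# 20K-36K) - are left as holes, with boundaries at odd multiples of 4K.
for blk in 0 3 4 9; do
    dd if=/dev/zero of="$f" bs=4096 seek="$blk" count=1 conv=notrunc 2>/dev/null
done

# Apparent size should be 10 blocks (40960 bytes); on a filing system
# that supports holes, du should report rather less than that.
ls -l "$f"
du -k "$f"
```

Whether any particular file of this shape actually provokes the ufsrestore failure is, as noted above, another matter entirely - in my experience it also depends on where the file falls in the dump.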
Now I very strongly suspect that this is a bug in Solaris ufsrestore, not in the NetApp dump contents, and I would like to be able to report it to Sun and get it fixed. But it seems to be very difficult to reproduce with a small example: it's far from the case that every file with oddly-aligned holes causes a problem, or even that a file with exactly the same hole pattern will provoke the bug if it occurs at a different point in the dump.
Even worse, it apparently depends on what sort of filing system ufsrestore is restoring into: I have never seen it happen with a local ufs filing system, but it often does with an nfs one (usually on a NetApp filer, of course). If all these variables are reproduced, though, the effect is repeatable.
It's possible that Sun would reject such a bug report unless one could show that it failed on a dump generated by Solaris ufsdump. Solaris ufs filing systems have always been blocked at 8K by default (so that hole boundaries must fall on multiples of 8K), and on UltraSPARC (sun4u) systems one can't even mount ones blocked at 4K any longer. I tried making a ufsdump of a filing system blocked at 4K on a SPARCstation 5 (sun4m) - sometimes it's useful to have such an out-of-date machine as one's personal workstation! - with a suitably holey file in it, but I couldn't get the ufsrestore bug to show up... :-(
If anyone else has ever come across this problem, and/or has any suggestions on how to proceed with homing in on the bug, I would very much like to hear from them. The latest experiments were done with Solaris 8 ufsrestore as patched by 109091-05, but as I said above I believe the bug has been there for many years.
Chris Thompson
University of Cambridge Computing Service,
New Museums Site, Cambridge CB2 3QH,
United Kingdom.
Email: cet1@ucs.cam.ac.uk
Phone: +44 1223 334715
On 29 August I wrote:

> Following some very nasty experiences with Veritas (vxvm & vxfs) on one of our systems, we are currently in paranoia mode about our ability to restore from backups, and have been doing more than usually extensive tests.
>
> I have come across a serious problem with ONTAP restore, but I am in communication with NetApp about that and it's not the subject of this message. Maybe later on ... :-)
Well, maybe it's time to reveal a little more about that.
Our problem was bug 76695 (we weren't the first to discover it). Introduced in ONTAP 6.2, it stops the *third* level of an incremental restore from ever working if you have a reasonably active filing system - or at any rate, that was our experience. (The first increment restored on top of the level 0 leaves the restore_symboltable file inconsistent with reality.)
This bug is not yet fixed in the recently announced 6.2.1R1 or 6.3 releases. There is a 6.2.1Dx release which fixes it, and I am sure NetApp will tell you about it if you ask nicely. That is what we are running at the moment. It still has a problem with incrementally restoring dumps involving more than one qtree (unless you use the Q option to forget the division into qtrees).
As regards the topic of my original toasters posting - problems restoring NetApp dumps with Solaris ufsrestore - I haven't had any feedback on that at all. Even negative feedback, e.g. "we do this all the time and it's never given us any trouble", would be welcome.
Chris Thompson
University of Cambridge Computing Service,
New Museums Site, Cambridge CB2 3QH,
United Kingdom.
Email: cet1@ucs.cam.ac.uk
Phone: +44 1223 334715