I was reading the docs prior to upgrading to 5.2.3, and I realized that I get to upgrade everything: the OS, the filer firmware, and the disk firmware.
The thing that struck me as unfortunate was this, from na_disk_fw_update(1):
This command makes disks inaccessible for up to 2 minutes, so network sessions using the filer should be closed down before running it.
I've only got 4 shelves, so it's not a huge deal, but I could see this having a big impact in larger environments.
Obviously the disk has to be inaccessible during the firmware upgrade, but is disabling file service necessary? The reason I ask is that the filer can already run short a disk following a failure, so why not copy that functionality so that disk firmware updates can be done without service disruption?
....just a'wonderin'..
...kg..
> Obviously the disk has to be inaccessible during the firmware upgrade, but is disabling file service necessary?
As I remember, the reason for that is that
1) while the disk or disks in question are inaccessible, the box may not be able to service NFS or CIFS requests;
2) NFS clients can generally cope with that, unless they've done soft mounts. Soft mounts aren't a good idea if you're going to be writing to the file system: applications generally don't like getting write errors, and that's what you get on a soft-mounted file system if there's a temporary server, network, or NIC outage (and they're not even that wild about getting read errors). CIFS clients, however, act like NFS clients with soft mounts, and get somewhat peeved if the server takes too long to respond to a request.
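To make the soft-versus-hard distinction concrete, here's a minimal sketch (the path is made up, and the exact errno depends on the client) of what an application sees when the server goes quiet during the firmware download:

    #include <errno.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        /* Hypothetical path on an NFS mount from the filer. */
        int fd = open("/mnt/filer/logfile", O_WRONLY|O_CREAT|O_APPEND, 0644);
        if (fd < 0) {
            perror("open");
            return 1;
        }
        const char rec[] = "one record\n";
        if (write(fd, rec, sizeof rec - 1) < 0) {
            /*
             * On a soft mount, a server that stays quiet past the
             * retry limit surfaces here as an error (typically EIO or
             * ETIMEDOUT), and most applications just give up.  On a
             * hard mount the write blocks and retries, so a two-minute
             * outage shows up only as a stall.
             */
            fprintf(stderr, "write: %s\n", strerror(errno));
            return 1;
        }
        close(fd);
        return 0;
    }

There's no CIFS analogue of the hard-mount retry loop, which is why it's the CIFS clients that get peeved.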
> The reason I ask is that the filer can already run short a disk following a failure, so why not copy that functionality so that disk firmware updates can be done without service disruption?
The filer can run short a disk...
...but it's in degraded mode when it's doing that, and at risk of data loss if another disk in the RAID group fails...
...and you'd want the disk to come back when the firmware update is finished, so it'd have to reconstruct onto that disk when it's done.
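(For what "reconstruct" means here: with a single disk missing from a parity RAID group, each of its blocks is just the XOR of the surviving disks' blocks in that stripe, so rebuilding the disk means doing something like the sketch below for every stripe it holds:)

    #include <stddef.h>

    /*
     * Rebuild one missing block in a parity-RAID stripe: with a single
     * disk absent, its block is the XOR of the corresponding blocks on
     * all surviving disks (data and parity alike), since
     * parity = d0 ^ d1 ^ ... ^ dn.
     */
    void reconstruct_block(const unsigned char *const surviving[],
                           size_t nsurviving, size_t blocksize,
                           unsigned char *missing)
    {
        for (size_t i = 0; i < blocksize; i++) {
            unsigned char x = 0;
            for (size_t d = 0; d < nsurviving; d++)
                x ^= surviving[d][i];
            missing[i] = x;
        }
    }

Doing that for every stripe on the disk is why a reconstruction is time-consuming and why the group stays degraded until it finishes.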
We could, I guess, mark the disk whose firmware is being updated as failed before starting the firmware update (but not update the label), so that we go into degraded mode on the RAID group to which it belongs, and, when the disk firmware update is finished, turn the disk into a spare, which will start a reconstruction onto the disk if the reconstruction hasn't already started with a spare disk.
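In outline - with entirely made-up names; none of this corresponds to actual ONTAP internals - that sequence would be something like:

    #include <stdio.h>

    struct disk { const char *name; };

    /* Stubs that just narrate each step of the proposed sequence. */
    void raid_fake_fail(struct disk *d)
        { printf("degrade the RAID group; take %s out of service, label untouched\n", d->name); }
    void disk_download_fw(struct disk *d)
        { printf("download firmware to %s (inaccessible for up to ~2 minutes)\n", d->name); }
    void raid_make_spare(struct disk *d)
        { printf("hand %s back as a spare\n", d->name); }
    void raid_reconstruct_onto(struct disk *d)
        { printf("reconstruct the degraded group onto %s\n", d->name); }

    int main(void)
    {
        struct disk d = { "9a.3" };   /* hypothetical disk name */
        raid_fake_fail(&d);      /* and suppress the hunt for a replacement spare */
        disk_download_fw(&d);
        raid_make_spare(&d);
        raid_reconstruct_onto(&d);
        return 0;
    }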
However:
1) that obviously won't work if the RAID group is *already* lacking a disk;
2) a reconstruction can be time-consuming;
3) you're still in degraded mode until the reconstruction finishes, so you're still at risk of data loss if another disk in that RAID group fails;
4) we'd either have to suppress the search for a replacement disk when we fake-fail the disk whose firmware we're updating, or have our customers live with having a spare (which may be bigger than the disk it'd replace, thus wasting the disk) taken for the reconstruction;
5) we'd also have to nuke a reconstruction - or wait for it to finish - if the disk whose firmware is to be updated is a disk onto which we're doing a reconstruction.
It *might* be possible to keep track (in NVRAM, or some other non-volatile storage) of the stripes on that disk to which you would have written had it not been out of service during the firmware update, and repair only those stripes.
This would be more complicated code than what we have now, which, of course, means the risk of buggier code.
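For what it's worth, the bookkeeping for that would look roughly like the sketch below; the stripe count, the plain array standing in for NVRAM, and the function names are all made up for illustration:

    #include <limits.h>
    #include <string.h>

    /*
     * One bit per stripe, notionally kept in NVRAM (an ordinary array
     * here), recording stripes written while the disk is out of
     * service for its firmware download.
     */
    #define NSTRIPES (1UL << 20)                       /* made-up stripe count */
    #define WORDBITS (sizeof(unsigned long) * CHAR_BIT)

    unsigned long dirty[NSTRIPES / WORDBITS];

    /* Called for each stripe written while the disk is offline. */
    void mark_dirty(unsigned long stripe)
    {
        dirty[stripe / WORDBITS] |= 1UL << (stripe % WORDBITS);
    }

    /* Called when the disk comes back: repair only the marked stripes
     * instead of reconstructing the whole disk. */
    void repair_dirty_stripes(void (*repair)(unsigned long stripe))
    {
        for (unsigned long s = 0; s < NSTRIPES; s++)
            if (dirty[s / WORDBITS] & (1UL << (s % WORDBITS)))
                repair(s);
        memset(dirty, 0, sizeof dirty);
    }

The bitmap would have to survive a reboot (hence NVRAM), and the repair still depends on parity having been kept up to date on every write in the meantime - which is the sort of extra machinery the "buggier code" worry is about.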