1) Minimizing downtime was one of our major requirements. That includes downtime due to admin tasks like patch installs and OS upgrades as well as downtime due to hardware malfunction.
Fast boot time (2 minutes or so) comes from keeping a consistent snapshot of the filesystem on disk and journaling in NVRAM the NFS requests that have arrived since that snapshot. If the system is shut off without halting, on bootup it starts from the last snapshot and replays the journaled NFS requests from NVRAM, guaranteeing a consistent filesystem. You don't need to run "fsck" or some other filesystem consistency checker, which can take a long time to run on a large filesystem.
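The recovery path can be sketched as a toy Python model. It assumes, as in NetApp's WAFL design, that the consistent image is kept on disk and NVRAM holds only the log of requests made since that image. The `Filer` class, its methods, and the key/value "writes" are hypothetical simplifications, not Data ONTAP code.

```python
# Toy model of journal-and-replay recovery (hypothetical; not Data ONTAP code).
from dataclasses import dataclass, field


@dataclass
class Filer:
    # Last consistent image written to disk; survives power loss.
    on_disk: dict = field(default_factory=dict)
    # Working state in ordinary memory; lost on power loss.
    in_memory: dict = field(default_factory=dict)
    # NVRAM journal of requests since the last consistent image; survives power loss.
    nvram_log: list = field(default_factory=list)

    def nfs_write(self, path, data):
        # Journal the request first, then apply it in memory.
        self.nvram_log.append((path, data))
        self.in_memory[path] = data

    def checkpoint(self):
        # Flush a new consistent image to disk; the journal can then be discarded.
        self.on_disk = dict(self.in_memory)
        self.nvram_log.clear()

    def power_loss(self):
        # Volatile memory is gone; the on-disk image and NVRAM journal remain.
        self.in_memory = {}

    def boot(self):
        # Start from the last consistent image and replay the journal.
        # No full-filesystem scan (fsck) is involved.
        self.in_memory = dict(self.on_disk)
        for path, data in self.nvram_log:
            self.in_memory[path] = data


filer = Filer()
filer.nfs_write("/cells/inv", "v1")
filer.checkpoint()
filer.nfs_write("/cells/inv", "v2")  # arrives after the last consistent image
filer.power_loss()
filer.boot()
print(filer.in_memory)  # {'/cells/inv': 'v2'}
```

The work done at boot is proportional to the size of the journal, not the size of the filesystem, which is why recovery stays fast as the filesystem grows.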
Most patch installs and OS upgrades are easy: download, untar, and reboot, which takes all of 5 minutes.
The toaster uses RAID 4 to protect against disk failure.
Disks are the components most likely to fail. When a disk dies, the system runs in degraded mode and your filesystem is at risk if a second disk fails. The toaster can be configured with a hot spare so that RAID redundancy is restored quickly and automatically.
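For readers unfamiliar with RAID 4, the sketch below shows the parity arithmetic behind degraded mode and hot-spare rebuilds. It is a generic Python illustration with hypothetical function names and one block per disk per stripe; it makes no claim about how the toaster actually lays out its disks.

```python
# Toy RAID 4: N data disks plus one dedicated parity disk (generic illustration).
from functools import reduce


def xor_blocks(blocks):
    # Byte-wise XOR of equal-length blocks.
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def make_stripe(data_blocks):
    # RAID 4 keeps all parity on one dedicated disk: parity = XOR of the data blocks.
    return list(data_blocks) + [xor_blocks(data_blocks)]


def read_degraded(stripe, failed_index):
    # With one disk missing, its block equals the XOR of every surviving block
    # (data and parity). A second failure in the same stripe cannot be recovered.
    survivors = [blk for i, blk in enumerate(stripe) if i != failed_index]
    return xor_blocks(survivors)


def rebuild_onto_spare(stripes, failed_index):
    # Reconstruct the failed disk's contents, stripe by stripe, for a hot spare.
    return [read_degraded(stripe, failed_index) for stripe in stripes]


stripe = make_stripe([b"abcd", b"efgh", b"ijkl"])
print(read_degraded(stripe, 1))  # b'efgh'
```

Any single disk, data or parity, can be rebuilt this way. If a second disk fails before the rebuild finishes, a stripe has two unknowns and cannot be recovered, which is why a hot spare that starts rebuilding immediately shortens the window of risk.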
2) Snapshots, when used correctly, can save a lot of time and wasted effort. If your data is repeatedly created, deleted, and recreated, as in an edit/compile/run cycle or when running simulations over and over with incremental changes, then snapshots are not a good thing. Blocks freed by a deletion stay allocated as long as any snapshot still references them, so the snapshots can take over all available filesystem space (caveat: this depends on how often snapshots are taken, the size of your filesystem, and the amount of data that is created, deleted, or changed).
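The space problem with churning data shows up even in a toy copy-on-write model. The sketch assumes every regeneration of a file is written to entirely new blocks, and that a block is freed only when neither the live filesystem nor any snapshot references it. `CowVolume` and its methods are hypothetical.

```python
# Toy model of snapshot space usage in a copy-on-write filesystem (hypothetical).
import itertools


class CowVolume:
    def __init__(self):
        self._next_block = itertools.count()
        self.active = {}      # file name -> set of block ids in the live filesystem
        self.snapshots = []   # each snapshot is a frozen copy of `active`

    def write_file(self, name, nblocks):
        # Assumption: rewriting a file never overwrites in place; it gets fresh blocks.
        self.active[name] = {next(self._next_block) for _ in range(nblocks)}

    def delete_file(self, name):
        self.active.pop(name, None)

    def take_snapshot(self):
        self.snapshots.append({k: frozenset(v) for k, v in self.active.items()})

    def used_blocks(self):
        # A block stays allocated while the live filesystem OR any snapshot uses it.
        live = set().union(*self.active.values())
        held = set()
        for snap in self.snapshots:
            held |= set().union(*snap.values())
        return len(live | held)


# Churning data: simulation output deleted and regenerated every hour.
churn = CowVolume()
for hour in range(12):
    churn.delete_file("sim.out")
    churn.write_file("sim.out", 100)
    churn.take_snapshot()
print(churn.used_blocks())  # 1200: live copy plus 11 older versions pinned by snapshots

# Stable data: written once, snapshotted just as often.
stable = CowVolume()
stable.write_file("design.db", 100)
for hour in range(12):
    stable.take_snapshot()
print(stable.used_blocks())  # 100: all snapshots share the unchanged blocks
```

Real usage falls between these two extremes, depending on how many blocks actually change between snapshots.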
I restrict the data going to our toasters to data created interactively by our designers using Cadence. An IC design database may have 6 or more people working in it, and if a cell disappears or the database becomes corrupted (for any of a million reasons), we lose at most a few hours of work, because we can copy back a previous hourly snapshot.
We have the following policy: "snap sched 2 6 12", which keeps 2 weeklies, 6 nightlies, and 12 hourlies.
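The retention side of a "snap sched 2 6 12" policy can be sketched with bounded queues. The sketch assumes one snapshot per hourly tick, with the midnight tick producing a nightly instead and, one night a week, a weekly instead; that timing is an assumption for illustration. The `SnapRotation` class and its labels are hypothetical, and the ".0 is newest" names only mimic the hourly.0 / nightly.0 style used on the filer.

```python
# Sketch of snapshot retention under a "snap sched 2 6 12"-style policy (hypothetical).
from collections import deque

SCHEDULE = {"weekly": 2, "nightly": 6, "hourly": 12}  # counts from the policy above


class SnapRotation:
    def __init__(self, schedule):
        # One bounded queue per class; a full deque drops its oldest entry on insert.
        self.retained = {kind: deque(maxlen=n) for kind, n in schedule.items()}

    def take(self, kind, label):
        self.retained[kind].appendleft(label)  # newest at position 0

    def names(self):
        # "<kind>.<n>" where .0 is the newest snapshot of that class.
        return {
            f"{kind}.{i}": label
            for kind, q in self.retained.items()
            for i, label in enumerate(q)
        }


rot = SnapRotation(SCHEDULE)
for hour in range(21 * 24):  # three weeks, one tick per hour (assumed timing)
    day, hh = divmod(hour, 24)
    if hh == 0 and day % 7 == 0:
        rot.take("weekly", f"day{day}")
    elif hh == 0:
        rot.take("nightly", f"day{day}")
    else:
        rot.take("hourly", f"day{day} {hh:02d}:00")

print(len(rot.names()))            # 20
print(rot.names()["hourly.0"])     # day20 23:00
```

However long the schedule runs, at most 2 + 6 + 12 = 20 snapshots exist at once; taking a new snapshot of a class deletes the oldest one of that class once its quota is full.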
regards, Steve Gremban gremban@ti.com