1) Minimizing downtime was one of our major requirements. That includes downtime due to admin tasks like patch installs and OS upgrades as well as downtime due to hardware malfunction.
Fast boot time (2 minutes or so) is ensured by keeping a snapshot of the filesystem in NVRAM and journaling NFS requests that arrived since the snapshot. If the system is shut off without halting, upon bootup it will write the snapshot from NVRAM and replay the journaled NFS requests, guaranteeing a consistent filesystem. You don't need to run "fsck" or some other filesystem consistency checker, which can take a long time to run.
Most patch installs and OS upgrades are easy: untar, download, and reboot takes all of 5 minutes.
The toaster has RAID 4 to protect against disk failure.
A disk is the most likely component to fail. When a disk dies the system runs in degraded mode and your filesystem is at risk from another disk failure. The toaster can be configured with a hot spare so that RAID redundancy can be restored quickly and automatically.
2) Snapshots, when used correctly, can save a lot of time and wasted effort. If your data is repeatedly created, deleted, and recreated, as in edit/compile/run cycles or simulations rerun over and over with incremental changes, then snapshots are not a good thing. They will take over all available filesystem space (caveat: this depends on how often snapshots are taken, the size of your filesystem, and the amount of data that is created/deleted/changed).
I restrict the data going to our toasters to that created interactively by our designers using Cadence. An IC design database may have 6 or more people working in it and if a cell disappears or the database becomes corrupted (for any of a million reasons) we only lose at most a few hours of effort by copying back a previous hourly from snapshots.
We have the following policy: "snap sched 2 6 12", that is, 2 weeklies, 6 nightlies, and 12 hourlies.
regards, Steve Gremban gremban@ti.com
On Mon, 31 Mar 1997, Steve Gremban wrote:
- Snapshots, when used correctly, can save a lot of time and wasted effort. If your data is repeatedly created, deleted, and recreated, as in edit/compile/run cycles or simulations rerun over and over with incremental changes, then snapshots are not a good thing. They will take over all available filesystem space (caveat: this depends on how often snapshots are taken, the size of your filesystem, and the amount of data that is created/deleted/changed).
This has been a sticky problem with our F220. We currently use the default schedule in the sysadmin guide, more or less, and that means once you get close to running out of disk space you have a problem -- deleting files just shuffles them from the filesystem to .snapshot, so there's no net savings in disk space. And the .snapshot space will cheerfully spill over the 20% default. The only option then seems to be manually removing snapshots to free up disk space, or else wait 10 or so days for snapshots to finally be overwritten. (A dicey proposition with a 95+% full filer.)
I'd like to move to a snapshot schedule with more hourly shots, and few if any daily or weekly shots, so things would naturally expire faster. But the guy who does the tape backups here says it's verrryyy slloowww restoring things from Exabyte dumps of the filer, so that may not be an option. Getting away from long waits for Exabyte restores was one big selling point of the filer.
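To make the "expire faster" reasoning concrete, here is a rough sketch in Python of how long a deleted file's blocks might stay held by snapshots under a given schedule. It is an approximation, not the filer's actual accounting: the function name `worst_case_pin_days` is made up for illustration, and it assumes one weekly snapshot per 7 days, one nightly per day, and hourlies spread over the day at the rate given by the hypothetical `hourlies_per_day` parameter.

```python
def worst_case_pin_days(weeklies: int, nightlies: int, hourlies: int,
                        hourlies_per_day: int = 24) -> float:
    """Rough upper bound, in days, on how long deleted data stays in snapshots.

    Assumptions (not taken from any manual): a weekly is taken every 7 days,
    a nightly every day, and hourlies_per_day hourlies are taken each day.
    A snapshot of a given kind is dropped once `count` newer ones of the
    same kind exist, so the oldest one is at most count * interval old.
    """
    if hourlies_per_day <= 0:
        raise ValueError("hourlies_per_day must be positive")
    weekly_days = weeklies * 7.0
    nightly_days = nightlies * 1.0
    hourly_days = hourlies / hourlies_per_day
    # The longest-lived kind of snapshot decides when the space comes back.
    return max(weekly_days, nightly_days, hourly_days)


if __name__ == "__main__":
    print(worst_case_pin_days(2, 6, 12))   # 2 weeklies dominate
    print(worst_case_pin_days(0, 0, 12))   # hourlies only
```

Under these assumptions a schedule with 2 weeklies can hold deleted data for about two weeks, while an hourly-only schedule of 12 snapshots releases it in about half a day, at the cost of falling back to tape for anything older.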
We have the following policy: "snap sched 2 6 12", that is, 2 weeklies, 6 nightlies, and 12 hourlies.
Interesting. I use "snap sched 1 6 7@8,10,12,14,16,18,20". This tends to translate to about a 10 - 20% snapshot reserve depending on our filer's usage pattern... I like to keep that snapshot reserve 50 - 75% full most of the time so I have a buffer to handle large deletes gracefully when they occur. What snap reserve do you use? I am curious to hear from other admins on this issue as well.
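For comparing schedules side by side, a small Python sketch can split the two "snap sched" strings quoted in this thread into their parts. This is only an illustration: the `SnapSchedule` type and `parse_snap_sched` function are hypothetical names, and the sketch assumes the layout described here (weeklies, nightlies, hourlies, with an optional "@" list of hours), not the filer's full command syntax.

```python
from typing import NamedTuple, Optional, Tuple


class SnapSchedule(NamedTuple):
    weeklies: int
    nightlies: int
    hourlies: int
    hours: Optional[Tuple[int, ...]]  # None when no "@" hour list was given


def parse_snap_sched(command: str) -> SnapSchedule:
    """Parse a string such as 'snap sched 1 6 7@8,10,12' (assumed layout)."""
    tokens = command.split()
    if tokens[:2] != ["snap", "sched"] or len(tokens) != 5:
        raise ValueError(f"unexpected schedule string: {command!r}")
    weeklies, nightlies = int(tokens[2]), int(tokens[3])
    # The last field is either "N" or "N@h1,h2,...".
    count, _, hour_list = tokens[4].partition("@")
    hours = tuple(int(h) for h in hour_list.split(",")) if hour_list else None
    return SnapSchedule(weeklies, nightlies, int(count), hours)


if __name__ == "__main__":
    print(parse_snap_sched("snap sched 2 6 12"))
    print(parse_snap_sched("snap sched 1 6 7@8,10,12,14,16,18,20"))
```

The sketch leaves `hours` as `None` when no list is given, because it does not assume what times the filer picks by default.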
Someone else mentioned that when both the filesystem and snapshots are full, it can be very tricky and may require manually deleting snapshots. This is quite true. As I think the manual mentions, "preventing" a snapshot from growing could lead to strange behavior (i.e. rm *failing* due to lack of disk space). The only real workaround is to keep a snap reserve large enough for your desired schedule. If you need to, buy more disk. :)
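The reserve arithmetic can be sketched in a few lines of Python. The function name `snapshot_reserve_status` and the sizes below are invented for illustration, and the sketch assumes the reserve is a plain percentage of total disk space, with any snapshot usage beyond it spilling into the space meant for the active filesystem.

```python
from typing import Tuple


def snapshot_reserve_status(total_gb: float, reserve_pct: float,
                            snapshot_used_gb: float) -> Tuple[float, float]:
    """Return (percent of reserve in use, GB spilled past the reserve).

    Assumes the reserve is reserve_pct percent of total_gb; all figures
    here are hypothetical inputs, not values read from a filer.
    """
    if reserve_pct <= 0:
        raise ValueError("reserve_pct must be positive")
    reserve_gb = total_gb * reserve_pct / 100.0
    pct_of_reserve = 100.0 * snapshot_used_gb / reserve_gb
    # Anything above the reserve is taken from the active filesystem.
    spill_gb = max(0.0, snapshot_used_gb - reserve_gb)
    return pct_of_reserve, spill_gb


if __name__ == "__main__":
    # Made-up example: 100 GB of disk with a 20% reserve.
    print(snapshot_reserve_status(100, 20, 15))  # inside the 50-75% buffer range
    print(snapshot_reserve_status(100, 20, 25))  # snapshots have spilled over
```

A reserve that sits around 50 - 75% full leaves room for a large delete to land in snapshots without spilling; once the first number passes 100, the second number is space the active filesystem has lost.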
Bruce