On Fri, 12 Nov 1999, Alan R. White wrote:
> I'm hoping the collective experience on the group can help us get up to speed with filers - the sales guys are really good about answering the questions but I suspect you're going to be a tad more impartial ;-)
Netapp brought a large bank/brokerage firm by the other day for a tour, so I just went through about two hours of the same kinds of questions. ;-)
> The claims of simplicity, reliability, minimal downtime, performance for filing (compared to NT) and snapshots are what attracted me to look at the boxes - I haven't seen any horror stories in the archives - is this too good to be true? The FC-AL stuff recently looked a bit dodgy.
The on-board FC-AL problems with the F700s are a notable exception, which Netapp has temporarily addressed by providing a slot-based adapter (it uses up another slot, but...). Otherwise, my own experience is that the Netapps are generally as good as people say. I do have a couple of colleagues who have had lots of grief with their filer hardware (one guy hates Netapp... "they keep sending us busted hardware!"), but I'd chalk that up to either bad luck or misconfigured filers.
> How many folks actually cluster their filers? Claims of 99.997% uptime without clustering sound, once again, too good.
Our very first filer, an F220 with 14 4GB SCSI drives, rolled the 497-day uptime counter under Data ONTAP 4.2a. It ended up with about 520 days of uptime before we had to unplug it and move it to another cabinet. It used to hold our newsfeed (back when you could get a reasonable amount into a few dozen gigs) and still contains the data for our corporate web site, personal web pages, online registration and DNS zone files. Not an idle filer. :) It took us about 30 minutes of downtime to move the filer and shelves. Over 520 days, that works out to 99.996% uptime. If you don't count a planned outage as "downtime", the observed availability rises to 99.9997% (about a 2-minute outage from a reboot induced by a parity error in the read cache). This is on a non-clustered filer.
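(For anyone who wants to check the arithmetic, it's a one-liner with bc; the 520 days, 30 minutes and 2 minutes are just the numbers from the story above.)

    # 30 minutes of downtime over 520 days -- about 99.996%
    echo 'scale=8; (1 - 30 / (520*24*60)) * 100' | bc

    # counting only the unplanned 2-minute outage -- about 99.9997%
    echo 'scale=8; (1 - 2 / (520*24*60)) * 100' | bc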
Generally, I would say you can expect 99.999% availability (two 3-minute outages per year) once you get past the initial "bathtub curve" hardware reliability syndrome. Clustering will protect you against large-scale failures (motherboard/CPU frying, NVRAM board falling over, double power supply failure, etc.) but, ironically, not against minor panics that cause a filer to do a quick reboot. The failover takes about two minutes to complete, so it is not suitable for an environment where continuous availability is required. To achieve that, your applications will have to tolerate their storage going away for a few minutes... you cannot rely solely on the filers to provide CA.
> Is the clustering simple primary-failover or can we do n-way clusters with load sharing, etc.?
A cluster is an active-active load-sharing pair. I would be very keen to have N+1 clustering, where each filer can fail over for one of its neighbours and the ring can be grown indefinitely. I have a group of four filers for which I would desperately love that ability (currently they are two separate clustered pairs).
> Is the cluster setup really a one-command, 15-minute job?
It took me less than 15 minutes the first time I clustered two filers together. There is a step-by-step guide to connecting the X/Y ServerNet interconnects, and a few commands to run along the way to make sure everything is hunky-dory. Once everything is connected, turning on clustering is, of course, a single command: "cf enable".
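For reference, the whole thing boils down to a handful of commands once the interconnect and shelves are cabled. I'm typing these from memory (run via rsh from an admin host, with "toaster1" a made-up filer name), so double-check the exact names against the cluster guide for your ONTAP release:

    # sanity check: can this head see its partner over the interconnect?
    rsh toaster1 cf status

    # the actual one-command step that turns clustering on
    rsh toaster1 cf enable

    # later, for testing or planned maintenance:
    rsh toaster1 cf takeover    # toaster1 takes over its partner's disks and IPs
    rsh toaster1 cf giveback    # ...and hands them back once the partner is healthy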
> Users doing their own restores on NT clients by mapping the snapshots looks like a good idea. Is it usable in the real world? It would save us heaps of hassle with classic 'ask IT to do it' restores.
Yup... this is one of the best features of ONTAP. Imagine a backup set with a couple dozen filesystems and about 7 million inodes. A reporting script just went awry and zeroed out about 8000 random RCS archive files before someone caught it. These files are scattered all around the filesystem. The last full backup was from 5 days ago, and the daily differential was taken 15 hours ago. Imagine how much work it would be to pull those files off the tapes, figure out which *,v files need to be copied back to disk, and what your developers will be doing in the meantime. Then realize the best you can do is 15-hour-old versions.
If you have snapshots, you tell everyone to hold off on using RCS for the next little while, spend 5 or 10 minutes on a script that will search out the truncated files, find the most recent copy in snapshots and pull it back to the active filesystem. That script runs for about 30 minutes scouring the filesystem and exhumes RCS files that are at most an hour or two old. Your users are unproductive for about 45 minutes instead of having to come back the next day, and you've saved yourself hours of work and aggravation.
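For flavour, something along these lines would do it (the mount point is made up, the snapshot names depend on your schedule, and the zero-length test assumes the files were truncated to nothing):

    #!/bin/sh
    # Find RCS archives that got truncated to zero bytes, then pull the
    # newest intact copy out of the .snapshot directories (every directory
    # on a filer volume has a readable .snapshot subdirectory).

    TREE=/filer/src                                       # made-up NFS mount point
    SNAPS="hourly.0 hourly.1 hourly.2 hourly.3 nightly.0" # newest first

    find $TREE -name '*,v' -size 0 -print |
    while read f
    do
        dir=`dirname $f`
        base=`basename $f`
        for snap in $SNAPS
        do
            copy="$dir/.snapshot/$snap/$base"
            if [ -s "$copy" ]; then
                echo "restoring $f from $snap"
                cp -p "$copy" "$f"
                break
            fi
        done
    done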
This actually happened where I worked (except with thousands of small accounting-related files instead of RCS, but with effectively the same results). It may only happen once or twice in a lifetime, but you'll be glad you have the tools to deal with it when it does.
> Any good rule-of-thumb sizing advice for the amount of space to reserve for snapshots?
> Similarly for automated snapshot schemes, does anyone do multiple snapshots intra-day and maybe keep one for a longer period, e.g. keep a midnight snapshot for x days?
Hard to say. Your snapshot reserve will depend on your schedule, and your schedule will depend on your users' habits. On our filers that hold business customer data (web pages, mail, etc.) we snapshot every other hour between 8 am and 6 pm, keeping the four most recent. Five nightly snapshots are kept (taken at midnight), and we don't bother with any weeklies (too much file turnover). On those filers, the 10% snap reserve stays pretty full.
One caveat: the disk space set aside by "snap reserve" does *not* limit the size of your snapshots! This is probably something that should be made more clear in the man page. It only prevents the active filesystem from chewing into snapshot space, not the other way around. IOW, don't crank up the schedule to keep the last 14 days of snapshots, and assume the filer will expire old ones as it runs out of snap reserve.
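To make our schedule concrete, here is roughly what it looks like on a multi-volume ONTAP (syntax from memory, so check the snap man page; "toaster1" and "vol0" are placeholders):

    # 0 weekly, 5 nightly, and keep 4 hourlies taken at 8,10,12,14,16,18
    rsh toaster1 snap sched vol0 0 5 4@8,10,12,14,16,18

    # the 10% reserve mentioned above
    rsh toaster1 snap reserve vol0 10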
> Is anyone prepared to comment, privately or otherwise, on any recent comparisons they've done with Celerra and Auspex? I understand the cost thing with EMC but loads of people seem to buy them still. This is not intended as flame bait for all the NetApp advocates.
If you ignore the price tag (hardware and post-sales support contracts), the EMC looks great on paper. I have a couple of inherited Symmetrix frames now, and from what my guys tell me, that thing is just a pain in the butt to manage. You have drives that have "hypervolumes" overlaid, arranged into 3-disk "RAID-S" sets, and then the host system is running Veritas to manage the logical volumes, etc., etc. I don't know a whole lot about the Symmetrix product yet, but it seems very messy and difficult to understand. EMC's professional services group was contracted to configure the frame, so I assume this is as good as it gets with them (perhaps an incorrect assumption). I imagine the Celerra will have many of the same underlying issues, except now you have to deal with EMC's NFS implementation as well.
> Any advice on what we should include in our eval to really test the box out?
Try the extreme cases like very large directories (10k's or 100k's of files), very deep directory hierarchies, large files (2GB+) and intense file locking activity (something that always sucks over NFS). We have one (poorly-written, IMHO) application that blew chunks on NFS because of extremely intensive locking activity.
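If it helps, a brain-dead script along these lines will cover the first three cases; the mount point and counts are made up, and locking is best exercised with your real application (or a small fcntl() test program) rather than from the shell:

    #!/bin/sh
    # Crude NFS torture test against a scratch area on the eval filer.
    cd /filer/test || exit 1

    # 1. one huge flat directory: 100,000 tiny files, then a full listing
    mkdir bigdir
    i=0
    while [ $i -lt 100000 ]
    do
        touch bigdir/f.$i
        i=`expr $i + 1`
    done
    ls bigdir | wc -l    # time this, and a lookup of a single file, by hand

    # 2. a stupidly deep directory tree
    mkdir deep && cd deep
    i=0
    while [ $i -lt 1000 ]
    do
        mkdir d && cd d
        i=`expr $i + 1`
    done
    cd /filer/test

    # 3. a couple of 2GB+ files, written and read back (this is exactly
    #    where old NFS clients and non-largefile OSes tend to fall over)
    dd if=/dev/zero of=big1 bs=1024k count=2200
    dd if=big1 of=/dev/null bs=1024k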