Hi All,
I've got a toaster onsite ready to start evaluating. If everything works out OK there's terabytes of business for NetApp at our shop.
I'm hoping the collective experience on the group can help us get up to speed with filers - the sales guys are really good about answering the questions but I suspect you're going to be a tad more impartial ;-)
Hope you don't mind the newbie questions below - if you prefer mail me directly and I can summarise back to the list.
The claims of simplicity, reliability, minimal downtime, performance for filing (compared to NT) and snapshots are what attracted me to look at the boxes - I haven't seen any horror stories in the archives - is this too good to be true? The FC-AL stuff recently looked a bit dodgy.
How many folks actually cluster their filers? Claims of 99.997% uptime without clustering sound, once again, too good.
Is the clustering simple primary-failover or can we do n-way clusters with load sharing etc?
Is the cluster setup really a one-command 15 minute job?
User restore on their NT clients by mapping the snapshots looks a good idea. Is it usable in the real world? It would save us heaps of hassle with classic 'ask IT to do it' restores.
Any good rule of thumb sizing advice for the amount of space to reserve for snapshots?
Similarly for automated snapshot schemes, does anyone do multiple snapshots intra-day and maybe keep one for a longer period, e.g. keep a midnight snapshot for x days.
Is SnapMirror up to the job of keeping an almost real-time remote replica, i.e. snap every minute if the network's up to it? Are there any operational issues around this stuff?
Is anyone prepared to comment, privately or otherwise, on any recent comparisons they've done with Celerra and Auspex? I understand the cost thing with EMC but loads of people seem to buy them still. This is not intended as flame bait for all the NetApp advocates.
Any advice on what we should really include in our eval to really test the box out?
...or indeed comments in general that would be useful for us.
TIA, Al
Is SnapMirror up to the job of keeping an almost real-time remote replica, i.e. snap every minute if the network's up to it? Are there any operational issues around this stuff?
This one I can answer. Keep in mind that SnapMirror is not targeted to be an "almost real-time" remote replica. Its performance depends greatly on the network, load, filesystem size, etc. I wouldn't expect it to be a real-time style replica. If the network is up to it you can get 24M/sec over the wire. Snapshots are created for each transfer, which also takes time.
So figure out how much data changes per minute, how fast your network is, and how much load you are planning on, and you'll start to see how fast the updates can happen. I have seen situations where a per-minute schedule can be achieved, but I would lean more towards every 15 minutes or so; really, it depends on your situation.
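To put some rough numbers on that, here's a back-of-the-envelope sketch (Python) of the kind of arithmetic I mean. The change rate, link speed and per-transfer overhead below are made-up figures, not measurements - plug in your own:

    # Back-of-the-envelope sketch (Python). All numbers are assumptions --
    # substitute your own measurements.
    change_rate = 100.0      # MB of data changed per minute on the source volume
    wire_speed = 3.0 * 60    # MB per minute the link really sustains (3 MB/sec here)
    overhead = 1.0           # minutes of per-transfer snapshot/setup cost (a guess)

    if change_rate >= wire_speed:
        print("The link can never catch up; replication will fall behind.")
    else:
        # An update every T minutes has to ship change_rate*T MB and pay the
        # overhead before the next update is due, so solve for the smallest T.
        min_interval = overhead / (1.0 - change_rate / wire_speed)
        print("Shortest sustainable update interval: about %.1f minutes" % min_interval)

With those particular guesses you end up around two and a quarter minutes as the floor, which is why I'd schedule with plenty of slack rather than chase a per-minute update.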
I think it is a great product (I am biased) but I want to make sure your expectations are set correctly.
Mike Federwisch Network Appliance
On Fri, 12 Nov 1999, Alan R. White wrote:
The claims of simplicity, reliability, minimal downtime, performance for filing (compared to NT) and snapshots are what attracted me to look at the boxes - I haven't seen any horror stories in the archives - is this too good to be true? The FC-AL stuff recently looked a bit dodgy.
I think there are definite issues surrounding the QA of LRCs (Loop Resiliency Circuits) which, well, proved to be less than resilient to failures.
How many folks actually cluster their filers? Claims of 99.997% uptime without clustering sound, once again, too good.
All but one of our FC-AL filers are clustered AFAIK.
Is the clustering simple primary-failover or can we do n-way clusters with load sharing etc?
Yes and no, respectively. Since the two filers in the cluster will each have their own filesystems, they will be doing their own share, so one head will not just be sitting there. OTOH, when you do fail over you will be putting their combined load on one head.
Is the cluster setup really a one-command 15 minute job?
Well, several setup commands, but 15 minutes sounds a bit long.
User restore on their NT clients by mapping the snapshots looks a good idea. Is it usable in the real world? It would save us heaps of hassle with classic 'ask IT to do it' restores.
Well, you have to educate the users. This, I think, is our biggest problem with snapshots. People who have quotas think the snapshots count against their quota.
Any good rule of thumb sizing advice for the amount of space to reserve for snapshots?
This depends on the volatility of your filesystems. For home directories with 100MB quotas and snapshots every 4 hours the default 20% is much more than enough.
Similarly for automated snapshot schemes, does anyone do multiple snapshots intra-day and maybe keep one for a longer period, e.g. keep a midnight snapshot for x days.
This depends on the purpose of the filesystem, but yes.
Is SnapMirror up to the job of keeping an almost real-time remote replica, i.e. snap every minute if the networks up to it? Are there any operational issues around this stuff?
Uggghhh, I don't know about realtime, we do it every hour. This seems to be sufficient for our needs at this time.
Is anyone prepared to comment, privately or otherwise, on any recent comparisons they've done with Celerra and Auspex?
Hmmmm, I would say Auspex and NetApps are equally troublesome. I tend to favor NetApp for their cleaner design. I haven't played with Celera.
I understand the cost thing with EMC but loads of people seem to buy them still. This is not intended as flame bait for all the NetApp advocates.
From what I hear we've also had our share of problems with EMC.
Any advice on what we should really include in our eval to really test the box out?
If you can invest the people and time to put production-level load, in an extremely production-like environment, on all of these solutions, I would do so to determine which one is best for your application in your environment. If you can't invest the time, flip a coin. I think you'll be just as happy with any one of those.
...or indeed comments in general that would be useful for us.
Many UNIX bigots will tend to favor Auspex or EMC on UNIX because of their UNIX interface. I think that if you remember what kind of interface a file server has, you're spending too much time with it. The promise of a dedicated NFS server is best expressed by a quote from very annoying infomercials: you should "set it and forget it." If that isn't true, dedicated file servers are only as good as conventional servers.
Tom
On Fri, 12 Nov 1999, Alan R. White wrote:
I'm hoping the collective experience on the group can help us get up to speed with filers - the sales guys are really good about answering the questions but I suspect you're going to be a tad more impartial ;-)
Netapp brought a large bank/brokerage firm by the other day for a tour, so I just went through about two hours of the same kinds of questions. ;-)
The claims of simplicity, reliability, minimal downtime, performance for filing (compared to NT) and snapshots are what attracted me to look at the boxes - I haven't seen any horror stories in the archives - is this too good to be true? The FC-AL stuff recently looked a bit dodgy.
The on-board FC-AL problems with the F700's are a notable exception, which Netapp has temporarily addressed by providing a slot-based adapter (uses up another slot, but...). Otherwise, my own experiences show that the Netapps are generally as good as people say. I do have a couple colleagues who have had lots of grief with their filer hardware (one guy hates Netapp... "they keep sending us busted hardware!"), but I'd chalk that up to either bad luck or misconfigured filers.
How many folks actually cluster their filers? Claims of 99.997% uptime without clustering sound, once again, too good.
Our very first filer, an F220 with 14 4GB SCSI drives, rolled the uptime counter in Data ONTAP 4.2a, 497 days. It ended up with about 520 days or so of uptime before we had to unplug it and move it to another cabinet. It used to hold our newsfeed (back when you could get a reasonable amount into a few dozen gigs) and still contains the data for our corporate web site, personal web pages, online registration and DNS zone files. Not an idle filer. :) It took us about 30 minutes of downtime to move the filer and shelves. Over 520 days, that works out to 99.996% uptime. If you don't consider a planned outage as "downtime", then the observed availability goes to 99.9997% (about a 2-minute outage because of a reboot induced by a parity error in the read cache). This is on a non-clustered filer.
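If you want to check that arithmetic yourself, it's just downtime over elapsed time. A quick sketch (Python) using the figures above:

    # Quick availability arithmetic (Python), using the figures quoted above.
    uptime_days = 520
    total_minutes = uptime_days * 24 * 60

    for label, downtime_min in [("counting the planned 30-minute move", 30.0),
                                ("unplanned only (the 2-minute reboot)", 2.0)]:
        availability = 100.0 * (1.0 - downtime_min / total_minutes)
        print("%s: %.4f%% uptime" % (label, availability))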
Generally, I would say you can expect 99.999% availability (two 3-minute outages per year) once you get past the initial "bathtub curve" hardware reliability syndrome. Clustering will protect you against large-scale failures (motherboard/CPU frying, NVRAM board falling over, double power supply failure, etc.) but, ironically, not against minor panics that cause a filer to do a quick reboot. The failover takes about two minutes to complete and is thus not suitable for an environment where continuous availability is required. To achieve that, your applications will have to deal with their storage medium going away for a few minutes... you cannot rely solely on the filers to provide CA.
Is the clustering simple primary-failover or can we do n-way clusters with load sharing etc?
A cluster is an active-active load-sharing pair. It would be very keen to have N+1 clustering, where each filer can fail over for one of its neighbours and the ring can be grown indefinitely this way. I have a cluster of four filers for which I would desperately love to have this ability (currently they are two separate clustered pairs).
Is the cluster setup really a one-command 15 minute job?
It took me less than 15 minutes the first time I clustered two filers together. There is a step-by-step guide to connecting the X/Y ServerNet interconnects, and a few commands to run along the way to make sure everything is hunky-dory. Once everything is connected, of course, turning on clustering is a single command: "cf enable".
User restore on their NT clients by mapping the snapshots looks a good idea. Is it usable in the real world? It would save us heaps of hassle with classic 'ask IT to do it' restores.
Yup... this is one of the best features of ONTAP. Imagine a backup set with a couple dozen filesystems and about 7 million inodes. A reporting script just went awry and zeroed out about 8000 random RCS archive files before someone caught it. These files are scattered all around the filesystem. The last full backup was from 5 days ago, and the daily differential was taken 15 hours ago. Imagine how much work it would be to pull those files off the tapes, figure out which *,v files need to be copied back to disk and what your developers will be doing in the meantime. Then realize the best you can do is restore 15-hour-old versions.
If you have snapshots, you tell everyone to hold off on using RCS for the next little while, spend 5 or 10 minutes on a script that will search out the truncated files, find the most recent copy in snapshots and pull it back to the active filesystem. That script runs for about 30 minutes scouring the filesystem and exhumes RCS files that are at most an hour or two old. Your users are unproductive for about 45 minutes instead of having to come back the next day, and you've saved yourself hours of work and aggravation.
This actually happened where I worked (except with thousands of small accounting-related files instead of RCS, but with effectively the same results). It may only happen once or twice in a lifetime, but you'll be glad you have the tools to deal with it when it does.
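For what it's worth, that kind of script really is only a few lines. Here's a minimal sketch (Python); the mount point, the ",v" filenames and the zero-length test are assumptions specific to a mishap like ours, and it relies on the filer's .snapshot directories being reachable over NFS:

    # Minimal sketch (Python): pull truncated ,v files back from the newest
    # snapshot copy.  Mount point, filenames and the zero-length test are all
    # assumptions about this particular mishap.
    import os, shutil

    ROOT = "/mnt/filer/src"    # hypothetical NFS mount of the affected volume

    for dirpath, dirnames, filenames in os.walk(ROOT):
        if ".snapshot" in dirnames:
            dirnames.remove(".snapshot")      # don't walk into the snapshots themselves
        snapdir = os.path.join(dirpath, ".snapshot")
        if not os.path.isdir(snapdir):
            continue
        for name in filenames:
            path = os.path.join(dirpath, name)
            if not name.endswith(",v") or os.path.getsize(path) != 0:
                continue                      # only the truncated RCS archives
            candidates = []
            for snap in os.listdir(snapdir):
                old = os.path.join(snapdir, snap, name)
                if os.path.isfile(old) and os.path.getsize(old) > 0:
                    candidates.append((os.path.getmtime(old), old))
            if candidates:
                candidates.sort()
                shutil.copy2(candidates[-1][1], path)   # newest intact copy wins
                print("restored %s" % path)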
Any good rule of thumb sizing advice for the amount of space to reserve for snapshots?
Similarly for automated snapshot schemes, does anyone do multiple snapshots intra-day and maybe keep one for a longer period, e.g. keep a midnight snapshot for x days.
Hard to say. Your snapshot reserve will depend on your schedule, and your schedule will depend on your users' habits. On our filers that hold business customer data (web pages, mail, etc.) we snapshot every other hour between 8 am and 6 pm, keeping the four most recent. Five nightly snapshots are kept (taken at midnight), and we don't bother with any weeklies (too much file turnover). On those filers, the 10% snap reserve is kept pretty full.
One caveat: the disk space set aside by "snap reserve" does *not* limit the size of your snapshots! This is probably something that should be made more clear in the man page. It only prevents the active filesystem from chewing into snapshot space, not the other way around. IOW, don't crank up the schedule to keep the last 14 days of snapshots, and assume the filer will expire old ones as it runs out of snap reserve.
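So rather than relying on the reserve to police itself, it's worth estimating up front how much snapshot space your schedule will pin. A rough sketch (Python); the churn and volume figures are made-up numbers, not measurements from our filers:

    # Rough snap-reserve sizing sketch (Python).  The churn figures are
    # assumptions -- measure your own before trusting the answer.
    volume_gb = 200.0
    daily_churn_gb = 3.0      # data overwritten or deleted per day
    retention_days = 5        # age of the oldest snapshot you keep (five nightlies)

    # Blocks freed or overwritten in the active filesystem stay pinned until the
    # last snapshot referencing them expires, so a crude upper bound is:
    reserve_gb = daily_churn_gb * retention_days
    print("Estimated snapshot space: %.0f GB (about %.0f%% of the volume)"
          % (reserve_gb, 100.0 * reserve_gb / volume_gb))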
Is anyone prepared to comment, privately or otherwise, on any recent comparisons they've done with Celerra and Auspex? I understand the cost thing with EMC but loads of people seem to buy them still. This is not intended as flame bait for all the NetApp advocates.
If you ignore the price tag (hardware and post-sales support contracts), the EMC looks great on paper. I have a couple of inherited Symmetrix frames now, and from what my guys tell me, that thing is just a pain in the butt to manage. You have drives that have "hypervolumes" overlaid, arranged into 3-disk "RAID-S" sets, and then the host system is running Veritas to manage the logical volumes, etc., etc. I don't know a whole lot about the Symmetrix product yet, but it seems very messy and difficult to understand. EMC's professional services group was contracted to configure the frame, so I assume this is as good as it gets with them (perhaps an incorrect assumption). I imagine the Celerra will have many of the same underlying issues, except now you have to deal with EMC's NFS implementation as well.
Any advice on what we should really include in our eval to really test the box out?
Try the extreme cases like very large directories (10k's or 100k's of files), very deep directory hierarchies, large files (2GB+) and intense file locking activity (something that always sucks over NFS). We have one (poorly-written, IMHO) application that blew chunks on NFS because of extremely intensive locking activity.
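If you want a quick and dirty way to generate that kind of load, a throwaway script along these lines will do. This is just a sketch - the mount point and the counts are arbitrary, scale them to whatever hurts:

    # Throwaway eval-load sketch (Python): a very large flat directory plus a
    # burst of advisory locking over NFS.  Mount point and counts are arbitrary.
    import os, fcntl

    TESTDIR = "/mnt/filer/evaltest"    # hypothetical NFS mount of the eval filer

    if not os.path.isdir(TESTDIR):
        os.makedirs(TESTDIR)

    # 1. 100,000 small files in one directory.
    for i in range(100000):
        f = open(os.path.join(TESTDIR, "file%06d" % i), "w")
        f.write("x" * 512)
        f.close()

    # 2. Intense file locking (exercises the filer's NLM/lockd path).
    lockfile = open(os.path.join(TESTDIR, "locktest"), "w")
    for i in range(10000):
        fcntl.lockf(lockfile, fcntl.LOCK_EX)
        lockfile.write("pass %d\n" % i)
        lockfile.flush()
        fcntl.lockf(lockfile, fcntl.LOCK_UN)
    lockfile.close()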
Any advice on what we should really include in our eval to really test the box out?
I'd try to mirror/simulate as closely as possible whatever workload you expect to place on your filers when you put them into production.
Don't forget to eval whatever backup/restore sw you are looking at at the same time.
You might want to test out NA customer support. Put you filer into some situations that you need to call support for and see if support can provide you the assistance you need.
"Brian" == Brian Tao taob@risc.org writes: Brian> Try the extreme cases like very large directories (10k's or 100k's Brian> of files), very deep directory hierarchies, large files Brian> (2GB+) and intense file locking activity (something that Brian> always sucks over NFS).
It's neat to see how a filer performs under those conditions, but I expect you have a pretty good idea of the workload you plan to place on your filers. You may know that you aren't ever going to have to deal with 3GB files or 100k entry directories.
So, I'd definitely try some extreme cases that result from events beyond your control (i.e., hw failures):
Try pulling a disk and testing out RAID reconstruct. Turn the filer off in degraded mode and see if it comes back okay. If you are thinking about clustering, get a clustered pair on eval. Exercise the clustering. Try every scenario you can think of to force a failover (e.g., turn a filer off, pull a filer's fan, break a filer's FC-AL A-loop).
Try pulling a filer's disk, then while it is doing a reconstruct, turn it off to force a takeover. See if the partner does a proper takeover and begins a reconstruct. Pull a disk on the opposite filer so that it is doing two reconstructs at once.
Try adding a shelf to a clustered pair w/o having both filers down. This is supposed to work and is a documented procedure from NA.
Just for kicks, try out some catastrophic things so you can see what they look like. Pull two disks from the same RAID group. Try turning off a shelf.
It has been my experience that NA's are great once you get them up and running, if you don't touch them. If you do _anything_ out of the norm (i.e., any h/w maintenance procedure, or try to utilize any recently introduced feature, such as SnapMirror), you have a 50/50 chance of exercising some bug. Our filers have actually been _less_ reliable since we clustered them. We've had two extended downtimes (> 1 hour) when trying to take advantage of the clustering in order to perform zero-downtime maintenance. I'm not sure I ever want to type 'cf takeover' on my filers again.
Before we clustered our F740's, we had an F540 and an F630 that never went down. We then ran an F740 for over a year with no trouble. When we finally got a pair of F740's and clustered them, we started having problems.
Good luck. They are truly wonderful devices when they work.
j. -- Jay Soffian jay@cimedia.com UNIX Systems Engineer Cox Interactive Media