On Fri, 12 Nov 1999, Alan R. White wrote:
> I'm hoping the collective experience on the group can help us get up to speed with filers - the sales guys are really good about answering the questions but I suspect you're going to be a tad more impartial ;-)
Netapp brought a large bank/brokerage firm by the other day for a tour, so I just went through about two hours of the same kinds of questions. ;-)
> The claims of simplicity, reliability, minimal downtime, performance for filing (compared to NT) and snapshots are what attracted me to look at the boxes - I haven't seen any horror stories in the archives - is this too good to be true? The FC-AL stuff recently looked a bit dodgy.
The on-board FC-AL problems with the F700s are a notable exception, which Netapp has temporarily addressed by providing a slot-based adapter (it uses up another slot, but...). Otherwise, my own experience is that the Netapps are generally as good as people say. I do have a couple of colleagues who have had lots of grief with their filer hardware (one guy hates Netapp... "they keep sending us busted hardware!"), but I'd chalk that up to either bad luck or misconfigured filers.
> How many folks actually cluster their filers? Claims of 99.997% uptime without clustering sound, once again, too good.
Our very first filer, an F220 with 14 4GB SCSI drives, rolled the 497-day uptime counter under Data ONTAP 4.2a. It ended up with about 520 days of uptime before we had to unplug it and move it to another cabinet. It used to hold our newsfeed (back when you could get a reasonable amount into a few dozen gigs) and still contains the data for our corporate web site, personal web pages, online registration and DNS zone files. Not an idle filer. :) It took us about 30 minutes of downtime to move the filer and shelves. Over 520 days, that works out to 99.996% uptime. If you don't count a planned outage as "downtime", the observed availability rises to 99.9997% (about a 2-minute outage from a reboot induced by a parity error in the read cache). This is on a non-clustered filer.
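(For anyone who wants to check the arithmetic, it's a one-liner with bc; the 520 days, 30 minutes and 2 minutes are just the numbers from the story above.)

    # 30 minutes of downtime over 520 days -- about 99.996%
    echo 'scale=8; (1 - 30 / (520*24*60)) * 100' | bc

    # counting only the unplanned 2-minute outage -- about 99.9997%
    echo 'scale=8; (1 - 2 / (520*24*60)) * 100' | bc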
Generally, I would say you can expect 99.999% availability (two 3-minute outages per year) once you get past the initial "bathtub curve" hardware reliability syndrome. Clustering will protect you against large-scale failures (motherboard/CPU frying, NVRAM board falling over, double power supply failure, etc.) but, ironically, not against minor panics that cause a filer to do a quick reboot. The failover takes about two minutes to complete, so it is not suitable for an environment where continuous availability is required. To achieve that, your applications will have to tolerate their storage going away for a few minutes... you cannot rely solely on the filers to provide CA.
> Is the clustering simple primary-failover or can we do n-way clusters with load sharing, etc.?
A cluster is an active-active load-sharing pair. I would be very keen to have N+1 clustering, where each filer can fail over for one of its neighbours and the ring can be grown indefinitely. I have a group of four filers for which I would desperately love that ability (currently they are two separate clustered pairs).
> Is the cluster setup really a one-command, 15-minute job?
It took me less than 15 minutes the first time I clustered two filers together. There is a step-by-step guide to connecting the X/Y ServerNet interconnects, and a few commands to run along the way to make sure everything is hunky-dory. Once everything is connected, turning on clustering is, of course, a single command: "cf enable".
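For reference, the whole thing boils down to a handful of commands once the interconnect and shelves are cabled. I'm typing these from memory (run via rsh from an admin host, with "toaster1" a made-up filer name), so double-check the exact names against the cluster guide for your ONTAP release:

    # sanity check: can this head see its partner over the interconnect?
    rsh toaster1 cf status

    # the actual one-command step that turns clustering on
    rsh toaster1 cf enable

    # later, for testing or planned maintenance:
    rsh toaster1 cf takeover    # toaster1 takes over its partner's disks and IPs
    rsh toaster1 cf giveback    # ...and hands them back once the partner is healthy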
> Users doing their own restores on NT clients by mapping the snapshots looks like a good idea. Is it usable in the real world? It would save us heaps of hassle with classic 'ask IT to do it' restores.
Yup... this is one of the best features of ONTAP. Imagine a backup set with a couple dozen filesystems and about 7 million inodes. A reporting script just went awry and zeroed out about 8000 random RCS archive files before someone caught it. These files are scattered all around the filesystem. The last full backup was from 5 days ago, and the daily differential was taken 15 hours ago. Imagine how much work it would be to pull those files off the tapes, figure out which *,v files need to be copied back to disk, and what your developers will be doing in the meantime. Then realize the best you can do is 15-hour-old versions.
If you have snapshots, you tell everyone to hold off on using RCS for the next little while, spend 5 or 10 minutes on a script that will search out the truncated files, find the most recent copy in snapshots and pull it back to the active filesystem. That script runs for about 30 minutes scouring the filesystem and exhumes RCS files that are at most an hour or two old. Your users are unproductive for about 45 minutes instead of having to come back the next day, and you've saved yourself hours of work and aggravation.
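For flavour, something along these lines would do it (the mount point is made up, the snapshot names depend on your schedule, and the zero-length test assumes the files were truncated to nothing):

    #!/bin/sh
    # Find RCS archives that got truncated to zero bytes, then pull the
    # newest intact copy out of the .snapshot directories (every directory
    # on a filer volume has a readable .snapshot subdirectory).

    TREE=/filer/src                                       # made-up NFS mount point
    SNAPS="hourly.0 hourly.1 hourly.2 hourly.3 nightly.0" # newest first

    find $TREE -name '*,v' -size 0 -print |
    while read f
    do
        dir=`dirname $f`
        base=`basename $f`
        for snap in $SNAPS
        do
            copy="$dir/.snapshot/$snap/$base"
            if [ -s "$copy" ]; then
                echo "restoring $f from $snap"
                cp -p "$copy" "$f"
                break
            fi
        done
    done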
This actually happened where I worked (except with thousands of small accounting-related files instead of RCS, but with effectively the same results). It may only happen once or twice in a lifetime, but you'll be glad you have the tools to deal with it when it does.
> Any good rule-of-thumb sizing advice for the amount of space to reserve for snapshots?
> Similarly for automated snapshot schemes, does anyone do multiple snapshots intra-day and maybe keep one for a longer period, e.g. keep a midnight snapshot for x days?
Hard to say. Your snapshot reserve will depend on your schedule, and your schedule will depend on your users' habits. On our filers that hold business customer data (web pages, mail, etc.) we snapshot every other hour between 8 am and 6 pm, keeping the four most recent. Five nightly snapshots are kept (taken at midnight), and we don't bother with any weeklies (too much file turnover). On those filers, the 10% snap reserve stays pretty full.
One caveat: the disk space set aside by "snap reserve" does *not* limit the size of your snapshots! This is probably something that should be made more clear in the man page. It only prevents the active filesystem from chewing into snapshot space, not the other way around. IOW, don't crank up the schedule to keep the last 14 days of snapshots, and assume the filer will expire old ones as it runs out of snap reserve.
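To make our schedule concrete, here is roughly what it looks like on a multi-volume ONTAP (syntax from memory, so check the snap man page; "toaster1" and "vol0" are placeholders):

    # 0 weekly, 5 nightly, and keep 4 hourlies taken at 8,10,12,14,16,18
    rsh toaster1 snap sched vol0 0 5 4@8,10,12,14,16,18

    # the 10% reserve mentioned above
    rsh toaster1 snap reserve vol0 10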
> Is anyone prepared to comment, privately or otherwise, on any recent comparisons they've done with Celerra and Auspex? I understand the cost thing with EMC but loads of people seem to buy them still. This is not intended as flame bait for all the NetApp advocates.
If you ignore the price tag (hardware and post-sales support contracts), the EMC looks great on paper. I have a couple of inherited Symmetrix frames now, and from what my guys tell me, that thing is just a pain in the butt to manage. You have drives that have "hypervolumes" overlaid, arranged into 3-disk "RAID-S" sets, and then the host system is running Veritas to manage the logical volumes, etc., etc. I don't know a whole lot about the Symmetrix product yet, but it seems very messy and difficult to understand. EMC's professional services group was contracted to configure the frame, so I assume this is as good as it gets with them (perhaps an incorrect assumption). I imagine the Celerra will have many of the same underlying issues, except now you have to deal with EMC's NFS implementation as well.
> Any advice on what we should include in our eval to really test the box out?
Try the extreme cases like very large directories (10k's or 100k's of files), very deep directory hierarchies, large files (2GB+) and intense file locking activity (something that always sucks over NFS). We have one (poorly-written, IMHO) application that blew chunks on NFS because of extremely intensive locking activity.
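If it helps, a brain-dead script along these lines will cover the first three cases; the mount point and counts are made up, and locking is best exercised with your real application (or a small fcntl() test program) rather than from the shell:

    #!/bin/sh
    # Crude NFS torture test against a scratch area on the eval filer.
    cd /filer/test || exit 1

    # 1. one huge flat directory: 100,000 tiny files, then a full listing
    mkdir bigdir
    i=0
    while [ $i -lt 100000 ]
    do
        touch bigdir/f.$i
        i=`expr $i + 1`
    done
    ls bigdir | wc -l    # time this, and a lookup of a single file, by hand

    # 2. a stupidly deep directory tree
    mkdir deep && cd deep
    i=0
    while [ $i -lt 1000 ]
    do
        mkdir d && cd d
        i=`expr $i + 1`
    done
    cd /filer/test

    # 3. a couple of 2GB+ files, written and read back (this is exactly
    #    where old NFS clients and non-largefile OSes tend to fall over)
    dd if=/dev/zero of=big1 bs=1024k count=2200
    dd if=big1 of=/dev/null bs=1024k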