New subject: Disk drive technology

24 Feb 2000


      On Wed 23 Feb, 2000, "Mohler, Jeff" jeff.mohler@netapp.com wrote:
...
I dont have any solid data on data/stripe/drivesize performance handy, but it is
a wise move to not span raid-groups over multiple FCAL controllers.  There will
be write degredation involved with that that I have seen.  With two F760s
rebuilding a manually failed drive while doing nothing else, the machine without
a stripe split across FCAL cards completed about 25% before the machine that did
span FCAL cards for a raid stripe.
mkfile tests from Solaris confirmed what Id heard..and saw in the reconstruct as
part of my test.
This strikes me as very interesting. Until now I'd have reasoned that
having two times the bandwidth to the spindles, and the lower latency
that implies as well, (having only a proportion of io's in flight on each
channel getting in the way of subsequent io's), would have benefitted
RAID rebuilding - which is the nastiest case where all stripes must be
read, XOR'd-over and blocks written to one disk.
Can anyone explain why this would be so?
...
-- End of excerpt from "Mohler, Jeff"
On Wed 23 Feb, 2000, "Bruce Sterling Woodcock" sirbruce@ix.netcom.com wrote:
...
Probably not until late this year or next.
Fair enough. I imagine they might appear with new product, as it's been a
while.. *speculates wildly*
...
The complete model depends entirely on your filer model, the OS version,
exactly what disks you have, how much memory, what your op mix is, etc.
Well, yes, but that's not impossible to plug into an analytical model.
I'd also hope that these low-level functions were pretty finely tuned
in DOT and not as amenable to radical change..
...
The make very little difference.  On the order of 10% for most people.
As to WHY, it's because Netapp finely balances their filers for maximum
performance at typical op mixes.  They are not like a competitor, which
throws a bunch of hardware at a problem and you are expected to juggle
it to figure out what works.  Netapp wants to deliver a simple solution.
Yeees. However, there's a touch of faith creeping in here. I don't
think it's misplaced, certainly, I think NetApp's are my favourite
fileservers for these reasons. I guess I'm just getting more cycnical
and evidence-biased as I get older. 8)
...
There are some performance papers on Netapp's web site that address
some of your questions, and people are free to post their personal
experiences here, but what works for one environment may not work
for another.  I would be reluctant to tell you to go out and buy more
drives you don't need if it won't help you.
Indeed, me too. But how can we quantitatively be confident ahead of time?
...
Given that those tables constantly change with the variables (drives,
memory, OS, etc.) rules of thumb is probably the best you can do, unless
you have a few million to spend on testing every possible configuration
and reporting the results. :)
Well I know only one place and one group of people that could undertake
this, and they sell the boxes. I'd be very susprised if the creators,
the engineers, haven't got *something* going on in this line.
...
There is no best.  There is only what is best for your environment.
Yes, I'd like to think my 'best' would be considered as predicated on
the requirements. I'm not an absolutist, or at least I try not to be.
...
I would guess slightly better, since the 18 gig drives are probably faster
and you have 2 loops to share the load.  However, since the bottleneck
will be the NVRAM, I doubt there will be a major noticeable difference.
Ah, I was thinking more about this - the nvram emptying after a cp takes
time and that is a key determinant in whether you'll be taking a cp from
a cp or not, in turn being a key determinant in user-visible performance.
This nvram emptying is surely determined in turn by how rapidly the
blocks can be pushed to disk, in which case the speed of the disks and
the utility of multiple loops may have some impact, surely? Or is the
WAFL so good at distributing writes, and the bandwidth and latency of
a single loop such a good match for the nvram that this is a null-term?
...
I can't consider them if you don't say what they are.
I meant, consider whether they're a big factor or not. I was thinking
about the two scenarios using the same 10-spindle RAID set size.
In the average case the DOT algorithm has to read a few blocks from
each of a few disks from a given RAID set, perform the parity XOR over
those blocks and the blocks it intends to write, generate the parity,
and then write all those new blocks to the disks that it has elected to
write to during a cp flush, always including the parity disk. Having
fewer spindles means more load per spindle, having fewer loops means
more load per loop. How much does the cpu have to do, how much can be
DMA'd or otherwise fobbed-off. How little time can you whittle the
nvram empty down to?
I mean, to be reliable you have to await the completion notice from each
disk for each io sent, and I don't believe 18GB disks to be capable of 2x
the io's of 9GB disks. Serialising 4 RAID sets on one loop though, even
with more disk-ios/second, will probably increase the final-response time
to do a full nvram flush by a goodly chunk over the scenario with 2 loops.
Given a pathalogical filer setup I could see that increasing the nvram
on a write-burdened filer might actually make the problem worse rather
than better. How does an admin, especially one not so very interested
and aware of these things, approach a filer that is struggling but they
don't know why or how to fix it?
...
You can't know if you got it right unless you have an exact simulation
of your particular ops mix and traffic patterns.  Since you don't have
that, then basically you have to use rules of thumb and guess and adjust
when needed.  Your environment changes over time, too, so what works
one day may not work another.
Mmm, all of which makes me lean toward an analytical model, as it should
be more powerful in dealing with these imponderables: plug in the numbers
and see how it comes out. Real testing and simulation would make useful
checks against such a model, natch.
...
Maybe it's just me, but while I love having numbers and data, I've always
found such tuning tasks to be far more intuitive and situational.
Well, yes, I guess this is because we're human and quite good at dealing
with vague and wooly things that nevertheless have patterns that can be
extracted with experience.
...
I mean, if
I have a chart that says such-and-such filer with such-and-such configuration
can handle 200 users, and the filer in my environment overloads at 100
users, I'm not gonna keep adding users.  Conversely, if it runs at 50% and
high cache age with 200 users, I'm not going to worry about adding more.
True. Computers aren't as deterministic as they once were and we have to
deal with what's really there, however I think there is scope for better
models than none-at-all, or 'ask Bruce or Mark, they're good with filers'.
I also wouldn't advocate *such* a simple model, at least not alone.
...
If a filer is overloaded with reads, and I haven't maxxed the RAM yet, I'm
probably going to max the RAM first.  Then I'll worry about adding more
disks, and if that doesn't help, I'll know I need to reduce traffic and get
another filer.  There are some minor variations to this, of course, but I'm
not going to waste a lot of time beforehand trying to predict exactly how
many disks and how much memory I need when the reality can quickly
change.  Estimate, yes, but I won't follow a strict chart.
Okey, so perhaps I've overemphasised the model and chart thing, or
perhaps you're playing devil's advocate to my credulous believer. As a
tool in the salesman's, field engineer's and admin's box'o'tricks though,
a configurator other than than the SPEC result-sheets would be a very
neat thing to have.
And as for wasting a lot of time beforehand - that's rather my point. You
wouldn't have to waste time, or a lot of it.
So long as the results of a model were within a few tens of percent then
your putative admin has gotten close enough to start tuning, rather
than transhipping for the larger machine they thought they might need
but chickened out of. Better, they find they're peaking just below
the capability of the box they ordered and didn't cough up for the
big-expensive beast they'd otherwise have bought "just in case".
...
-- End of excerpt from "Bruce Sterling Woodcock"
-- 
-Mark                     ... an Englishman in London ...

Re: Disk drive technology