64MB file on an F740 running 5.1.2, minra=on on that RAID volume. The filesystem is mounted via NFSv3 and UDP transport. Network connection is fast Ethernet via crossover cable.
First run is from a Sun E450 reading in the file with dd and 32K block size. Second run is from another E450 (same network config, different NIC on the F740) reading in the same file. Reading a file from disk seems to be exceptionally slow. Reading the same file from cache is faster, but still not close to what I'd expect over an otherwise quiet 100 Mbps link. Is there something I need to tune?
 CPU   NFS  CIFS  HTTP    Net kB/s    Disk kB/s   Tape kB/s  Cache
                          in    out   read write  read write   age
 19%   455     0     0    92   3890   3639     0     0     0     7
 18%   437     0     0    90   3732   3507     0     0     0     7
 18%   436     0     0    89   3720   3480     0     0     0     7
 16%   364     0     0    74   3109   2922     0     0     0     7
 18%   426     0     0    86   3638   3411     0     0     0     7
 18%   435     0     0    88   3712   3480     0     0     0     7
 [...]
 CPU   NFS  CIFS  HTTP    Net kB/s    Disk kB/s   Tape kB/s  Cache
                          in    out   read write  read write   age
 12%   560     0     0   122   4708      0     0     0     0     2
 19%  1029     0     0   224   8780      0     0     0     0     2
 19%   993     0     0   217   8472      0     0     0     0     2
 19%  1017     0     0   222   8677      0     0     0     0     2
 20%  1025     0     0   224   8746      0     0     0     0     2
 19%  1017     0     0   222   8677      0     0     0     0     2
 [...]
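The client-side read test described above can be sketched in Python for reference. This is a sketch only: the original test used dd with a 32K block size directly, and the demo below reads a local temp file standing in for the NFS-mounted file (the 4MB size is arbitrary).

```python
import os
import tempfile
import time

def timed_read(path, bs=32 * 1024):
    """Sequentially read a file in bs-sized chunks, like `dd bs=32k`,
    and return (bytes_read, MB_per_sec)."""
    total = 0
    start = time.time()
    with open(path, "rb") as f:
        while chunk := f.read(bs):
            total += len(chunk)
    elapsed = time.time() - start
    return total, total / (1024 * 1024) / max(elapsed, 1e-9)

# Demo on a local temp file; over NFS you would point this at the mount.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"\0" * (4 * 1024 * 1024))  # 4MB stand-in for the 64MB file
nbytes, rate = timed_read(tmp.name)
print(nbytes, f"{rate:.1f} MB/s")
os.unlink(tmp.name)
```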
Hi Brian,
64MB file on an F740 running 5.1.2, minra=on on that RAID volume.
The filesystem is mounted via NFSv3 and UDP transport. Network connection is fast Ethernet via crossover cable.
What are results with minra=off? Wouldn't that be a better setting for large sequential files?
First run is from a Sun E450 reading in the file with dd and 32K
block size. Second run is from another E450 (same network config, different NIC on the F740) reading in the same file. Reading a file from disk seems to be exceptionally slow.
I agree, this seems slow. How many disks in that volume/raid group in question? Could that file somehow be very fragmented?
Reading the same file from cache is faster, but still not close to what I'd expect over an otherwise quiet 100 Mbps link.
Can't agree with you here. You're getting roughly 8.4 MB/s over a link capable of about 10-12.5 MB/s max. Not optimal, but not bad either...
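A quick back-of-the-envelope check of those figures, using the peak "Net out" value of 8780 kB/s from the cached sysstat run quoted earlier (a sketch; sysstat reports kB/s, and this ignores Ethernet/IP/UDP header overhead):

```python
# Back-of-the-envelope link utilization check.
link_bits_per_sec = 100e6                         # 100 Mbps Fast Ethernet
theoretical_max_mb = link_bits_per_sec / 8 / 1e6  # 12.5 MB/s raw, before headers
observed_mb = 8780 / 1024                         # peak "Net out" from sysstat

print(f"theoretical max: {theoretical_max_mb:.1f} MB/s")
print(f"observed:        {observed_mb:.1f} MB/s")
print(f"utilization:     {observed_mb / theoretical_max_mb:.0%}")
```

With protocol overhead, real-world NFS throughput well under the 12.5 MB/s raw figure is expected, which is why 10-12.5 MB/s is quoted as the practical ceiling.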
Is there something I need to tune?
Maybe. Can you tell me more about the NFS patch level and mount options?
-Val.
==============================================
Val Bercovici            Office: (613)724-8674
Systems Engineer         Pager:  (800)566-1751
Network Appliance        valb@netapp.com
Ottawa, Canada           FAST,SIMPLE,RELIABLE
==============================================
--
Brian Tao (BT300, taob@risc.org)
"Though this be madness, yet there is method in't"
On Thu, 1 Apr 1999, Val Bercovici (NetApp) wrote:
What are results with minra=off? Wouldn't that be a better setting for large sequential files?
Yeah, it definitely is.... I'm getting peaks of almost 15MB/sec disk reads with three clients over three 100tx ports reading three different 256MB files. sysstat reports ~55% CPU usage at that point. Reading all files out of cache gets me close to 24MB/sec (at roughly the same CPU usage).
Is the disk bandwidth limited because of parity calculations on reads? I just realized that RAID literature seems to concentrate on disk write performance hits with a parity system, but it would be reasonable that a RAID will calculate and compare parity for reads as well. That would explain the upper limit on sequential read throughput.
I agree, this seems slow. How many disks in that volume/raid group in question? Could that file somehow be very fragmented?
5x9GB drives, tests were done on a mostly empty filesystem with 3x256MB and 4x64MB files laid down sequentially (no other write activity in between).
Reading the same file from cache is faster, but still not close to what I'd expect over an otherwise quiet 100 Mbps link.
Can't agree with you here. You're getting roughly 8.4 MB/s over a link capable of about 10-12.5 MB/s max. Not optimal, but not bad either...
I'm also quite willing to believe that Solaris 2.6 doesn't have the fastest NFS client code either... my FreeBSD box accessing the same Netapp over a switched 100t LAN manages to sustain 9200K/sec. ;-)
What are results with minra=off? Wouldn't that be a better setting for large sequential files?
Yeah, it definitely is.... I'm getting peaks of almost 15MB/sec
disk reads with three clients over three 100tx ports reading three different 256MB files. sysstat reports ~55% CPU usage at that point. Reading all files out of cache gets me close to 24MB/sec (at roughly the same CPU usage).
Good. Glad to hear this.
Is the disk bandwidth limited because of parity calculations on
reads? I just realized that RAID literature seems to concentrate on disk write performance hits with a parity system, but it would be reasonable that a RAID will calculate and compare parity for reads as well. That would explain the upper limit on sequential read throughput.
Actually, unless we're in degraded mode (meaning a disk has failed and we either have no spare disk to rebuild onto or we're actually in the time window of the process of rebuilding to the hot spare disk) there should be no RAID overhead whatsoever on reads. I'm sure Guy or someone will correct me if I'm wrong here...
I agree, this seems slow. How many disks in that volume/raid group in question? Could that file somehow be very fragmented?
5x9GB drives, tests were done on a mostly empty filesystem with
3x256MB and 4x64MB files laid down sequentially (no other write activity in between).
5 data disks or 5 disks including parity (meaning 4 data disks)? Either way, that's well below our sweet spot of 14 9GB disks per raid group. If sequential performance is critical I would obviously consider adding more drives. FYI - I have no idea what our sweet spot is for 18GB drives. Either way, I suspect this is now your bottleneck.
Reading the same file from cache is faster, but still not close to what I'd expect over an otherwise quiet 100 Mbps link.
Can't agree with you here. You're getting roughly 8.4 MB/s over a link capable of about 10-12.5 MB/s max. Not optimal, but not bad either...
I'm also quite willing to believe that Solaris 2.6 doesn't have
the fastest NFS client code either... my FreeBSD box accessing the same Netapp over a switched 100t LAN manages to sustain 9200K/sec. ;-)
I always suspect NFS client code <g>. Actually, what are your mount options? They may also provide some clues...
-Val.
==============================================
Val Bercovici            Office: (613)724-8674
Systems Engineer         Pager:  (800)566-1751
Network Appliance        valb@netapp.com
Ottawa, Canada           FAST,SIMPLE,RELIABLE
==============================================
On Thu, 1 Apr 1999, Val Bercovici (NetApp) wrote:
Actually, unless we're in degraded mode (meaning a disk has failed and we either have no spare disk to rebuild onto or we're actually in the time window of the process of rebuilding to the hot spare disk) there should be no RAID overhead whatsoever on reads. I'm sure Guy or someone will correct me if I'm wrong here...
How does the Netapp know if there is bad data on reads then? Does it rely on the drive to signal bit errors?
5 data disks or 5 disks including parity (meaning 4 data disks)?
7x9GB drives total, 5 data, 1 parity, 1 hot spare.
Either way, that's well below our sweet spot of 14 9GB disks per raid group. If sequential performance is critical I would obviously consider adding more drives. FYI - I have no idea what our sweet spot is for 18GB drives. Either way, I suspect this is now your bottleneck.
Eek, 14 drives? I find that I run out of CPU cycles or raw throughput before I hit a storage capacity limit. With the 9GB drives being discontinued, having to buy 14x18GB drives (if the sweet spot is the same) because of performance instead of storage seems like a waste to me.
I always suspect NFS client code <g>. Actually, what are your mount options? They may also provide some clues...
On Solaris, "mount -o proto=udp,vers=3". On FreeBSD, "mount -o udpmnt,nfsv3". I believe both OS's default to 32K r/w sizes for NFSv3. Running nfsiod on FreeBSD also makes a huge difference, I found (5.5MB/sec vs. 9MB/sec).
How does the Netapp know if there is bad data on reads then? Does
it rely on the drive to signal bit errors?
We rely on the disk's built-in data checking to tell us when something went wrong.
This is why you can get away with a simple parity approach.
Even if we did do XOR calculations for every read, that wouldn't give us enough information to fix the problem. All we would know is that a given stripe was inconsistent. The parity information is only sufficient to fix the data if you also know which block to fix. (You could design RAID around a full ECC code, but that would create quite a bit more overhead.)
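Dave's point (single parity can rebuild a block you already know is lost, but cannot identify a silently bad one) can be illustrated with a toy XOR stripe. This is a hypothetical four-data-disk sketch, not NetApp's actual RAID code:

```python
from functools import reduce

# Toy stripe: four 2-byte data blocks plus one XOR parity block.
data = [b"\x01\x02", b"\x10\x20", b"\x0f\xf0", b"\xaa\x55"]

def xor_blocks(blocks):
    """Byte-wise XOR of equal-length blocks."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))

parity = xor_blocks(data)

# If we KNOW disk 2 failed, XOR of the survivors plus parity rebuilds it:
rebuilt = xor_blocks([data[0], data[1], data[3], parity])
assert rebuilt == data[2]

# But if some block is silently corrupted, the parity check only says
# "this stripe is inconsistent" -- it cannot say WHICH block is wrong:
corrupted = [data[0], data[1], b"\xde\xad", data[3]]
syndrome = xor_blocks(corrupted + [parity])
assert syndrome != bytes(len(parity))  # nonzero syndrome: inconsistent stripe
```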
The performance hit of doing RAID parity on every read would be astoundingly dismal -- much worse even than for writing. To write a single block in a stripe, you read that block and the parity block, do some math, and then write both blocks -- a total of 4 I/Os for the write. And that's true even if you've got 20 disks in your array. By contrast, to do checking on a READ with a 20 disk stripe, you would have to read a block from all 20 disks, for a total of 20 I/Os for the read. YOW!
And that doesn't even take into account the fact that for writes, we can do WAFL's write anywhere cleverness to avoid seeks, and write multiple blocks in a stripe to reduce that 4-to-1 penalty. Reads tend to come in randomly, so the pain is harder to reduce.
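The I/O counts in Dave's comparison work out as follows. This is simple counting under the assumptions he states (single parity disk, one-block operations, no WAFL batching):

```python
def small_write_ios():
    # Read-modify-write of one data block in a single-parity RAID:
    # read old data, read old parity, write new data, write new parity.
    return 4  # independent of stripe width

def verified_read_ios(stripe_width):
    # Verifying parity on a read would mean reading every block in the
    # stripe (all data blocks plus parity) just to check consistency.
    return stripe_width

print(small_write_ios())      # I/Os to write one block
print(verified_read_ios(20))  # I/Os to verify-read one block on 20 disks
```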
Dave
On Thu, 1 Apr 1999, Dave Hitz wrote:
Even if we did do XOR calculations for every read, that wouldn't give us enough information to fix the problem. All we would know is that a given stripe was inconsistent.
Yeah, that makes sense now that I think about it. :)
The parity information is only sufficient to fix the data if you also know which block to fix. (You could design RAID around a full ECC code, but that would create quite a bit more overhead.)
BTW, is the parity calculation handled by the main filer CPU, or is it offloaded to a dedicated piece of silicon on the disk controller?
On Thu, 1 Apr 1999, Dave Hitz wrote:
(You could design RAID around a full ECC code, but that would create quite a bit more overhead.)
What kind of consistency checks do the disks do? I hope it's at least CRC. I must admit the thought of implementing error detection in HDs never occurred to me.
And that doesn't even take into account the fact that for writes, we can do WAFL's write anywhere cleverness to avoid seeks, and write multiple blocks in a stripe to reduce that 4-to-1 penalty. Reads tend to come in randomly, so the pain is harder to reduce.
How do you deal with write requests that are smaller than your block size? Don't you still have to read in the block and patch it?
I see a new area for my "abusive" testing once I get past waiting on the phone for tech support to respond to my now 2-day outage of a production machine. I have a priority 1 ticket and still no one bothers to call me.
BTW, could you change your muzak or at least make it several times longer?
Tom
On Thu, 1 Apr 1999, Val Bercovici (NetApp) wrote:
5 data disks or 5 disks including parity (meaning 4 data disks)? Either way, that's well below our sweet spot of 14 9GB disks per raid group.
Wow, I'm discovering my psychic abilities. 14 is exactly the raid group size I chose on our filers. Can you share some data on how NAC arrived at this number? In our case it just happened that we wanted to have two volumes as separate as we could make them, and we happened to have 4 shelves in each filer.
Tom