Edward, I'm not sure I would be concerned about the 0's on disk writes you are seeing, as long as Net kB keeps coming in/out. The NetApp may be recalculating parity and holding the writes in NVRAM before it flushes to disk, etc. As for the volume being full, I see a lot of customers add only 1 disk at a time when their volumes get full. This causes a problem because you basically do not stripe across disks; you are only writing to one disk. At that point you become disk bound, which could very well be what is happening here. Make sure you add disks a few at a time when you grow your volume, so the data is striped among drives. Mike
-----Original Message----- From: Edward Hibbert [mailto:EH@dataconnection.com] Sent: Wednesday, February 06, 2002 8:29 AM To: 'Mike Ball' Subject: RE: Slow write performance
Mike,
I can send you a couple of the outputs straight away. sysstat doesn't recognise the -u option; what's the difference between that and sysstat 1?
I take your point about network settings - we've had problems in this area before - but I don't understand why I would get 0 disk writes for a long time while still getting network traffic in and out. I know that my network traffic during such a period includes write requests (from pktt trace), so what I'm wondering is what causes the writes to hang up and then recover.
The disk on /vol/vol0 has previously got very full. Although we've now freed up some space, do you think there might be persistent fragmentation?
Edward.
nacolt1*> df
Filesystem            kbytes     used    avail capacity  Mounted on
/vol/vol0/          95328924 54170444 41158480      57%  /vol/vol0/
/vol/vol0/.snapshot        0        0        0     ---%  /vol/vol0/.snapshot
/vol/vol1/          63552616 60599768  2952848      95%  /vol/vol1/
/vol/vol1/.snapshot        0        0        0     ---%  /vol/vol1/.snapshot
Volume vol0 (root)
RAID group 0
RAID Disk HA.ID  HA SHELF BAY  CHAN  Used (MB/blks)  Phys (MB/blks)
--------- -----  ------------  ----  --------------  --------------
parity    8.3    8    0    3   FC:A  34500/70656000  35003/71687368
data      8.4    8    0    4   FC:A  34500/70656000  35003/71687368
data      8.1    8    0    1   FC:A  34500/70656000  35003/71687368
data      8.0    8    0    0   FC:A  34500/70656000  35003/71687368
Volume vol1
RAID group 0
RAID Disk HA.ID  HA SHELF BAY  CHAN  Used (MB/blks)  Phys (MB/blks)
--------- -----  ------------  ----  --------------  --------------
parity    8.2    8    0    2   FC:A  34500/70656000  35003/71687368
data      8.6    8    0    6   FC:A  34500/70656000  35003/71687368
data      8.5    8    0    5   FC:A  34500/70656000  35003/71687368
nacolt1*>
-----Original Message----- From: Mike Ball [mailto:MBall@DATALINK.com] Sent: Wednesday, February 06, 2002 1:07 PM To: Edward Hibbert Subject: RE: Slow write performance
Edward, Try forcing the switch port you plug into to 100Mb full duplex. In addition, instead of letting e0 auto-negotiate, force it to 100Mb full duplex. Your network kB/s is pretty low. Also, how full is your file system? Can you send me the output of the following commands:
df
sysstat -u 1 (run for 20 seconds)
sysconfig -r
Thank you, Mike
-----Original Message----- From: Edward Hibbert [mailto:EH@dataconnection.com] Sent: Tuesday, February 05, 2002 5:27 PM To: 'Mike Ball' Subject: RE: Slow write performance
Here's the sysstat. Interesting that this shows periodic 0 writes.
Tue Feb 5 22:22:30 GMT [tn_login_0]: root logged in from host: 192.168.21.16
sysstat 1
 CPU   NFS  CIFS  HTTP    Net kB/s   Disk kB/s   Tape kB/s Cache
                          in   out  read write  read write   age
 54%  1012     0     1   393  1494  2430  2446     0     0     1
 43%  1012     0     0   431  1549  2302  3947     0     0     1
 46%  1278     0     0   471  1787  2708  1116     0     0     1
 57%  1080     0     0   611  2019  2078  5177     0     0     1
 31%   726     0     0   439  1056   840  1028     0     0     1
 37%   962     0     0   388  1872  1561     0     0     0     1
 32%   919     0     0   404  1786  1688     0     0     0     1
 33%  1050     0     0   407  1815  1529     0     0     0     1
 43%  1102     0     0   623  2148  1264     0     0     0     1
 43%   857     0     1   347  1585  2154     0     0     0     1
 38%   746     0     0   430  1878  1704     0     0     0     1
 58%   463     0     1   255   800  2610  3675     0     0     1
 52%   997     0     0   517  1544  2840  2084     0     0     1
 48%   792     0     0   401  1350  2222  5365     0     0     1
 50%  1028     0     0   429  2055  2264  2108     0     0     1
 37%   864     0     0   304  1864  1893  2210     0     0     1
 39%  1053     0     0   566  2154   988     8     0     0     1
 45%  1143     0     0   604  2309  1733     0     0     0     1
 35%  1034     0     0   420  2123  1860     0     0     0     1
 CPU   NFS  CIFS  HTTP    Net kB/s   Disk kB/s   Tape kB/s Cache
                          in   out  read write  read write   age
 42%  1000     0     0   550  2016  1264     0     0     0     1
 52%  1118     0     1   575  2541  1589     0     0     0     1
 30%   791     0     0   327  1648  1880     0     0     0     1
 48%   880     0     0   382  1896  2022  1905     0     0     1
 47%   745     0     1   354  1354  2196  2828     0     0     1
 57%  1287     0     0   581  1772  2366  3583     0     0     1
 51%  1136     0     0   595  1929  2488  1168     0     0     1
 48%   987     0     0   419  1273  2348  3504     0     0     1
 35%   768     0     0   421  1623  1597   720     0     0     1
 45%  1237     0     0   624  2465  1869     0     0     0     1
 38%  1112     0     0   390  1995  1688     0     0     0     1
 44%  1044     0     0   585  1817  1697     0     0     0     1
ifconfig output:
e0: flags=240043<UP,BROADCAST,RUNNING,UP_1ARY,LINK_UP> mtu 1500
        inet 192.168.21.81 netmask 0xffffff00 broadcast 192.168.21.255
        ether 00:a0:98:00:97:67 (auto-100tx-fd-up)
lo: flags=240049<UP,LOOPBACK,RUNNING,UP_1ARY,LINK_UP> mtu 1536
        inet 127.0.0.1 netmask 0xff000000 broadcast 127.0.0.1
Regards,
Edward.
-----Original Message----- From: Mike Ball [mailto:MBall@DATALINK.com] Sent: 05 February 2002 19:58 To: Edward Hibbert; toasters Subject: RE: Slow write performance
Edward, Telnet to the netapp and type "sysstat 1". Let it run for about 20 seconds and send us the output. Also, type "ifconfig -a" and send us the output. Mike
-----Original Message----- From: Edward Hibbert [mailto:EH@dataconnection.com] Sent: Tuesday, February 05, 2002 2:24 PM To: toasters Subject: Slow write performance
We're seeing some performance problems on an F720 which you guys might be able to help with (even though it's not exactly top of the range nowadays).
What we see is:
- We're driving it over NFS v3.
- We get about 1500 ops/sec out of it, of which one third are writes and two thirds are reads. There aren't many file open/close operations.
- The operations are to random locations in large (10GB) files.
- The CPU is running at about 75%, and the network input and output are below the throughput of the link we have to it. We've seen both CPU and network go higher if we do simple copy tests.
- Looking at it via pktt trace, something approaching 15% of WRITE operations take long enough for the clients to time out and retransmit (so at least 1 second). None of the READ operations do.
- The retransmissions appear to come in bunches. For example, we'll see a few seconds where the filer doesn't respond, during which time the retransmissions will come in, then it will wake up and send some responses back.
- The rest of the time, the WRITEs are very fast (sub-ms).

This appears to have worsened recently. We tried a couple of things:
- We thought that this might be because the disk had got full and fragmented, so we zapped a bunch of data.
- We rebooted.
Neither of these seemed to help much.
sysstat consistently shows a cache age of 1. This, and the bursty nature of the delays, suggest to me that I'm just hitting it too hard, and there's some kind of periodic cache-flushing operation going on, but do any of you folk have any other suggestions?
Edward Hibbert Internet Applications Group Data Connection Ltd Tel: +44 131 662 1212 Fax: +44 131 662 1345 Email: eh@dataconnection.com Web: http://www.dataconnection.com
This was an issue for sure, I've been through it myself. However, you can use the wafl commands to check the layout of data and, if necessary, re-stripe it. I think the '1 disk add' issue is no longer a real issue....
~JK
MBall@datalink.com (Mike Ball) writes
I not sure I would be concerned about the 0's on disk writes you are seeing. As long as Net KB keeps coming in/out. The Netapp may be recalculating parity and holding the writes in NVRAM before it flushes to disk, etc..
He should be worried about there being too *few* 0's! As anyone who has done a "sysstat 1" (or just watched the pretty lights on the discs :-]) knows, ONTAP does disk writes in bursts (CPs). It will do a CP about once every 11 seconds unless NVRAM gets half-full (or some other exceptional condition happens) first.
Edward's CP's seem to be taking about 5-6 seconds each: that's definitely not normal. If he could do "sysstat -u" (presumably the reason he can't is that he is still on ONTAP 5.x), then he would see something like 50% in the "CP time" column. The cache age being stuck down at 1 minute all the time doesn't look healthy either.
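For anyone who wants to put rough numbers on this from a plain "sysstat 1" capture, here is a minimal Python sketch. It assumes that any second with non-zero disk writes is a second spent in a CP, which is only an approximation (the "CP time" column of "sysstat -u" would be the real measure). The write figures are copied from the Disk kB/s write column Edward posted; the function names are made up for illustration.

```python
from itertools import groupby

# Disk write kB/s, one sample per second, copied from the "sysstat 1"
# output posted earlier in the thread (31 samples).
DISK_WRITE_KBS = [
    2446, 3947, 1116, 5177, 1028, 0, 0, 0, 0, 0, 0,
    3675, 2084, 5365, 2108, 2210, 8, 0, 0,
    0, 0, 0, 1905, 2828, 3583, 1168, 3504, 720, 0, 0, 0,
]


def write_runs(samples):
    """Return (is_writing, length_in_seconds) for each consecutive run.

    Assumption: a sample counts as "writing" if it is non-zero, so the
    8 kB/s sample at the tail of the second burst is included in it.
    """
    return [(busy, len(list(group)))
            for busy, group in groupby(samples, key=lambda kb: kb > 0)]


def burst_starts(samples):
    """Return the sample index at which each write burst begins."""
    starts, index = [], 0
    for busy, length in write_runs(samples):
        if busy:
            starts.append(index)
        index += length
    return starts


runs = write_runs(DISK_WRITE_KBS)
starts = burst_starts(DISK_WRITE_KBS)
busy_fraction = sum(n for busy, n in runs if busy) / len(DISK_WRITE_KBS)

print("runs (writing?, seconds):", runs)
print("seconds between burst starts:",
      [b - a for a, b in zip(starts, starts[1:])])
print(f"fraction of seconds with disk writes: {busy_fraction:.0%}")
```

On this capture the write bursts last 5 to 6 seconds and start 11 seconds apart, with disk writes in a little over half of the samples. The first and last runs may be cut short by where the capture starts and stops, so only the middle runs should be read as full CPs.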
It may be that the disc read traffic is what is making the CPs take so long. That's an effect I have observed during dumps when obviously there is a lot of read traffic, but the CP-expansion is not usually as extreme as in this case.
Chris Thompson University of Cambridge Computing Service, Email: cet1@ucs.cam.ac.uk New Museums Site, Cambridge CB2 3QH, Phone: +44 1223 334715 United Kingdom.