rrhodes@firstenergycorp.com wrote:
The stats I'm looking at are from a "lun stat" cmd.
Ah, OK, this is per LUN. So it's latency and ops distribution on the LUN, and it has a queue length and everything, just like a physical disk... :-\ Sorry, my experience with LUN-based systems (iSCSI or FC-AL) is rather limited, so I can't really say whether 50 ms is very horrible or just bad -- but anything up to 5-15 ms wouldn't worry me too much from what I have seen in the past.
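For reference, that per-LUN view should come from something like this (7-Mode syntax from memory, path taken from your output, so adjust as needed):

    lun stats -o -i 300 /vol/v_fnce20p_db/q_fnce20p_db/lun0

where -o adds the extended columns (other ops, QFull, latency, queue length, partner ops/kB) and -i 300 repeats every 5 minutes.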
You can run reallocate -p -f <vol_name>, or reallocate an individual LUN, but in this case it doesn't look like there's much read traffic, so it probably wouldn't buy you much...
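If you did want to try it anyway, a rough sketch of the full 7-Mode commands (from memory, using your volume name as the example -- check the reallocate man page before running anything):

    reallocate measure -o /vol/v_fnce20p_db       (one-shot check of the current optimization level)
    reallocate start -f -p /vol/v_fnce20p_db      (one-time physical reallocation of the whole volume)
    reallocate status                             (watch progress)

You can give reallocate start a LUN path instead of the volume if you only want to touch one LUN.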
You ran the stats with -i 300 over 3600 s (1 h) as I understand it, and it shows a queue length of just over 4. That doesn't feel very pleasant to me; it will invariably lead to fairly high latency, I'd expect -- it's kind of implicit in the concept, if you know what I mean. If that queue length were 1 instead of 4, you'd be at ~12 ms average latency on those LUNs instead of 50. The question is: why is the queue length so high?
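(That ~12 ms is just the proportionality: with a roughly constant service rate, average latency scales with the queue length, so 50.55 ms / 4.03 ≈ 12.5 ms at a queue length of 1. Little's Law gives the same ballpark from the other direction: latency ≈ queue length / throughput ≈ 4.03 / (10 + 61 ops/s) ≈ 57 ms, which is in the neighbourhood of the 50 ms you measured.)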
I see you have much more write pressure here than read, so if anything I think you may have some kind of free-space fragmentation issue, especially if the aggregate this sits on is getting full. Keeping aggregates below some percentage full is highly advisable; how high you want to risk going depends on your workload and your patience :-)
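A quick way to check how full it is (7-Mode, from memory -- fill in your actual aggregate name):

    df -Ah                           (used/available per aggregate)
    aggr show_space -h <aggr_name>   (breakdown of where the space inside the aggregate went)

The often-quoted rule of thumb is to start worrying somewhere above roughly 80-90% used, but as said, it depends on the workload.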
In the sysstat output the disk utilization is pretty high; ~50% is definitely a warning sign for issues, but it's not a disaster quite yet.
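If you want to see whether it's one spindle or the whole set that's busy (the sysstat disk-util figure is, as far as I remember, just the busiest disk), statit in advanced mode should give you per-disk ut% -- rough sketch, check the man page before running it on a production head:

    priv set advanced
    statit -b                        (start collecting)
    ... wait a few minutes of normal load ...
    statit -e                        (stop and print, including the per-disk utilization section)
    priv set admin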
For example, here are some of the worst lunstats lines from a 1-hour interval last Friday.
They show an average latency of about 50 ms.
Read  Write  Other  QFull   Read  Write  Average   Queue    Partner      Lun
 Ops    Ops    Ops            kB     kB  Latency  Length   Ops    kB
  10     61      0      0    180   1051    50.55    4.03     0     0    /vol/v_fnce20p_db/q_fnce20p_db/lun2  201401171200
  11     61      0      0    171   1070    50.16    4.03     0     0    /vol/v_fnce20p_db/q_fnce20p_db/lun3  201401171200
  10     60      0      0    168   1046    50.38    4.02     0     0    /vol/v_fnce20p_db/q_fnce20p_db/lun4  201401171200
  10     61      0      0    168   1063    49.80    4.02     0     0    /vol/v_fnce20p_db/q_fnce20p_db/lun1  201401171200
  11     60      0      0    171   1051    50.16    4.02     0     0    /vol/v_fnce20p_db/q_fnce20p_db/lun0  201401171200
It looks like they are mostly write ops, but the latency isn't broken down by reads vs. writes. The queue length is about 4.
The only snapshot activity is snapmirror/snapvault operations, which occur every 6 hours, and a snapmirror/snapvault transfer would have occurred for this volume during this 1-hour interval.
I have the sysstat output from the same 1-hour interval (5-minute intervals for 1 hour starting 201401171200). The head really isn't very busy. This head is SAN only, so the net activity is snapmirror/snapvault traffic.
 CPU    NFS   CIFS   HTTP  Total     Net kB/s      Disk kB/s    Tape kB/s  Cache  Cache    CP   CP  Disk  OTHER    FCP  iSCSI     FCP kB/s   iSCSI kB/s
                                      in    out    read  write  read write    age    hit  time   ty  util                           in    out    in   out
 40%      0      0      0   5050    320  12672   34895  23444     0     0    14s    96%   39%   54   39%      9   5041      0   15762  88420     0     0
 28%      0      0      0    757    530  21550   42006  15396     0     0    26s    93%   33%   42   38%      8    749      0    9284  16318     0     0
 25%      0      0      0    838    209   8578   22343  27128     0     0     3s    94%   50%   88   31%     10    828      0   15800  13103     0     0
 46%      0      0      0    700   1871  75567   83102  39238     0     0     1s    91%   43%   47   43%      7    693      0   27099  11266     0     0
 14%      0      0      0    482    147   8675   21454   9020     0     0      1    96%   37%   19   31%      9    473      0    6240  13769     0     0
 17%      0      0      0    702    143   8047   28066  12103     0     0      1    95%   36%   22   41%      9    693      0    8235  21790     0     0
 19%      0      0      0    702    137   7922   26423  24004     0     0      1    95%   38%   35   39%      8    694      0   16770  17597     0     0
 18%      0      0      0    642    156   8456   26082  14530     0     0     0s    93%   44%   20   44%      9    633      0    8983  15153     0     0
 20%      0      0      0   1005    157   8168   33210  14510     0     0      1    93%   27%   30   53%     10    995      0    9724  23917     0     0
 20%      0      0      0   1033    137   7337   25215  20989     0     0      1    92%   33%   35   45%      8   1025      0   14157  17978     0     0
 19%      0      0      0    860    136   7757   28052  14805     0     0      1    94%   27%  30f   48%      8    852      0    9464  21196     0     0
 24%      0      0      0   1101    150   9264   33514  18724     0     0      1    94%   32%   30   49%      9   1092      0   12129  25414     0     0