For a long time we've known that backing up our largest volume (3.5T) was slow. More recently I've been investigating why, and it looks like a problem with only that shelf, or possibly that aggregate. Basically it is several times slower than any other shelf/aggregate we have, and it seems bottlenecked no matter what I do: whether I'm reading or writing via NFS, NDMP, reallocate scans, etc., that shelf is always slower. I will probably have a support case opened with NetApp tomorrow, but I figured I'd check with the list first to see what else I can find out on my own. NDMP backups from it get only around 230 Mbit/sec, as opposed to 800+ from the others. The performance drops distinctly on the hour, probably for snapshots (see pic). Details below. 0c.25 looks like a hot disk, but the activity on that aggregate also seems too high given that the network bandwidth is fairly small. A 'reallocate measure' on the two large volumes on aggregate hanksata0 returns a score of '1' for both.
I guess my two main questions are: how do I figure out what is causing the activity on hanksata0 (especially the hot disk, which is sometimes at 100%), and if it's not just activity but an actual problem, how can I further debug the slow performance to find out what is at fault?
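In case the methodology matters: the per-disk numbers further down came from a statit capture taken while the backup was running, roughly like this (typed from memory, so the exact sequence may be slightly off):

    priv set advanced
    statit -b        (start collecting per-disk stats)
    ... let the backup/copy run for a minute or two ...
    statit -e        (print the report, including the disk ut%/usecs table below)
    priv set admin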
I used ndmpcopy to copy a fast volume with large files from another filer to a new volume on hanksata0 and hanksata1. The volume on hanksata0 is slow but the one on hanksata1 is not. Both of those aggregates are on the same loop with hanksata1 terminating it.
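For reference, the copies were kicked off with something along these lines (the source filer and volume names here are placeholders, not the real ones):

    ndmpcopy otherfiler:/vol/fastvol hank:/vol/scratchtest
    ndmpcopy otherfiler:/vol/fastvol hank:/vol/scratchtest2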
Sun Feb 28 20:14:20 EST [hank: wafl.scan.start:info]: Starting WAFL layout measurement on volume scratchtest.
Sun Feb 28 20:19:01 EST [hank: wafl.reallocate.check.value:info]: Allocation measurement check on '/vol/scratchtest' is 2.
^^^ almost 5 minutes!
Sun Feb 28 20:13:38 EST [hank: wafl.scan.start:info]: Starting WAFL layout measurement on volume scratchtest2.
Sun Feb 28 20:14:12 EST [hank: wafl.reallocate.check.value:info]: Allocation measurement check on '/vol/scratchtest2' is 1.
^^^ less than 1 min
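Those two check values came from one-off measurements, something like this (if I'm remembering the flag right, -o does a single measurement rather than a repeating scan):

    reallocate measure -o /vol/scratchtest
    reallocate measure -o /vol/scratchtest2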
When I write to scratchtest, you can see the network bandwidth jump up for a few seconds, then it stalls for about twice as long (presumably while the filer catches up on writes), then the cycle repeats. Throughput averages around 30-40 MB/sec at best.
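If it would help, I can grab a 'sysstat -x 1' trace during one of those burst/stall cycles; my understanding is that the CP (consistency point) and disk-util columns there should show whether the stall is just the filer flushing writes to that aggregate or something else.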
I even tried using the spare SATA disk from both of these shelves to make a new volume, copied scratchtest to it (which took 26 minutes for around 40G), and reads were just as slow as from the existing scratchtest, although I'm not sure whether that's because a single disk is too slow to prove anything, or whether there's a shelf problem.
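(For what it's worth, 40G in 26 minutes works out to roughly 40,000 MB / 1560 sec ≈ 26 MB/sec, i.e. the same ballpark as the slow writes above.)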
Aggregate                kbytes       used      avail capacity
hanksata0            6120662048 6041632124   79029924     99%
hanksata0/.snapshot   322140104   14465904  307674200      4%
hanksata1            8162374688 2191140992 5971233696     27%
hanksata1/.snapshot   429598664   39636812  389961852      9%
hanksata0 and 1 are both ds14mk2 AT but hanksata0 has X268_HGEMI aka X268A-R5 (750m x 14) and hanksata1 has disks X269_HGEMI aka X269A-R5 (1T x 14). hanksata0 has been around since we got the filer say around 2 years ago, hanksata1 was added within the last half year. Both shelves have always had 11 data disks, 2 parity, 1 spare, the aggregates were never grown.
volumes on hanksata0 besides root (all created over a year ago):
volume 1 (research): no dedupe (too big); 10 million inodes, approx 3.5T, 108G in snapshots. Endures random user reads/writes but usually fairly light traffic. Populated initially with rsync, then opened to user access via NFS.
Sun Feb 28 21:38:11 EST [hank: wafl.reallocate.check.value:info]: Allocation measurement check on '/vol/research' is 1.
volume 2 (reinstallbackups): dedupe enabled; 6.6 million files, approx 1.6T, 862G in snapshots. Created over a year ago; several dozen gigs of Windows PC backups are written or read multiple times per week, but the volume is otherwise COMPLETELY idle. Older data is generally deleted (remaining only in snapshots) after some weeks, and the snapshots expire after a few weeks. Only accessed via CIFS.
Mon Mar 1 12:15:58 EST [hank: wafl.reallocate.check.value:info]: Allocation measurement check on '/vol/reinstallbackups' is 1.
hanksata1 only has one volume besides the small test ones I made, and it runs plenty fast: dedupe enabled, 4.3 million files, approx 1.6T, 12G in snapshots. It was created a few months ago on an otherwise unused new aggregate with an initial rsync, followed by daily rsyncs from another fileserver that is not very active.
disk             ut%   xfers  ureads--chain-usecs writes--chain-usecs cpreads-chain-usecs greads--chain-usecs gwrites-chain-usecs
/hanksata0/plex0/rg0:
0c.16              7    5.69    0.94  1.00 55269    3.22  3.02  2439    1.52  2.71   579    0.00  ....     .    0.00  ....     .
0c.17              9    6.34    0.94  1.00 74308    3.84  2.86  2228    1.56  2.93   873    0.00  ....     .    0.00  ....     .
0c.18             63  121.00  118.86  1.01 30249    1.38  3.26  3516    0.76  5.43  2684    0.00  ....     .    0.00  ....     .
0c.19             60  117.74  116.69  1.00 30546    0.40  3.73  5049    0.65  5.56  2840    0.00  ....     .    0.00  ....     .
0c.20             60  120.82  119.66  1.02 29156    0.43  5.33  5469    0.72  4.80  3583    0.00  ....     .    0.00  ....     .
0c.21             60  119.37  118.25  1.02 29654    0.36  4.60  5870    0.76  5.76  3140    0.00  ....     .    0.00  ....     .
0c.22             62  124.87  123.32  1.02 29423    0.62  5.65  5677    0.94  3.58  2710    0.00  ....     .    0.00  ....     .
0c.23             62  119.48  118.35  1.03 30494    0.36  4.00  6875    0.76  5.14  3417    0.00  ....     .    0.00  ....     .
0c.24             61  119.08  117.96  1.02 29981    0.47  6.92  3289    0.65  3.94  2930    0.00  ....     .    0.00  ....     .
0c.25             93  118.17  116.72  1.03 45454    0.58  4.00 17719    0.87  4.63 11658    0.00  ....     .    0.00  ....     .
0c.26             61  121.40  120.27  1.04 29271    0.43  7.75  3097    0.69  5.21  2131    0.00  ....     .    0.00  ....     .
0c.27             59  115.75  114.81  1.03 29820    0.43  5.50  4530    0.51  6.00  3321    0.00  ....     .    0.00  ....     .
0c.28             63  125.53  124.15  1.01 30302    0.65  6.94  3808    0.72  3.40  5191    0.00  ....     .    0.00  ....     .
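To my eye 0c.25 is the only disk that stands out: ut% of 93 and read service times around 45,000 usecs versus roughly 30,000 for the rest of the raid group. Before the support case I plan to run 'aggr status -r hanksata0' (or 'sysconfig -r') to confirm which shelf/bay that is, in case it's simply one ailing disk dragging the whole raid group down, but I'd welcome other interpretations of these numbers.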
Both SATA shelves are on controller 0c, attached to two 3040s in an active-active, single-path HA configuration. RAID-DP in 13-disk raid groups, so we have 2 parity disks and one spare per shelf. Latest firmware/code as of the beginning of the year; ONTAP 7.3.2. No VMs, no snapmirror, nothing fancy that I can think of. 'wafl scan status' only shows 'active bitmap rearrangement' or 'container block reclamation'.
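One thing I have not done yet is look for loop-level errors. Since both shelves hang off the same 0c loop, my next idea (unless someone has a better one) is to check the advanced-mode FC-AL counters, something like 'fcstat link_stats' and 'fcstat device_map' if I have the command names right, to see whether 0c.25 or the hanksata0 shelf is accumulating link/CRC errors.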
Thanks for thoughts and input!