Approximately late Sept 2009. I wouldn't be surprised if it was slow before
that, but I have no real data to back that up.
On 03/01/10 13:50, Jeff Mohler wrote:
> How long has this aggregate been over 95% full?
>
>
>
> On Mon, Mar 1, 2010 at 10:34 AM, Adam McDougall <mcdouga9@egr.msu.edu> wrote:
>
>     For a long time we've known that backing up our largest volume (3.5T) was
>     slow. More recently I've been investigating why, and it looks like a
>     problem with only that shelf, or possibly that aggregate. Basically it is
>     several times slower than any other shelf/aggregate we have: whether I am
>     reading or writing via NFS, NDMP, reallocate scans, etc., that shelf is
>     always the bottleneck. I will probably have a support case opened with
>     NetApp tomorrow, but I feel like checking with the list first to see what
>     else I can find out on my own.
>     When doing NDMP backups I get only around 230 Mbit/sec as opposed to
>     800+ on others. The performance drops distinctly on the hour,
>     probably for snapshots (see pic). Details below. 0c.25 looks like
>     a hot disk, but the activity on that aggregate also seems too high given
>     how little network bandwidth is being used. A 'reallocate measure' on
>     the two large volumes on aggregate hanksata0 returns a score of '1'
>     for both.
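>
>     (For reference, those scores came from the standard 7-mode reallocate
>     command run against each volume, roughly like this; exact flags are
>     from memory:
>
>         hank> reallocate measure -o /vol/research
>         hank> reallocate measure -o /vol/reinstallbackups
>
>     The resulting wafl.reallocate.check.value messages are what I quote
>     further down.)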
>
>     I guess my two main questions are: how do I figure out what is
>     causing the activity on hanksata0 (especially the hot disk, which is
>     sometimes at 100%), and if it's not just activity but an actual
>     problem, how can I further debug the slow performance to find out
>     exactly what is at fault?
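>
>     (The per-disk stats further down were gathered with statit in advanced
>     mode during one of the slow transfers, roughly:
>
>         hank> priv set advanced
>         hank*> statit -b
>         ... wait a few minutes ...
>         hank*> statit -e
>
>     so that's where the 0c.* numbers below come from.)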
>
>     I used ndmpcopy to copy a fast volume with large files from another
>     filer to a new volume on hanksata0 and another on hanksata1. The volume
>     on hanksata0 is slow but the one on hanksata1 is not. Both of those
>     aggregates are on the same loop, with hanksata1 terminating it.
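>
>     (The test copies were done with ndmpcopy from the hank console, roughly
>     like this, with 'otherfiler' and 'fastvol' standing in for the real
>     names:
>
>         hank> ndmpcopy otherfiler:/vol/fastvol /vol/scratchtest
>         hank> ndmpcopy otherfiler:/vol/fastvol /vol/scratchtest2
>
>     scratchtest ended up on hanksata0 and scratchtest2 on hanksata1.)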
>
> Sun Feb 28 20:14:20 EST [hank: wafl.scan.start:info]: Starting WAFL
> layout measurement on volume scratchtest.
> Sun Feb 28 20:19:01 EST [hank: wafl.reallocate.check.value:info]:
> Allocation measurement check on
> '/vol/scratchtest' is 2.
>
> ^^^ almost 5 minutes!
>
> Sun Feb 28 20:13:38 EST [hank: wafl.scan.start:info]: Starting
> WAFL layout measurement on volume scratchtest2.
> Sun Feb 28 20:14:12 EST [hank: wafl.reallocate.check.value:info]:
> Allocation measurement check on
> '/vol/scratchtest2' is 1.
>
> ^^^ less than 1 min
>
>     When I write to scratchtest, you can see the network bandwidth jump
>     up for a few seconds, then it stalls for about twice as long,
>     presumably so the filer can catch up on writes, and then it repeats.
>     Speed averages around 30-40 MB/sec, if that.
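>
>     (For what it's worth, the burst/stall rhythm is also visible from the
>     filer side with plain sysstat, e.g.:
>
>         hank> sysstat -x 1
>
>     watching the network-in, disk-write, CP and disk-util columns.)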
>
>     I even tried using the spare SATA disk from both of these shelves
>     to make a new volume, copied scratchtest to it (which took 26
>     minutes for around 40G), and reads were just as slow as from the
>     existing scratchtest, although I'm not sure whether that's because a
>     single disk is too slow to prove anything or there's a shelf problem.
>
> hanksata0 6120662048 6041632124 79029924 99%
> hanksata0/.snapshot 322140104 14465904 307674200 4%
> hanksata1 8162374688 2191140992 5971233696 27%
> hanksata1/.snapshot 429598664 39636812 389961852 9%
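>
>     (Those are aggregate totals in kbytes, as reported by something like:
>
>         hank> df -A
>
>     hanksata0 is the one sitting at 99%.)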
>
>     hanksata0 and hanksata1 are both DS14mk2 AT shelves, but hanksata0 has
>     X268_HGEMI aka X268A-R5 disks (750G x 14) and hanksata1 has
>     X269_HGEMI aka X269A-R5 disks (1T x 14). hanksata0 has
>     been around since we got the filer, roughly two years ago;
>     hanksata1 was added within the last half year. Both
>     shelves have always had 11 data disks, 2 parity, 1 spare,
>     and the aggregates were never grown.
>
> volumes on hanksata0 besides root (all created over a year ago):
>
> volume 1 (research):
> NO dedupe (too big)
> 10 million inodes, approx 3.5T, 108G in snapshots
> endures random user read/write but usually fairly light traffic.
> Populated initially with rsync then opened to user access via NFS.
> Sun Feb 28 21:38:11 EST [hank: wafl.reallocate.check.value:info]:
> Allocation measurement check on '/vol/research' is 1.
>
> volume 2 (reinstallbackups):
> dedupe enabled
> 6.6 million files, approx 1.6T, 862G in snapshots
>     The volume was created over a year ago; several dozen gigs of Windows
>     PC backups are written or read multiple times per week over CIFS, but it
>     is otherwise COMPLETELY idle. Older data is generally deleted after some
>     weeks (surviving only in snapshots), and the snapshots expire after a
>     few weeks. Only accessed via CIFS.
> Mon Mar 1 12:15:58 EST [hank: wafl.reallocate.check.value:info]:
> Allocation measurement check on '/vol/reinstallbackups' is 1.
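>
>     (Dedupe here is just the plain ONTAP sis feature; I can post the output
>     of something like
>
>         hank> sis status /vol/reinstallbackups
>
>     if the dedupe schedule turns out to be relevant.)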
>
>
> hanksata1 only has one volume besides the small test ones I made,
> it runs plenty fast.
> dedupe enabled
>
> 4.3 million files, approx 1.6T, 12G in snapshots
> created a few months ago on an otherwise unused new aggregate with
> initial rsync,
> then daily rsyncs from another fileserver that is not very active
>
>
>
> disk ut% xfers ureads--chain-usecs writes--chain-usecs
> cpreads-chain-usecs greads--chain-usecs gwrites-chain-usecs
> /hanksata0/plex0/rg0:
> 0c.16 7 5.69 0.94 1.00 55269 3.22 3.02 2439
> 1.52 2.71 579 0.00 .... . 0.00 .... .
> 0c.17 9 6.34 0.94 1.00 74308 3.84 2.86 2228
> 1.56 2.93 873 0.00 .... . 0.00 .... .
> 0c.18 63 121.00 118.86 1.01 30249 1.38 3.26 3516
> 0.76 5.43 2684 0.00 .... . 0.00 .... .
> 0c.19 60 117.74 116.69 1.00 30546 0.40 3.73 5049
> 0.65 5.56 2840 0.00 .... . 0.00 .... .
> 0c.20 60 120.82 119.66 1.02 29156 0.43 5.33 5469
> 0.72 4.80 3583 0.00 .... . 0.00 .... .
> 0c.21 60 119.37 118.25 1.02 29654 0.36 4.60 5870
> 0.76 5.76 3140 0.00 .... . 0.00 .... .
> 0c.22 62 124.87 123.32 1.02 29423 0.62 5.65 5677
> 0.94 3.58 2710 0.00 .... . 0.00 .... .
> 0c.23 62 119.48 118.35 1.03 30494 0.36 4.00 6875
> 0.76 5.14 3417 0.00 .... . 0.00 .... .
> 0c.24 61 119.08 117.96 1.02 29981 0.47 6.92 3289
> 0.65 3.94 2930 0.00 .... . 0.00 .... .
> 0c.25 93 118.17 116.72 1.03 45454 0.58 4.00 17719
> 0.87 4.63 11658 0.00 .... . 0.00 .... .
> 0c.26 61 121.40 120.27 1.04 29271 0.43 7.75 3097
> 0.69 5.21 2131 0.00 .... . 0.00 .... .
> 0c.27 59 115.75 114.81 1.03 29820 0.43 5.50 4530
> 0.51 6.00 3321 0.00 .... . 0.00 .... .
> 0c.28 63 125.53 124.15 1.01 30302 0.65 6.94 3808
> 0.72 3.40 5191 0.00 .... . 0.00 .... .
>
>     Both SATA shelves are on controller 0c, attached to the two 3040s.
>     RAID-DP in 13-disk raid groups, so we have 2 parity disks and one
>     spare per shelf.
>     Active/active, single-path HA.
>     Latest firmware/code as of the beginning of the year; Data ONTAP 7.3.2.
>     No VMs, no SnapMirror, nothing fancy that I can think of.
> wafl scan status only shows 'active bitmap rearrangement' or
> 'container block reclamation'.
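>
>     (If the loop layout matters, I can also post the disk path and loop
>     maps, roughly:
>
>         hank> storage show disk -p
>         hank> fcadmin device_map
>
>     which should show both shelves on 0c with hanksata1 terminating the
>     loop.)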
>
> Thanks for thoughts and input!
>
>
>
>
> --
> No Signature Required
> Save The Bits, Save The World!