Approximately late September 2009. I wouldn't be surprised if it was slow before that, but I have no real data to back that up.
On 03/01/10 13:50, Jeff Mohler wrote:
How long has this aggregate been over 95% full?
On Mon, Mar 1, 2010 at 10:34 AM, Adam McDougall <mcdouga9@egr.msu.edu> wrote:
For a long time we've known that backing up our largest volume (3.5T) was slow. More recently I've been investigating why, and it seems like a problem with only that shelf, or possibly that aggregate. Basically it is several times slower than any other shelf/aggregate we have; whether I am reading/writing via NFS, NDMP, reallocate scans, etc., that shelf is always slower. I will probably have a support case opened with NetApp tomorrow, but I feel like checking with the list to see what else I can find out on my own. When doing NDMP backups I get only around 230 Mbit/sec as opposed to 800+ on others. The performance drops distinctly on the hour, probably for snapshots (see pic). Details below. 0c.25 seems like a hot disk, but the activity on that aggr also seems too high since the network bandwidth is fairly small. A 'reallocate measure' on the two large volumes on aggregate hanksata0 returns a score of '1' for both.

I guess my two main questions are: how do I figure out what is causing the activity on hanksata0 (especially the hot disk, which is sometimes at 100%), and if it's not just activity but an actual problem, how could I further debug the slow performance to find out what items are at fault?

I used ndmpcopy to copy a fast volume with large files from another filer to a new volume on hanksata0 and hanksata1. The volume on hanksata0 is slow but the one on hanksata1 is not. Both of those aggregates are on the same loop, with hanksata1 terminating it.

Sun Feb 28 20:14:20 EST [hank: wafl.scan.start:info]: Starting WAFL layout measurement on volume scratchtest.
Sun Feb 28 20:19:01 EST [hank: wafl.reallocate.check.value:info]: Allocation measurement check on '/vol/scratchtest' is 2.
^^^ almost 5 minutes!

Sun Feb 28 20:13:38 EST [hank: wafl.scan.start:info]: Starting WAFL layout measurement on volume scratchtest2.
Sun Feb 28 20:14:12 EST [hank: wafl.reallocate.check.value:info]: Allocation measurement check on '/vol/scratchtest2' is 1.
^^^ less than 1 min

When I write to scratchtest, you can see the network bandwidth jump up for a few seconds, then it stalls for about twice as long, presumably so the filer can catch up on writing, then it repeats. Speed averages around 30-40 MB/sec, if that.

I even tried using the spare SATA disk from both of these shelves to make a new volume and copied scratchtest to it (which took 26 minutes for around 40G); reads were equally slow as from the existing scratchtest, although I'm not sure if that's because a single disk is too slow to prove anything, or there's a shelf problem.

Aggregate                kbytes        used       avail  capacity
hanksata0            6120662048  6041632124    79029924       99%
hanksata0/.snapshot   322140104    14465904   307674200        4%
hanksata1            8162374688  2191140992  5971233696       27%
hanksata1/.snapshot   429598664    39636812   389961852        9%

hanksata0 and hanksata1 are both DS14mk2 AT shelves, but hanksata0 has X268_HGEMI (aka X268A-R5, 750G x 14) disks and hanksata1 has X269_HGEMI (aka X269A-R5, 1T x 14) disks. hanksata0 has been around since we got the filer, say around 2 years ago; hanksata1 was added within the last half year. Both shelves have always had 11 data disks, 2 parity, 1 spare, and the aggregates were never grown.

Volumes on hanksata0 besides root (all created over a year ago):

Volume 1 (research): NO dedupe (too big). 10 million inodes, approx 3.5T, 108G in snapshots. Endures random user read/write but usually fairly light traffic. Populated initially with rsync, then opened to user access via NFS.
Sun Feb 28 21:38:11 EST [hank: wafl.reallocate.check.value:info]: Allocation measurement check on '/vol/research' is 1.
Volume 2 (reinstallbackups): dedupe enabled. 6.6 million files, approx 1.6T, 862G in snapshots. Volume created over a year ago; has several dozen gigs of Windows PC backups written or read multiple times per week using CIFS, but is otherwise COMPLETELY idle. Older data is generally deleted to snapshots after some weeks and the snapshots expire after a few weeks. Only accessed via CIFS.
Mon Mar 1 12:15:58 EST [hank: wafl.reallocate.check.value:info]: Allocation measurement check on '/vol/reinstallbackups' is 1.

hanksata1 only has one volume besides the small test ones I made, and it runs plenty fast: dedupe enabled, 4.3 million files, approx 1.6T, 12G in snapshots. Created a few months ago on an otherwise unused new aggregate with an initial rsync, then daily rsyncs from another fileserver that is not very active.

disk      ut%   xfers  ureads--chain-usecs  writes--chain-usecs  cpreads-chain-usecs  greads--chain-usecs  gwrites-chain-usecs
/hanksata0/plex0/rg0:
0c.16       7    5.69    0.94  1.00  55269    3.22  3.02   2439    1.52  2.71    579    0.00  ....      .    0.00  ....      .
0c.17       9    6.34    0.94  1.00  74308    3.84  2.86   2228    1.56  2.93    873    0.00  ....      .    0.00  ....      .
0c.18      63  121.00  118.86  1.01  30249    1.38  3.26   3516    0.76  5.43   2684    0.00  ....      .    0.00  ....      .
0c.19      60  117.74  116.69  1.00  30546    0.40  3.73   5049    0.65  5.56   2840    0.00  ....      .    0.00  ....      .
0c.20      60  120.82  119.66  1.02  29156    0.43  5.33   5469    0.72  4.80   3583    0.00  ....      .    0.00  ....      .
0c.21      60  119.37  118.25  1.02  29654    0.36  4.60   5870    0.76  5.76   3140    0.00  ....      .    0.00  ....      .
0c.22      62  124.87  123.32  1.02  29423    0.62  5.65   5677    0.94  3.58   2710    0.00  ....      .    0.00  ....      .
0c.23      62  119.48  118.35  1.03  30494    0.36  4.00   6875    0.76  5.14   3417    0.00  ....      .    0.00  ....      .
0c.24      61  119.08  117.96  1.02  29981    0.47  6.92   3289    0.65  3.94   2930    0.00  ....      .    0.00  ....      .
0c.25      93  118.17  116.72  1.03  45454    0.58  4.00  17719    0.87  4.63  11658    0.00  ....      .    0.00  ....      .
0c.26      61  121.40  120.27  1.04  29271    0.43  7.75   3097    0.69  5.21   2131    0.00  ....      .    0.00  ....      .
0c.27      59  115.75  114.81  1.03  29820    0.43  5.50   4530    0.51  6.00   3321    0.00  ....      .    0.00  ....      .
0c.28      63  125.53  124.15  1.01  30302    0.65  6.94   3808    0.72  3.40   5191    0.00  ....      .    0.00  ....      .

Both SATA shelves are on controller 0c, attached to two 3040s. RAID-DP in 13-disk raid groups, so we have 2 parity and one spare per shelf. Active-Active, single-path HA. Latest firmware/code as of the beginning of the year: 7.3.2. No VMs, no SnapMirror, nothing fancy that I can think of. 'wafl scan status' only shows 'active bitmap rearrangement' or 'container block reclamation'.

Thanks for thoughts and input!
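For reference, the per-disk table above is the kind of report statit produces, and the write stalls can be lined up against consistency-point activity with sysstat. A rough 7-mode console sketch (statit needs advanced privilege; exact options and column names vary by release, so check the na_statit, na_sysstat and na_reallocate man pages before relying on this):

    hank> priv set advanced
    hank*> statit -b
    (let the slow workload run for a minute or two)
    hank*> statit -e
    hank*> priv set admin

    hank> sysstat -x 1

    hank> reallocate measure -o /vol/scratchtest

statit -e prints the per-disk ut%/xfers/chain report; in sysstat -x output, watch the CP ty, CP time and Disk util columns while the NFS writes stall to see whether back-to-back CPs line up with the pauses; reallocate measure -o is the one-shot form of the layout check that produced the wafl.reallocate.check.value log lines above (the -o flag is from memory, so treat it as an assumption).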
-- No Signature Required Save The Bits, Save The World!
Questions:
What does the RAID layout look like on the aggregate (aggr status -r aggrname)?
Did you *ever* let this aggregate fill up or get nearly full (90% or more) before adding more disks?
If you added more disks, how were they added? In other words, what was the layout before and after the disk add?
--tmac Tim McCarthy Principal Consultant
RedHat Certified Engineer 804006984323821 (RHEL4) 805007643429572 (RHEL5)
On 03/01/10 14:49, tmac wrote:
Questions:
What does the RAID layout look like on the aggregate (aggr status -r aggrname)?
hank> aggr status -r hanksata0
Aggregate hanksata0 (online, raid_dp) (block checksums)
  Plex /hanksata0/plex0 (online, normal, active)
    RAID group /hanksata0/plex0/rg0 (normal)

      RAID Disk Device  HA  SHELF BAY CHAN Pool Type  RPM  Used (MB/blks)     Phys (MB/blks)
      --------- ------  ------------- ---- ---- ---- ----- --------------     --------------
      dparity   0c.16   0c    1   0   FC:B   -  ATA  7200  635555/1301618176  635858/1302238304
      parity    0c.17   0c    1   1   FC:B   -  ATA  7200  635555/1301618176  635858/1302238304
      data      0c.18   0c    1   2   FC:B   -  ATA  7200  635555/1301618176  635858/1302238304
      data      0c.19   0c    1   3   FC:B   -  ATA  7200  635555/1301618176  635858/1302238304
      data      0c.20   0c    1   4   FC:B   -  ATA  7200  635555/1301618176  635858/1302238304
      data      0c.21   0c    1   5   FC:B   -  ATA  7200  635555/1301618176  635858/1302238304
      data      0c.22   0c    1   6   FC:B   -  ATA  7200  635555/1301618176  635858/1302238304
      data      0c.23   0c    1   7   FC:B   -  ATA  7200  635555/1301618176  635858/1302238304
      data      0c.24   0c    1   8   FC:B   -  ATA  7200  635555/1301618176  635858/1302238304
      data      0c.25   0c    1   9   FC:B   -  ATA  7200  635555/1301618176  635858/1302238304
      data      0c.26   0c    1  10   FC:B   -  ATA  7200  635555/1301618176  635858/1302238304
      data      0c.27   0c    1  11   FC:B   -  ATA  7200  635555/1301618176  635858/1302238304
      data      0c.28   0c    1  12   FC:B   -  ATA  7200  635555/1301618176  635858/1302238304
Did you *ever* let this aggregate fill up or get nearly full (90% or more) before adding more disks?
I have never added more disks to it. I *attempted* to once, but it rejected my request because the aggr would have been over 16T, which is why I created a second aggr just like it with bigger disks that seems to work just fine:
hank> aggr status -r hanksata1
Aggregate hanksata1 (online, raid_dp) (block checksums)
  Plex /hanksata1/plex0 (online, normal, active)
    RAID group /hanksata1/plex0/rg0 (normal)

      RAID Disk Device  HA  SHELF BAY CHAN Pool Type  RPM  Used (MB/blks)     Phys (MB/blks)
      --------- ------  ------------- ---- ---- ---- ----- --------------     --------------
      dparity   0c.39   0c    2   7   FC:B   -  ATA  7200  847555/1735794176  847827/1736350304
      parity    0c.38   0c    2   6   FC:B   -  ATA  7200  847555/1735794176  847827/1736350304
      data      0c.44   0c    2  12   FC:B   -  ATA  7200  847555/1735794176  847827/1736350304
      data      0c.43   0c    2  11   FC:B   -  ATA  7200  847555/1735794176  847827/1736350304
      data      0c.37   0c    2   5   FC:B   -  ATA  7200  847555/1735794176  847827/1736350304
      data      0c.36   0c    2   4   FC:B   -  ATA  7200  847555/1735794176  847827/1736350304
      data      0c.42   0c    2  10   FC:B   -  ATA  7200  847555/1735794176  847827/1736350304
      data      0c.35   0c    2   3   FC:B   -  ATA  7200  847555/1735794176  847827/1736350304
      data      0c.41   0c    2   9   FC:B   -  ATA  7200  847555/1735794176  847827/1736350304
      data      0c.34   0c    2   2   FC:B   -  ATA  7200  847555/1735794176  847827/1736350304
      data      0c.40   0c    2   8   FC:B   -  ATA  7200  847555/1735794176  847827/1736350304
      data      0c.33   0c    2   1   FC:B   -  ATA  7200  847555/1735794176  847827/1736350304
      data      0c.32   0c    2   0   FC:B   -  ATA  7200  847555/1735794176  847827/1736350304
You added more disks after the fact. Data ONTAP would not have laid out the disks like that if they were all there to begin with.
Some things that *might* help:
1. Shut down your filer, pull half the disks out of shelf 1 and shelf 2, and swap them.

2. Make sure you are configured for multipath disk I/O (a quick way to check the current pathing is sketched below).
   -> You should have 0a, 0b, 0c & 0d as controllers.
   If you can, hook 0a to shelf 1 (module A in) and 0c to shelf 1 (module B in).
   If you can, hook 0b to shelf 2 (module A in) and 0d to shelf 2 (module B in).
   -> This gives two paths to each disk and splits all your disks across 4 paths versus 1.

   If you only have two controllers, make sure one is from 0a/0b and the other is from 0c/0d:
   connect one to shelf 1, module A input (then daisy-chain to shelf 2), and
   connect one to shelf 2, module B input (then daisy-chain to shelf 1).
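Before recabling, it may be worth confirming how the disks are pathed today. On 7-mode, something like the following shows each disk's primary/secondary path and the shelf-to-loop layout (a sketch, not output from this filer; formats vary by release):

    hank> storage show disk -p
    hank> fcstat device_map

With single-path HA, the secondary-path column in 'storage show disk -p' will be empty for every disk, and 'fcstat device_map' should show both shelves hanging off the one 0c loop.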
--tmac Tim McCarthy Principal Consultant
RedHat Certified Engineer 804006984323821 (RHEL4) 805007643429572 (RHEL5)
Sorry, missed the Active/Active stuff. The multipath cabling is slightly different for a clustered system. Please refer to the multipath I/O guide on the NOW site for proper cabling techniques.
--tmac Tim McCarthy Principal Consultant
RedHat Certified Engineer 804006984323821 (RHEL4) 805007643429572 (RHEL5)
Okay, you caught me (sort of). I looked back in my documentation just to check. On May 2, 2008 I installed this filer by netbooting 7.2.4 and zeroing all the disks. The installer set up 3 disks automatically:
Fri May 2 00:26:23 GMT [raid.vol.disk.add.done:notice]: Addition of Disk /aggr0/plex0/rg0/0c.18 Shelf 1 Bay 2 [NETAPP X268_HGEMIT75SSX A90A] S/N [P8G8W6ZF] to aggregate aggr0 has completed successfully
Fri May 2 00:26:23 GMT [raid.vol.disk.add.done:notice]: Addition of Disk /aggr0/plex0/rg0/0c.17 Shelf 1 Bay 1 [NETAPP X268_HGEMIT75SSX A90A] S/N [P8G8TT2F] to aggregate aggr0 has completed successfully
Fri May 2 00:26:23 GMT [raid.vol.disk.add.done:notice]: Addition of Disk /aggr0/plex0/rg0/0c.16 Shelf 1 Bay 0 [NETAPP X268_HGEMIT75SSX A90A] S/N [P8G8WG4F] to aggregate aggr0 has completed successfully
I renamed the aggr to hanksata0, renamed vol0 to root, set the raidsize to 13, then added 10 of the 11 spares on that shelf to hanksata0 with this command:

    aggr add hanksata0 -d 0c.19 0c.20 0c.21 0c.22 0c.23 0c.24 0c.25 0c.26 0c.27 0c.28

Then I created my data volume and eventually copied data to it:

    vol create research -l C hanksata0 2900g
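For completeness, the renames and the raidsize change described above map to commands along these lines (reconstructed from the description rather than from logs, so treat the exact syntax as an assumption):

    aggr rename aggr0 hanksata0
    vol rename vol0 root
    aggr options hanksata0 raidsize 13

Raising raidsize to 13 before the 'aggr add' is what allowed all 10 added disks to land in the existing rg0 rather than starting a second raid group.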
I found this in a weekly status email from the filer on 5/18/2008:
===== DF-R =====
Filesystem                  kbytes        used       avail  reserved  Mounted on
/vol/root/                20971520      267724    20703796         0  /vol/root/
/vol/root/.snapshot        5242880       57268     5185612         0  /vol/root/.snapshot
/vol/research/          2949644288    54024652  2895619636         0  /vol/research/
/vol/research/.snapshot   91226112       40836    91185276         0  /vol/research/.snapshot
===== DF-A =====
Aggregate                kbytes        used       avail  capacity
hanksata0            6120662048  3067381464  3053280584       50%
hanksata0/.snapshot   322140104     1433816   320706288        0%
So as of 5/18/2008 the aggregate existed at its current size with nearly nothing on it (about 54 GB of data, it looks like).
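The gap between the two views is space reservation rather than data: with the default volume guarantee, the aggregate's 'used' column counts the full size of each volume. Roughly:

    /vol/root/                20971520 kB
    /vol/root/.snapshot        5242880 kB
    /vol/research/          2949644288 kB
    /vol/research/.snapshot   91226112 kB
    --------------------------------------
    total                   3067084800 kB

which is within about 300 MB of the 3067381464 kB that DF-A reports as used (the remainder presumably being metadata). So the 50% figure reflects volume guarantees, while only ~54 GB of real data was in the aggregate at that point.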
I'm using 0a/0b on both filers for connections to two other FC loops, which were set up at different times, so I did what I could with the resources I had. Yes, ideally I should order more FC interfaces and whatnot to have multipath to each loop, but our usage can tolerate a cluster failover if I lose a cable. Smaller stripes would probably be better, but our setup can perform much faster than we need, so I went for shelf-sized aggregates.
I did consider swapping some or all disks between shelves to see what it does, but I'm evaluating my options first and starting with gentle changes, because it's only the backups that are a concern at this time; users are not reporting problems, so I don't want to introduce downtime yet. Thanks.
Run Perfstat ... likely you've got a bad disk / ESH that's causing the slowness.
~Max
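For anyone who hasn't run it before: perfstat is NetApp's host-side collection script, run from an admin workstation rather than on the filer itself. A minimal sketch, assuming the Unix perfstat.sh variant for 7-mode and working rsh/ssh access to the filer; the exact flags differ between perfstat versions, so check its usage output first:

    # sample filer 'hank' for 4 iterations of 5 minutes each and save the
    # output so it can be attached to the support case
    ./perfstat.sh -f hank -t 5 -i 4 > perfstat_hank.out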
On Mar 1, 2010, at 11:15 AM, Adam McDougall wrote:
Approx late Sept-09. I wouldn't be surprised if it was slow before that but I have no real data to back that up.
On 03/01/10 13:50, Jeff Mohler wrote:
How long has this aggregate been over 95% full?
On Mon, Mar 1, 2010 at 10:34 AM, Adam McDougall <mcdouga9@egr.msu.edu> wrote:
For a long time we've known that backing up our largest volume (3.5T) was slow. More recently I've been investigating why, and it looks like a problem confined to that one shelf, or possibly that aggregate: it is several times slower than any other shelf/aggregate we have, and it appears bottlenecked no matter whether I am reading or writing via NFS, NDMP, reallocate scans, etc. I will probably have a support case opened with NetApp tomorrow, but I wanted to check with the list first to see what else I can find out on my own. NDMP backups from it run at only around 230 Mbit/sec (roughly 29 MB/sec) as opposed to 800+ on other aggregates. The performance also drops distinctly on the hour, probably for snapshots (see pic). Details below. 0c.25 looks like a hot disk, but the overall activity on that aggregate also seems too high given how little network bandwidth is involved. A 'reallocate measure' on each of the two large volumes on aggregate hanksata0 returns a score of '1'.
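Since the dips line up with the top of the hour, the volume snapshot schedules are an easy thing to rule in or out; a quick check on the console (volume names are the ones described further down):

    snap sched research
    snap sched reinstallbackups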
I guess my two main questions are: how do I figure out what is causing the activity on hanksata0 (especially the hot disk, which is sometimes at 100%), and if it's not just activity but an actual problem, how can I further debug the slow performance to find out which components are at fault?
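For reference, the per-disk table further down looks like standard statit output; a rough sketch of that kind of sampling on a 7-mode console, with an arbitrary 60-second window, is:

    priv set advanced
    statit -b                  # begin collecting per-disk statistics
    # ...wait about 60 seconds while the workload is running...
    statit -e                  # end collection and print the report
    wafl scan status           # list background WAFL scanners adding their own I/O
    priv set admin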
I used ndmpcopy to copy a fast volume with large files from another filer to a new volume on hanksata0 and hanksata1. The volume on hanksata0 is slow but the one on hanksata1 is not. Both of those aggregates are on the same loop with hanksata1 terminating it.
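A sketch of that A/B test for anyone who wants to repeat it; 'otherfiler', 'fastvol' and the 50g sizes are placeholders, and depending on ndmpd security settings the -sa/-da user:password options may also be needed:

    vol create scratchtest  hanksata0 50g
    vol create scratchtest2 hanksata1 50g
    ndmpcopy otherfiler:/vol/fastvol /vol/scratchtest
    ndmpcopy otherfiler:/vol/fastvol /vol/scratchtest2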
Sun Feb 28 20:14:20 EST [hank: wafl.scan.start:info]: Starting WAFL layout measurement on volume scratchtest.
Sun Feb 28 20:19:01 EST [hank: wafl.reallocate.check.value:info]: Allocation measurement check on '/vol/scratchtest' is 2.
^^^ almost 5 minutes!
Sun Feb 28 20:13:38 EST [hank: wafl.scan.start:info]: Starting WAFL layout measurement on volume scratchtest2.
Sun Feb 28 20:14:12 EST [hank: wafl.reallocate.check.value:info]: Allocation measurement check on '/vol/scratchtest2' is 1.
^^^ less than 1 min
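For reference, one-shot measurements like the two above can be run with something along these lines (7-mode syntax; the result also lands in the messages log as the wafl.reallocate.check.value events quoted here):

    reallocate measure -o /vol/scratchtest    # one-shot layout measurement
    reallocate status -v                      # progress and last measured value
    # only if a volume actually measured badly (check the 7.3.2 docs for -f/-p behavior first,
    # and note the aggregate at 99% leaves little free space for a reallocation pass to use):
    # reallocate start -f -p /vol/research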
When I write to scratchtest, the network bandwidth jumps up for a few seconds, then stalls for roughly twice as long, presumably while the filer catches up on writing, and then the cycle repeats. Speed averages around 30-40 MB/sec, if that.
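That burst/stall rhythm is what back-to-back consistency points usually look like from the client side; one way to check, while a write to scratchtest is running, is to watch the CP columns in sysstat:

    sysstat -x 1
    # In the 'CP ty' column, 'B' or 'b' entries mean back-to-back CPs: the next CP
    # starts while the previous one is still flushing, so client writes get throttled.
    # The 'Disk util' column shows whether the data disks themselves are the bottleneck.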
I even tried using the spare SATA disk from both of these shelves to make a new volume and copied scratchtest to it, which took 26 minutes for around 40G (about 26 MB/sec). Reads from it were just as slow as from the existing scratchtest, although I'm not sure whether that's because a single disk is too slow to prove anything or because there's a shelf problem.
Aggregate                 kbytes        used       avail  capacity
hanksata0             6120662048  6041632124    79029924       99%
hanksata0/.snapshot    322140104    14465904   307674200        4%
hanksata1             8162374688  2191140992  5971233696       27%
hanksata1/.snapshot    429598664    39636812   389961852        9%
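Worth noting next to Jeff's question: hanksata0 has roughly 75 GiB free out of about 5.7 TiB, and a SATA aggregate kept that full for months is a prime suspect for exactly this kind of slowdown, since WAFL is left with very little contiguous free space to write into. A couple of commands that show where the space is going (the size flags may vary slightly by release):

    df -A hanksata0
    aggr show_space -g hanksata0    # usage broken down by volume, WAFL reserve and snap reserve
    snap list -A hanksata0          # aggregate-level snapshots, if any exist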
hanksata0 and hanksata1 are both DS14mk2 AT shelves, but hanksata0 has X268_HGEMI aka X268A-R5 disks (750GB x 14) and hanksata1 has X269_HGEMI aka X269A-R5 disks (1TB x 14). hanksata0 has been around since we got the filer, roughly 2 years ago; hanksata1 was added within the last half year. Both shelves have always had 11 data disks, 2 parity, 1 spare, and the aggregates were never grown.
volumes on hanksata0 besides root (all created over a year ago):
volume 1 (research): NO dedupe (too big). 10 million inodes, approx 3.5T, 108G in snapshots. Endures random user read/write but usually fairly light traffic. Populated initially with rsync, then opened to user access via NFS.
Sun Feb 28 21:38:11 EST [hank: wafl.reallocate.check.value:info]: Allocation measurement check on '/vol/research' is 1.
volume 2 (reinstallbackups): dedupe enabled. 6.6 million files, approx 1.6T, 862G in snapshots. Created over a year ago; several dozen gigs of Windows PC backups are written or read multiple times per week using CIFS, but the volume is otherwise COMPLETELY idle. Older data is generally deleted after some weeks (surviving only in snapshots), and the snapshots expire after a few weeks. Only accessed via CIFS.
Mon Mar 1 12:15:58 EST [hank: wafl.reallocate.check.value:info]: Allocation measurement check on '/vol/reinstallbackups' is 1.
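If dedupe scans or snapshot churn on reinstallbackups are suspected of contributing to the background load on hanksata0, these are quick checks (volume names as above):

    sis status /vol/reinstallbackups    # whether a dedupe scan is idle or currently running
    df -s /vol/reinstallbackups         # space saved by dedupe on that volume
    snap list reinstallbackups          # snapshot ages and sizes
    snap delta research                 # rate of change between snapshots on the big volume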
hanksata1 only has one volume besides the small test ones I made, and it runs plenty fast: dedupe enabled, 4.3 million files, approx 1.6T, 12G in snapshots. It was created a few months ago on an otherwise unused new aggregate with an initial rsync, followed by daily rsyncs from another fileserver that is not very active.
disk      ut%   xfers  ureads--chain-usecs  writes--chain-usecs  cpreads-chain-usecs  greads--chain-usecs  gwrites-chain-usecs
/hanksata0/plex0/rg0:
0c.16       7    5.69    0.94  1.00  55269   3.22  3.02   2439   1.52  2.71    579   0.00  ....      .   0.00  ....      .
0c.17       9    6.34    0.94  1.00  74308   3.84  2.86   2228   1.56  2.93    873   0.00  ....      .   0.00  ....      .
0c.18      63  121.00  118.86  1.01  30249   1.38  3.26   3516   0.76  5.43   2684   0.00  ....      .   0.00  ....      .
0c.19      60  117.74  116.69  1.00  30546   0.40  3.73   5049   0.65  5.56   2840   0.00  ....      .   0.00  ....      .
0c.20      60  120.82  119.66  1.02  29156   0.43  5.33   5469   0.72  4.80   3583   0.00  ....      .   0.00  ....      .
0c.21      60  119.37  118.25  1.02  29654   0.36  4.60   5870   0.76  5.76   3140   0.00  ....      .   0.00  ....      .
0c.22      62  124.87  123.32  1.02  29423   0.62  5.65   5677   0.94  3.58   2710   0.00  ....      .   0.00  ....      .
0c.23      62  119.48  118.35  1.03  30494   0.36  4.00   6875   0.76  5.14   3417   0.00  ....      .   0.00  ....      .
0c.24      61  119.08  117.96  1.02  29981   0.47  6.92   3289   0.65  3.94   2930   0.00  ....      .   0.00  ....      .
0c.25      93  118.17  116.72  1.03  45454   0.58  4.00  17719   0.87  4.63  11658   0.00  ....      .   0.00  ....      .
0c.26      61  121.40  120.27  1.04  29271   0.43  7.75   3097   0.69  5.21   2131   0.00  ....      .   0.00  ....      .
0c.27      59  115.75  114.81  1.03  29820   0.43  5.50   4530   0.51  6.00   3321   0.00  ....      .   0.00  ....      .
0c.28      63  125.53  124.15  1.01  30302   0.65  6.94   3808   0.72  3.40   5191   0.00  ....      .   0.00  ....      .
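Reading that table, 0c.25 stands out on latency as well as on ut%: with chain lengths of about 1, its user reads average around 45 ms apiece (45454 usecs) versus roughly 30 ms on every other data disk, and its write and cpread times are several times higher than its neighbours', which fits Max's bad-disk/ESH theory. Some non-disruptive checks for the disk and the loop it sits on (output and availability vary slightly by Data ONTAP release):

    storage show disk -p    # primary/secondary path and loop for each disk
    fcstat link_stats       # per-loop-ID link error counters (loss of sync, CRC errors, etc.)
    fcstat device_map       # loop map; gaps or odd ordering can point at a shelf/ESH problem
    sysconfig -r            # RAID layout, spares, and any broken disks

If 0c.25 really is sick, 'disk replace start 0c.25 <sparename>' can copy it off to a spare without a full reconstruction, though that is worth confirming with support first.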
Both SATA shelves are on controller 0c, attached to two 3040s. RAID-DP in 13-disk raid groups, so we have 2 parity disks and one spare per shelf. Active-Active, single-path HA. Latest firmware/code as of the beginning of the year; Data ONTAP 7.3.2. No VMs, no SnapMirror, nothing fancy that I can think of. 'wafl scan status' only shows 'active bitmap rearrangement' or 'container block reclamation'.
Thanks for thoughts and input!
-- No Signature Required Save The Bits, Save The World!