Hi all -
I have an 8040 with 88 x 900G 10k disks, all assigned to a single aggregate on one of the controllers. There are a few volumes on here, all vSphere NFS datastores. This aggregate also has a slice of flash pool assigned to it, currently about 900GB usable.
We recently deployed some CentOS 6 VMs on these datastores that are running Solr, an application used for distributed indexing. The replication is done in a typical master/slave relationship. My understanding of Solr's replication is that it runs periodically: the slaves download any new index files that exist on the master but not on the slaves into a temp location, and then the slaves replace their existing index files with the new files from the master. So it appears to be a mostly sequential write process.
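(For reference, the slaves' replication status can be inspected over HTTP through the ReplicationHandler; host and core names below are placeholders, so adjust for your setup:

curl 'http://solr-slave:8983/solr/<core>/replication?command=details'

That should report the master URL, the poll interval, and whether a fetch is currently in progress.)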
During the replication events, we are seeing the controller hosting this particular datastore basically getting crushed and issuing B and b CPs. Here is some output of sysstat during one of the replication events:
 CPU   Total     Net kB/s       Disk kB/s    Tape kB/s  Cache Cache    CP  CP  Disk
       ops/s     in     out    read   write  read write   age   hit  time  ty  util
  7%     854   60795   38643   14108  107216     0     0    20   96%  100%   :    9%
  7%     991   61950   41930    6542   89350     0     0    20   95%  100%   :    9%
  4%     977   62900   38820    1244    2932     0     0    20   93%    9%   :    1%
  4%     811   52853   35658      76      12     0     0    20   96%    0%   -    1%
  5%     961   67428   43600      60      12     0     0    20   97%    0%   -    1%
  4%     875   57204   41222      66       4     0     0    20   97%    0%   -    1%
  5%    1211   78933   59481     110      12     0     0    20   97%    0%   -    1%
 16%    1024   55549   31785   33306   84626     0     0    20   97%   89%   T   14%
  7%    1164   56356   36122   14830  102808     0     0    20   96%  100%   :    8%
 49%   13991  909816   56134    3926   62136     0     0    24   82%  100%   B    7%
 78%   13154  842333   55302   53011  868408     0     0    24   83%  100%   :   51%
 83%   12758  818914   59706   44897  742156     0     0    23   89%   97%   F   45%
 84%   11997  765669   53760   64084  958309     0     0    26   89%  100%   B   59%
 80%   11823  725972   46004   73227  867704     0     0    26   88%  100%   B   51%
 83%   15125  957531   46144   42439  614295     0     0    23   87%  100%   B   36%
 74%    9584  612985   42404   67147  839408     0     0    24   93%  100%   B   48%
 78%   11367  751672   64071   49881  770340     0     0    24   88%  100%   B   46%
 79%   12468  822736   53757   38995  595721     0     0    24   87%  100%   #   34%
 56%    6315  396022   48623   42597  601630     0     0    24   94%  100%   B   35%
 67%    7923  554797   56459   26309  715759     0     0    24   87%  100%   #   43%
 69%   13719  879990   37401   41532  333768     0     0    22   87%  100%   B   22%
 45%      24   52946   42826   33186  736345     0     0    22   98%  100%   #   41%
 72%   13909  888007   46266   29109  485422     0     0    22   87%  100%   B   28%
 59%    8036  523206   53199   41719  646767     0     0    22   90%  100%   B   37%
 68%    7336  505544   63590   46602  870744     0     0    22   91%  100%   B   49%
 71%   12673  809175   29070   21208  556669     0     0     6   89%  100%   #   38%
 70%   12097  726574   49381   36251  588939     0     0    24   90%  100%   B   35%
And here is some iostat output from one of the Solr slaves during the same timeframe:
12/03/2015 06:48:36 PM
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           7.54    0.00    7.42   44.12    0.00   40.92

Device:  rrqm/s    wrqm/s   r/s       w/s   rMB/s   wMB/s avgrq-sz avgqu-sz   await  svctm  %util
sda        0.00      4.50  0.00      0.00    0.00    0.00     0.00     5.46    0.00   0.00  62.65
sdb        0.00  26670.00  0.00    190.50    0.00   95.25  1024.00   162.75  214.87   5.25 100.00
dm-0       0.00      0.00  1.00     11.50    0.00    0.04     8.00     5.59    0.00  50.12  62.65
dm-1       0.00      0.00  0.00      0.00    0.00    0.00     0.00     0.00    0.00   0.00   0.00
dm-2       0.00      0.00  0.00      3.00    0.00    0.01     8.00     2.44    0.00 135.33  40.60
dm-3       0.00      0.00  0.00  26880.00    0.00  105.00     8.00 20828.90  194.77   0.04 100.00

12/03/2015 06:48:38 PM
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           9.23    0.00   16.03   24.23    0.00   50.51

Device:  rrqm/s    wrqm/s   r/s       w/s   rMB/s   wMB/s avgrq-sz avgqu-sz   await  svctm  %util
sda        0.00    177.00  1.00     19.50    0.00    0.79    78.83     7.91  651.90  16.59  34.00
sdb        0.00  73729.00  0.00    599.50    0.00  299.52  1023.23   142.51  389.81   1.67 100.00
dm-0       0.00      0.00  0.00      0.00    0.00    0.00     0.00     4.56    0.00   0.00  27.55
dm-1       0.00      0.00  0.00      0.00    0.00    0.00     0.00     0.00    0.00   0.00   0.00
dm-2       0.00      0.00  0.00    186.50    0.00    0.73     8.00    87.75  483.59   1.82  34.00
dm-3       0.00      0.00  0.00  74310.00    0.00  290.27     8.00 18224.54  402.32   0.01 100.00

12/03/2015 06:48:40 PM
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           9.27    0.00   10.04   22.91    0.00   57.79

Device:  rrqm/s    wrqm/s   r/s       w/s   rMB/s   wMB/s avgrq-sz avgqu-sz   await  svctm  %util
sda        0.00      0.00  0.00      0.00    0.00    0.00     0.00     0.00    0.00   0.00   0.00
sdb        0.00  24955.50  0.00    202.00    0.00  101.00  1024.00   142.07  866.56   4.95 100.05
dm-0       0.00      0.00  0.00      0.00    0.00    0.00     0.00     0.00    0.00   0.00   0.00
dm-1       0.00      0.00  0.00      0.00    0.00    0.00     0.00     0.00    0.00   0.00   0.00
dm-2       0.00      0.00  0.00      0.00    0.00    0.00     0.00     0.00    0.00   0.00   0.00
dm-3       0.00      0.00  0.00  25151.50    0.00   98.25     8.00 18181.29  890.67   0.04 100.05

12/03/2015 06:48:42 PM
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           9.09    0.00   12.08   21.95    0.00   56.88

Device:  rrqm/s    wrqm/s   r/s       w/s   rMB/s   wMB/s avgrq-sz avgqu-sz   await  svctm  %util
sda        0.00      2.50  0.00      1.50    0.00    0.01    18.67     0.46   36.33 295.33  44.30
sdb        0.00  59880.50  0.00    461.50    0.00  230.75  1024.00   144.82  173.12   2.17  99.95
dm-0       0.00      0.00  0.00      1.00    0.00    0.00     8.00     0.81    0.00 407.50  40.75
dm-1       0.00      0.00  0.00      0.00    0.00    0.00     0.00     0.00    0.00   0.00   0.00
dm-2       0.00      0.00  0.00      3.50    0.00    0.01     8.00     0.13   37.29  10.14   3.55
dm-3       0.00      0.00  0.00  60352.50    0.00  235.75     8.00 18538.70  169.30   0.02 100.00
As you can see, we are getting some decent throughput, but it causes the latency to spike on the filer. I have heard that the avgrq-sz in iostat is related to the block size; can anyone verify that? Is a 1MB block size too much for the filer? I am still researching whether there is a way to modify this in Solr, but I haven't come up with much yet. Note, the old Solr slaves were physical DL360p's with only a local 2-disk 10k RAID1. The new slaves and relay-master are all connected with 10Gb, which removed the 1Gb network bottleneck for the replication and could be uncorking the bottle, so to speak. I'm still at a loss as to why this is hurting the filer so much, though.
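(Side note on avgrq-sz: my understanding is that iostat reports it in 512-byte sectors, so 1024 works out to roughly 512KB per request; that also matches wMB/s divided by w/s above, e.g. 95.25 / 190.5 ≈ 0.5MB per write.)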
Any ideas?
Jr> I have an 8040 with 88 x 900G 10k disks, all assigned to a single aggregate on one of the controllers. There are a few volumes on here, all vSphere NFS datastores. This aggregate also has a slice of flash pool assigned to it, currently about 900GB usable.
Do you have compression or dedupe turned on for these volumes? And how much space is your SOLR data taking?
Jr> We recently deployed some CentOS 6 VMs on these datastores that are running Solr, an application used for distributed indexing. The replication is done in a typical master/slave relationship. My understanding of Solr's replication is that it runs periodically: the slaves download any new index files that exist on the master but not on the slaves into a temp location, and then the slaves replace their existing index files with the new files from the master. So it appears to be a mostly sequential write process.
That would imply to me that all the indexing of the documents happens on the master, and that these slaves are just querying the index. If they're copies, do you need to keep them in the datastore at all? Would it be more effective to keep them purely local to the VMs, either on local datastores on the ESX host(s), or even in memory? Though in that case I'd argue that just having one big VM with enough memory to cache the read-only index would make more sense... but reliability would suffer, of course.
Jr> During the replication events, we are seeing the controller hosting this particular datastore basically getting crushed and issuing B and b CPs. Here is some output of sysstat during one of the replication events:
Can the replication events be staggered, or does each SOLR slave wake up at the same time, copy the file, and then write it to its own datastore, which I suspect is distinct for some of them?
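If Solr can't stagger this natively, I believe the replication handler can be driven over HTTP, so a per-slave cron job could spread the pulls out. Roughly (host and core names are placeholders, and worth checking against your Solr version's docs):

# turn off automatic polling on the slave
curl 'http://solr-slave1:8983/solr/<core>/replication?command=disablepoll'
# then trigger the pull on each slave's own schedule
curl 'http://solr-slave1:8983/solr/<core>/replication?command=fetchindex'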
Jr> And here is some iostat output from one of the Solr slaves during the same timeframe:
Your write numbers are simply huge! No wonder it's getting crushed. Maybe your SOLR setup can be changed so the six VMs share a datastore and only one of them actually updates it, while the others are just read-only slaves doing lookups?
And do you have a flashcache? Maybe setting up a de-duped dedicated datastore for these SOLR clients would be the way to go here. How big is this data set?
I've heard of issues with the performance of flashpool in versions of CDOT under 8.3 when background cleanup tasks like snapshot deletes are done. That said, I'd open a ticket with Netapp and have them analyze it.
This is actually 8.3. I will open up a case with NetApp eventually, just figured I would ask here first to see if anyone had any quick ideas.
filer::> version
NetApp Release 8.3: Mon Mar 09 23:01:28 PDT 2015
On Thu, Dec 3, 2015 at 3:09 PM, Basil basilberntsen@gmail.com wrote:
I've heard of issues with the performance of flashpool in versions of CDOT under 8.3 when background cleanup tasks like snapshot deletes are done. That said, I'd open a ticket with Netapp and have them analyze it.
Yeah, I replied to another email but I think I forgot to copy the list. The slaves all replicate on a 3 minute staggered interval, so while they could potentially line up and write the index files at the same time, I observe the B CP when only one slave is writing.
Bummer.
On Thu, Dec 3, 2015 at 3:59 PM, basilberntsen@gmail.com wrote:
You might just be seeing the unavoidable performance of a system stretched as far as it'll go. You could improve system health by using QOS to throttle incoming writes, but that would increase host-observed latency. You could also, as mentioned, stagger your IO, if the Solr side supports that kind of thing.
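Roughly something like this, if you go the QOS route (policy-group name and limit are only examples, so check the exact syntax on your release):

filer::> qos policy-group create -policy-group solr_throttle -vserver <svm> -max-throughput 200MB/s
filer::> volume modify -vserver <svm> -volume <datastore_vol> -qos-policy-group solr_throttle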
"Jr" == Jr Gardner phil.gardnerjr@gmail.com writes:
Jr> Yeah, I replied to another email but I think I forgot to copy the list. The slaves all replicate on a 3 minute staggered interval, so while they could potentially line up and write the index files at the same time, I observe the B CP when only one slave is writing.
What's the load on the Netapp when no nodes are writing at all? Are you getting hit by lots of writes then? If so... you need more spindles. And how full is the aggregate? And how busy/full are the other volumes?
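Off the top of my head, something like this should show the steady-state load and how full the aggregate is (cDOT syntax, roughly; adjust names as needed):

filer::> statistics show-periodic
filer::> storage aggregate show -aggregate <aggr_name>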
Hi Philip

I wonder about partition alignment, although the disk writes look fairly symmetric with the net in. Can you provide nfsstat -d?

Also, CentOS 6 should be aligned by default, although it is possible to misalign manually-created partitions if using older tools and parameters.
Peter
This is CDOT; is there an nfsstat equivalent? I am definitely well aware of misalignment issues, since they have been a problem for us in the past.
These VMs are set up with a separate virtual disk for the database volume, using the device directly as an LVM physical device without a partition. Newer versions (>=EL6) of pvcreate set the start of the alignment to the 1MB boundary I believe.
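(If it's worth double-checking, something like the following should show where the data area starts on the PV; with a default EL6 pvcreate the 1st PE offset should come back at 1MiB:

pvs -o +pe_start --units m /dev/sdb

where pe_start is the offset of the first physical extent from the start of the device.)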
To check alignment:
set diag
statistics start -object lun -vserver xyz
statistics stop
statistics show -object lun -instance * -counter write_align_histo|read_align_histo
The output looks like this:
read_align_histo  -
                0    23
                1     1
                2     1
                3    25
                4     1
                5     2
                6     5
                7     8
write_align_histo -
                0    14
                1     1
                2     0
                3     5
                4     1
                5     0
                6     1
                7     0
That's an example of some misaligned IO: the counter for reads at offset 3 is 25. Here's what a healthy one looks like:
Counter                          Value
-------------------------------- --------------------------------
read_align_histo  -
                0   100
                1     0
                2     0
                3     0
                4     0
                5     0
                6     0
                7     0
write_align_histo -
                0    98
                1     0
                2     0
                3     0
                4     0
                5     0
                6     0
                7     0
Yeah, but that’s for LUNs. There isn’t an equivalent for NFS/files that I know of.
However, if you're using LVM on a bare disk with no partition, it should be aligned with offset 0. And yes, newer distros align on 1MB increments, which is nice and aligned.
Peter
From: Basil [mailto:basilberntsen@gmail.com] Sent: Thursday, December 03, 2015 1:22 PM To: Philip Gardner, Jr. Cc: Learmonth, Peter; Toasters Subject: Re: FAS8040 getting crushed by Solr replication
To check alignment:
set diag
statistics start -object lun -vserver xyz statistics stop statistics show -object lun -instance * -counter write_align_histo|read_align_histo
The output looks like this:
read_align_histo - 0 23 1 1 2 1 3 25 4 1 5 2 6 5 7 8 write_align_histo - 0 14 1 1 2 0 3 5 4 1 5 0 6 1 7 0
That's an example of some misaligned IO- the counter for reads offset by 3 is 25, in this example. Here's what a healthy one looks like:
Counter Value -------------------------------- -------------------------------- read_align_histo - 0 100 1 0 2 0 3 0 4 0 5 0 6 0 7 0 write_align_histo - 0 98 1 0 2 0 3 0 4 0 5 0 6 0 7 0
On Thu, Dec 3, 2015 at 4:02 PM, Philip Gardner, Jr. <phil.gardnerjr@gmail.commailto:phil.gardnerjr@gmail.com> wrote: This is CDOT, is there a nfsstat equivalent? I am definitely well aware of misalignment issues since it has been an issue for us in the past. These VMs are set up with a separate virtual disk for the database volume, using the device directly as an LVM physical device without a partition. Newer versions (>=EL6) of pvcreate set the start of the alignment to the 1MB boundary I believe.
On Thu, Dec 3, 2015 at 3:12 PM, Learmonth, Peter <Peter.Learmonth@netapp.commailto:Peter.Learmonth@netapp.com> wrote: Hi Philip I wonder about partition alignment, although the disk writes look fairly symmetric with the net in. Can you provide nfsstat –d? Also, CentOS 6 should be aligned by default, although it is possible to misalign manually-created partitions if using older tools and parameters.
Peter
I don't think there's going to be a quick fix, but this is the sort of thing that is comparatively easy to diagnose with a perfstat and a support case. If there's a storage bottleneck, it should jump right out at us.
The harder question is what to do about it. In some cases it's pretty obvious the system is hitting the wall on pure IO. In other cases, there are strange things in the IO patterns and the overall IO can be optimized.
Looking at the write throughput, I would guess you're closing in on the write limits of this system.
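If you open that case, a perfstat is what support will ask for, but something like the following during a replication event (the node name is a placeholder) also gives a good picture of where the time is going:

system node run -node <node-name> -command "sysstat -x 1"    # extended per-node version of the sysstat output earlier in the thread
statistics show-periodic -interval 1 -iterations 60          # cluster-shell counterpart, sampled once a second for a minute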
Philip Gardner, Jr. wrote:
During the replication events, we are seeing the controller hosting this particular datastore basically getting crushed and issuing B and b CPs. Here is some output of sysstat during one of the replication events:
Many/most people here probably already know this, but just to be ultra safe: in practice there is no difference in badness between a 'B' (back-to-back CP) and a 'b' (deferred back-to-back CP).
Exactly what triggers a 'b' versus a 'B', I don't know.
It's not that the distinction is unimportant (it is important); it's just that the two are equally bad for performance.
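If you want to see how much of a replication window is actually spent in back-to-back CPs, one rough trick (the column position is assumed from the sysstat output earlier in the thread, and sysstat.log is just a placeholder name) is to capture plain sysstat to a file and tally the CP type column:

# capture "sysstat 1" during a replication event (e.g. from an ssh session on an admin host), then:
awk '/%$/ { print $(NF-1) }' sysstat.log | sort | uniq -c

That prints a count per CP type character, so the share of B/b intervals falls straight out.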
/M