We have discovered an interesting feature on our toasters, and after some consideration have come to the conclusion that it is a misfeature.
As many of you are aware, the raid.reconstruct_speed number determines how much CPU to use in raid reconstruction. This directly controls how long it takes to reconstruct a missing disk. If you are concerned about double disk failure, you set this number higher.
Unfortunately, this number also controls how much CPU is used in raid scrubbing. To make things worse, if you have multiple raid volumes they seem to be scrubbed in parallel, not sequentially.
These are the numbers that we obtained on a 540 with 2 raid groups: group 0 (20 x 4GB Wide) and group 1 (12 x 4GB Narrow).

    Setting                         CPU
    no scrubbing                    6%
    raid.reconstruct_speed = 1      11-12%
    raid.reconstruct_speed = 2      29-33%
    raid.reconstruct_speed = 3      59-65%
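
To make the cost of scrubbing easier to see, here is a small Python sketch that takes the measurements above and subtracts the no-scrubbing baseline. The midpoint of each observed range is my own simplification, and the names (BASELINE_CPU, MEASURED, scrub_overhead) are made up for illustration; nothing here comes from the filer itself.

```python
# Measurements copied from this post (540, two raid groups scrubbing).
BASELINE_CPU = 6.0  # percent CPU with no scrubbing

# raid.reconstruct_speed -> (low, high) observed CPU percent
MEASURED = {
    1: (11.0, 12.0),
    2: (29.0, 33.0),
    3: (59.0, 65.0),
}


def midpoint(low, high):
    """Middle of an observed range (an assumption, not a measured value)."""
    return (low + high) / 2.0


def scrub_overhead(speed):
    """Midpoint CPU at this setting minus the no-scrubbing baseline."""
    low, high = MEASURED[speed]
    return midpoint(low, high) - BASELINE_CPU


if __name__ == "__main__":
    for speed in sorted(MEASURED):
        low, high = MEASURED[speed]
        print("speed=%d  cpu=%.1f%%  scrub overhead=%.1f%%"
              % (speed, midpoint(low, high), scrub_overhead(speed)))
```

By this rough accounting the scrub overhead goes from about 5.5% at speed 1 to 25% at speed 2 and 56% at speed 3, so each step up costs far more than the one before it.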
With raid.reconstruct_speed = 8 the filer didn't seem to be responding to NFS requests, which is how we discovered this.
Now clearly this affects lower end machines (540) more than faster machines. But since scrubbing seems to be very CPU bound it would seem to us that almost any machine could be brought to its knees by this condition.
Our question is this: would it be reasonable to ask NetApp for one or both of the following?
(1) Separate raid reconstruct speed from raid scrub speed.
(2) Add an option to force raid scrubbing to be sequential, perhaps something like raid.scrub_parallel = off.
Clearly, for those concerned about dual disk failures, a high raid.reconstruct_speed is important. Equally clearly, a moderately high raid.reconstruct_speed can bring a 540 with 2 raid groups to its knees. Splitting the speeds, or running raid scrubbing sequentially, might avoid this.
----- Stephen C. Woods; UCLA SEASnet; 2567 Boelter hall; LA CA 90095; (310)-825-8614 Finger for public key scw@cirrus.seas.ucla.edu,Internet mail:scw@SEAS.UCLA.EDU