I have not opened a case with NetApp yet, but probably will if no one has any good ideas; I just like to pick people's brains before going official. Thanks for any input.
A few months ago we moved a file share off a Windows server onto our FAS3040 NetApp running 7.2.4 and shared it out via CIFS. It contains software install files and scripts, and depending on scheduled jobs it can get hit pretty hard, pushing out roughly 1 Gbit/sec. That load has been drastically hurting service times for the other shares on the filer, and the ones we care about most are the response-sensitive NFS shares, namely mail and web files. It doesn't really seem to be a disk bottleneck, because the disk reads/sec in sysstat are usually only about half of what the filer is pushing out to the network, so I assume it's serving much of the data from cache.

The CIFS install share sees two access patterns: anywhere from 1 to 60+ clients each reading files on and off for hours at a time, or sometimes hundreds of clients hitting a smaller set of files all at once (such as pushing one software package update out to a large set of PCs). I've been able to reproduce the slowdown with just 4 gigabit CIFS clients downloading a large file from the share. Sometimes that causes only a modest slowdown in NFS response time, but sometimes email messages being moved between folders will stall for 8 seconds or much more, which is pretty much unacceptable. (A rough sketch of how the stalls can be timed from an NFS client is below.)

I don't think it's a bottleneck in my core network, because I've run tests with the slow NFS client on the same switch as the filer, which is connected via two gig links using LACP. Also, in the normal situation where the slowdown shows up, the mail (NFS) traffic flows through a different gig uplink than the hungry CIFS clients.
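In case it helps anyone reproduce this, here's a minimal sketch (untested as posted; the mount point is a placeholder) of the kind of probe I use to put numbers on the stalls. Create/rename/unlink are synchronous metadata operations on NFS, so client-side caching can't hide them, and the rename is a rough stand-in for a mail client moving a message between folders:

#!/usr/bin/env python
import os, time

MOUNT = "/mnt/mail"      # placeholder: any directory on the affected NFS mount
THRESHOLD = 0.5          # seconds; log anything slower than this

def probe():
    # create + rename + unlink: each is a synchronous round trip to the filer
    src = os.path.join(MOUNT, ".probe-src")
    dst = os.path.join(MOUNT, ".probe-dst")
    t0 = time.time()
    open(src, "w").close()
    os.rename(src, dst)
    os.unlink(dst)
    return time.time() - t0

while True:
    elapsed = probe()
    if elapsed > THRESHOLD:
        print("%s  probe took %.2f s" % (time.ctime(), elapsed))
    time.sleep(1)

Running that on the mail client while kicking off the 4 CIFS downloads is how I'd put hard numbers on the 8-second stalls.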
Goal: reduce the impact of greedy clients (primarily known ones, but hopefully unexpected ones too) on the response time of the rest of the filer's clients. I don't care if the CIFS software share has to accept slower data rates, and I'd rather not just sidestep the problem; I want to learn how to keep my filer from being held hostage by greedy clients. I do have another 3040 I could move the share to, but that filer also has volumes that would suffer in the same way, and I'd rather not concede defeat and go back to hosting the share on a dedicated Windows server. I can try different code versions in a test environment if I need to, but I'd like to think this kind of situation has come up before and has a known solution.
I've played around with na_priority, setting the mail and website volumes to high or veryhigh priority and the software share to low or verylow, but that isn't making a measurable difference. I'm not really sure what to tweak or check next. For reference, what I've been running looks roughly like the commands below.
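(Volume names are placeholders and the syntax is from memory, so double-check it against the na_priority man page.) FlexShare does nothing until it's enabled globally, which is worth verifying first:

priority on
priority set volume mailvol level=VeryHigh
priority set volume webvol level=VeryHigh
priority set volume swinstall level=VeryLow
priority show volume -v

One knob I haven't touched yet is the per-volume cache policy. If I'm reading the FlexShare docs right, something like

priority set volume swinstall level=VeryLow cache=reuse

should tell WAFL that the install share's blocks aren't worth keeping, so its big sequential reads stop evicting the mail volume's working set. Given that the share seems to be served mostly from cache, that might matter more than the priority levels themselves.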
Here is an example of sysstat output while I simulate the slowdown condition with 4 gigabit CIFS clients fetching the same file:
 CPU    NFS   CIFS  HTTP     Net kB/s      Disk kB/s     Tape kB/s  Cache
                              in     out    read  write   read write   age
  6%   2058    167     0     751    1543    2196      0     0     0    11
  6%   2590    164     0     699    2238    2904     32     0     0    11
 10%   2183    223     0    1241    4471    5072   17872     0     0    11
 11%   3299    799     0    1577   22194    4935    1183     0     0    11
 22%   3298   3072     0    3005  107869    9128      24     0     0    11
 18%   2532   1986     0    2270   87651    2078       0     0     0    11
 18%   2198   2200     0    1696  105941    8032       8     0     0    11
 16%   3597   1650     0    1890   84691    3528      24     0     0    11
 23%   4946   2216     0    2604  112741   14664       0     0     0    11
 22%   4075   2041     0    2324  100380   21568       0     0     0    11
 21%   3272   2246     0    2862  115380    4688      24     0     0    11
 21%   4117   2092     0    2686  109165    3864       8     0     0    11
 26%   4188   2136     0    3436  115081   21900       0     0     0    11
 ......(skip)
 30%   7487   1773     0    4261   93385   10156    3328     0     0     6
 25%   4566   1900     0    3339   96655   13764    9808     0     0     7
 24%   2965   2202     0    2477  111493   11772    5475     0     0     8
 23%   5256   1986     0    3093  102409   10508      24     0     0     8
 19%   2979   2068     0    1810  102282    9926       0     0     0     8
 20%   3164   2323     0    2301  111209    1560       8     0     0     8
 23%   7082   2165     0    2322  103816    2292      24     0     0     8
 22%  11780   1158     0    2763   55501    1760       0     0     0     8
 20%  12032    675     0    3820   36504    2452       0     0     0     8
 23%  16269   1122     0    3914   54034    4460      24     0     0     6
 18%   8991   1030     0    2739   48400    4568       8     0     0     6
 10%   3903    237     0    1346    4494    3828       0     0     0     6
 11%   3912    219     0    1623    4301    3808    6508     0     0     6
  8%   2402    224     0     868    2027    2744    8712     0     0     6
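If it would help, I can also capture this with the utilization view, which (if I remember the columns right) adds cache hit rate, CP time, and disk utilization, and should settle whether the disks are actually the bottleneck or whether the reads really are coming from cache:

sysstat -u 1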