I have not opened a case with NetApp yet, but I probably will if no one has any good ideas; I like to pick people's brains before going official. Thanks for any input.
A few months ago we moved a file share off a Windows server onto our FAS3040 NetApp running 7.2.4 and shared it out via CIFS. It contains software install files and scripts, and depending on scheduled jobs it can get hit pretty hard, pushing out approximately 1 Gbit/sec. That load has been drastically affecting the service times of our other shares on that filer, and it's mainly the response-sensitive NFS shares, such as mail and web files, that suffer the most. It doesn't really seem to be a disk bottleneck: the disk reads/sec in sysstat are usually only half of what the filer is pushing out to the network, so I assume it's serving some of the data from cache.

The CIFS software install share can get hit by anywhere from 1 to 60+ CIFS clients, each reading files on and off for hours at a time, or sometimes by hundreds of clients at once fetching a smaller set of files (such as when updating one software package across a large set of PCs). I've been able to reproduce the slowdown with just 4 CIFS clients on gigabit downloading a large file from the share. Sometimes it causes only a modest slowdown in NFS response time, but sometimes email messages being moved between folders will stall for 8 seconds or much more, which is pretty much unacceptable.

I don't think it's a bottleneck in my core network, because I've done tests where the slow NFS client is on the same switch as the filer, which is connected via two gigabit links using LACP. Also, in the normal situation where the slowdown is encountered, the mail (NFS) traffic flows through a different gigabit uplink than the hungry CIFS clients use.
Goal: reduce the impact of greedy clients (primarily known ones, but hopefully unexpected ones too) on the response time of the rest of the filer's clients. I don't care if the CIFS software share must accept slower data rates, and I'd rather not just avoid the problem; I want to learn what I can do to prevent my filer from being held hostage by greedy clients. I do have another 3040 I could move the share to, but that filer also has volumes that would be affected in the same way, and I'd rather not concede defeat and go back to hosting the share on a dedicated Windows server. I can try different code versions in a test environment if I need to, but I'd like to think this kind of situation has come up before and has a solution at hand.
I've played around with na_priority, setting the mail and website volumes to high or veryhigh priority and the software share to low or verylow, but that hasn't made a measurable difference. I'm not sure what to tweak or check next.
Here is an example from sysstat when I am simulating the slowdown condition with 4 CIFS clients on gigabit fetching the same file.
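These are the default sysstat columns (no -x); a capture like the following produces them, with the interval in seconds as the argument:

filer> sysstat 1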
 CPU   NFS  CIFS  HTTP   Net kB/s       Disk kB/s      Tape kB/s   Cache
                          in      out    read   write   read write    age
  6%  2058   167     0    751    1543    2196       0      0     0     11
  6%  2590   164     0    699    2238    2904      32      0     0     11
 10%  2183   223     0   1241    4471    5072   17872      0     0     11
 11%  3299   799     0   1577   22194    4935    1183      0     0     11
 22%  3298  3072     0   3005  107869    9128      24      0     0     11
 18%  2532  1986     0   2270   87651    2078       0      0     0     11
 18%  2198  2200     0   1696  105941    8032       8      0     0     11
 16%  3597  1650     0   1890   84691    3528      24      0     0     11
 23%  4946  2216     0   2604  112741   14664       0      0     0     11
 22%  4075  2041     0   2324  100380   21568       0      0     0     11
 CPU   NFS  CIFS  HTTP   Net kB/s       Disk kB/s      Tape kB/s   Cache
                          in      out    read   write   read write    age
 21%  3272  2246     0   2862  115380    4688      24      0     0     11
 21%  4117  2092     0   2686  109165    3864       8      0     0     11
 26%  4188  2136     0   3436  115081   21900       0      0     0     11
 ......(skip)
 30%  7487  1773     0   4261   93385   10156    3328      0     0      6
 25%  4566  1900     0   3339   96655   13764    9808      0     0      7
 24%  2965  2202     0   2477  111493   11772    5475      0     0      8
 23%  5256  1986     0   3093  102409   10508      24      0     0      8
 19%  2979  2068     0   1810  102282    9926       0      0     0      8
 20%  3164  2323     0   2301  111209    1560       8      0     0      8
 23%  7082  2165     0   2322  103816    2292      24      0     0      8
 22% 11780  1158     0   2763   55501    1760       0      0     0      8
 20% 12032   675     0   3820   36504    2452       0      0     0      8
 CPU   NFS  CIFS  HTTP   Net kB/s       Disk kB/s      Tape kB/s   Cache
                          in      out    read   write   read write    age
 23% 16269  1122     0   3914   54034    4460      24      0     0      6
 18%  8991  1030     0   2739   48400    4568       8      0     0      6
 10%  3903   237     0   1346    4494    3828       0      0     0      6
 11%  3912   219     0   1623    4301    3808    6508      0     0      6
  8%  2402   224     0    868    2027    2744    8712      0     0      6
Adam McDougall wrote:
Goal: reduce the impact of greedy clients (primarily known ones, but hopefully unexpected ones too) on the response time of the rest of the filer's clients. I don't care if the CIFS software share must accept
Adam -
I'd suggest taking a look at FlexShare (available since 7.2.x at no additional cost), which was developed for exactly this problem.

It ONLY kicks in when there is contention for resources (e.g. CPU, memory).

Prioritise the NFS workloads to high, and either leave the CIFS workload as is or set it to low.
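A minimal sketch of the setup, assuming volumes named mailvol and winvol (substitute your own volume names):

filer> priority on
filer> priority set volume mailvol level=high
filer> priority set volume winvol level=low
filer> priority show volume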
Regards,
Pat
Exactly what FlexShare is for... just keep in mind that disk iops *are* restricted based on the prioritization regardless of the load (it's a non-work-conserving queue), so monitor the appropriate statistics (I'm research inhibited at the moment, but prioqueue:usr_wait_msecs is close) to make sure things aren't waiting unnecessarily if there is still bandwidth to disk available.
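Something along these lines should dump the queue counters so you can watch the wait times (advanced privilege required; the exact counter names may differ from my guess above):

filer> priv set advanced
filer*> stats show priorityqueue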
--greck
As long as the CIFS share and the NFS exports are not in the same volume ... FlexShare may be perfect for your needs.
With respect to what Greck was talking about ... in order to determine if disk iops are being limited by FlexShare, you'll want to use "stats" and observe the priorityqueue object, and pay attention to:

priorityqueue:(default):usr_read_limit_hit:0   <-- user disk iops
priorityqueue:(default):sys_read_limit_hit:0   <-- system disk iops
If those values become non-zero, you'll want to increase the global "io_concurrency" (defaults to 8, max is 1024):

filer> priority set io_concurrency=<some number>
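For example, to double the default (16 is purely an illustrative value; tune it against the limit-hit counters above rather than picking a number blindly):

filer> priority set io_concurrency=16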
BTW, you need to be in advanced priv for this object:

filer> priv set advanced
filer*> stats start -I foo priorityqueue
(wait 30 seconds or so)
filer*> stats stop -I foo
Pat Breen wrote:

I'd suggest taking a look at FlexShare (available since 7.2.x at no additional cost), which was developed for exactly this problem.
As I understand it, FlexShare is the same thing as na_priority, which I already tried with no obvious results. I wondered if I might need to restart CIFS or anything else to activate the changes; I "enabled" priority and set some priorities. "win" is the software share I spoke of.
priority show volume
Volume    Priority  Relative          Sys Priority
                    Service Priority  (vs User)
home      on        High              Low
mail      on        VeryHigh          Low
scratch   on        VeryLow           Low
sites     on        High              Low
win       on        VeryLow           Medium
Adam McDougall wrote:
Here is an example from sysstat when I am simulating the slowdown condition with 4 CIFS clients on gigabit fetching the same file.
Are the NFS and CIFS clients on the same 1 Gbit link?

I see that you are pushing over 100 MB/s over the network, and if that is all on one link, that seems to me to be the reason for the slow response times. My advice would be to put the NFS and CIFS traffic on different 1 Gbit links.
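A quick way to confirm whether one link is the choke point is the per-interface counters; a sketch, where e0a is a placeholder for your actual interface name:

filer> ifstat -a
filer> ifstat e0a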