Recently, we encountered a serious problem in a high performance NFS environment using large Solaris servers as NFS clients. Due to the fantastic assistance of several people who have elected to not be named, we have identified the problem and a resolution for it. Since others on this list are likely to use Solaris servers as NFS clients in high performance environments, it was thought that this information might be generally useful.
Situation and Symptoms:
We were running on E-class Suns using Solaris 2.6 with what we thought were well-considered kernel tuning parameters (cranked-up nfs:nfs_max_threads, maxusers, and rlim_fd_{max,cur}). The clients connected to NetApp F760s through Sun's Gigabit Ethernet interfaces and a high quality Gigabit Ethernet switch. When we turned the application on, we got stunningly low throughput numbers and couldn't figure out why.
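For reference, tunables like these are set in /etc/system. The values below are only illustrative of what we mean by "cranked up" (maxusers=2048 is what we actually ran; the rest are examples, not recommendations):

  * /etc/system -- illustrative client tuning, not a recommendation
  set maxusers = 2048
  set rlim_fd_max = 4096
  set rlim_fd_cur = 1024
  set nfs:nfs_max_threads = 32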
Client CPU load was very light: most of the processing power was idle (and not in WAIT state), memory utilization was low, the filers were running at low utilization, and there were no network or I/O bottlenecks in evidence. We were running two applications that do reads, writes, appends, creates, and deletes on relatively small files over NFS.
One of these applications (the two have fairly similar characteristics and perform computations on the same file sets) was running just fine. The other ran well at light loads but had horrible problems as the load went up. It appeared that some fixed resource was being consumed: once we went over a certain load threshold, the number of processes grew exponentially while the amount of work being done remained constant. Eventually, these extra processes exhausted main memory and the machine began to thrash.
Solution:
Since we don't have access to Sun source, we can't be 100% certain of what was happening, but this is the best information we have. We think what we describe here is exactly what's going on, but there might be some minor variations. If someone here knows Sun internals, maybe they can fill in the gaps.
Basically, it seems that we were having problems with the Directory Name Lookup Cache (DNLC) in Solaris. Between Solaris 2.5.1 and Solaris 2.6, the formula used to size it changed. Here's the math, as best I know it:
  2.5.1 DNLC size = (max_nprocs + 16 + maxusers) + 64
  2.6   DNLC size = 4 * (max_nprocs + maxusers) + 320
According to Cockcroft's Sun Performance and Tuning book,
max_nprocs = 10 + 16 * maxusers
We don't know whether this calculation changed between 2.5.1 and 2.6.
Wanting large numbers of processes and large buffer sizes, we set maxusers=2048. This means that for Solaris 2.5.1, the DNLC size would be 34906 (or so) and for Solaris 2.6, the same maxusers variable would yield a size of 139624.
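Working the numbers through with Cockcroft's max_nprocs formula (Sun's actual calculation may differ slightly):

  max_nprocs   = 10 + 16 * 2048            = 32778
  2.5.1 ncsize = 32778 + 16 + 2048 + 64    = 34906
  2.6 ncsize   = 4 * (32778 + 2048) + 320  = 139624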
Now, as we understand it, this is a LINEAR lookup table. Further, when deleting a file, an entry in this table must be looked up and locked. This was the distinction between our two applications: one does a single delete before completion, the other does three deletes. With a table this size, there seems to be a hard limit on the number of deletes/second one can perform over NFS, and we hit that limit. We put "set ncsize = 8192" in /etc/system, rebooted, and the problem went away. We played with sizes ranging from 4096 to 32768 and saw no huge performance difference (8192 SEEMED best, 32768 SEEMED worst, but that was a VERY subjective evaluation), and we saw no significant difference in our DNLC cache hit rate as measured by "vmstat -s".
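For anyone wanting to try the same fix, this is all we added to /etc/system (the comment line is ours); after the reboot you can sanity-check the hit rate, which should show up on the "total name lookups" line of vmstat -s (exact wording may vary by release):

  * /etc/system -- cap the DNLC instead of letting maxusers inflate it
  set ncsize = 8192

  vmstat -s | grep 'name lookups'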
Additional Information:
One way to check whether you're experiencing this problem is to run "/usr/ucb/ps -uaxl" and look at the WCHAN column. To the best of our knowledge, Sun doesn't publish a translation guide to these event names, but we have it on good authority that a process showing "nc_rele_" (a truncation of "nc_rele_lock") is waiting for DNLC entries to become unlocked. Note: on healthy machines processes will sometimes show up in this state, but if a significant percentage of them are in this state, that may indicate this problem. No, I can't accurately define "significant percentage". I doubt 1% is a problem; I know that 30% or higher is a problem.
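A quick and admittedly crude way to eyeball that percentage (grep -c just counts matching lines; subtract one from the wc -l total for the header):

  /usr/ucb/ps -uaxl | grep -c nc_rele_
  /usr/ucb/ps -uaxl | wc -l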
Also, Sun has a BugId for this problem, 4212925. As of August 2, they have a patch for 2.6, numbered 105720-08. It seems to do a lot of things, but the explanation of it isn't as revealing as we'd like, so we're a little leery of putting it into production without extensive testing. We're playing with it, but can't comment on its efficacy at this time. It might hash the DNLC table (that would be the right solution, and word is this will happen in 2.8), but for all I know it may just revert to 2.5.1's DNLC math.
Summary:
If you're running Solaris 2.6 or 2.7 in a high performance environment where a lot of files are being deleted over NFS, make sure your DNLC is not too large or you'll have HUGE problems. Trust me.
Hope this helps someone.