Recently, we encountered a serious problem in a high performance NFS
environment using large Solaris servers as NFS clients. Due to the
fantastic assistance of several people who have elected not to be named,
we have identified the problem and a resolution for it. Since others
on this list are likely to use Solaris servers as NFS clients in
high performance environments, it was thought that this information
might be generally useful.
Situation and Symptoms:
We were running on E-class Suns using Solaris 2.6 with what we thought
were well-considered kernel tuning parameters (cranked up nfs:nfs_max_threads,
maxusers, and rlim_fd_{max,cur}). The NFS-mounted storage was on NetApp
F760s, reached through Sun's Gigabit Ethernet interfaces and a high quality
Gigabit Ethernet switch.
When we turned the application on, we got stunningly low throughput
numbers, and couldn't figure out why.
Client CPU load looked healthy: most of the processing power was idle
(and not in WAIT state), memory utilization was low, the filers were
running at low utilization, and there were no network or I/O bottlenecks
in evidence.
We were running two applications that do reads, writes, appends, creates,
and deletes on relatively small files over NFS.
Both applications have fairly similar characteristics and perform
computations on the same file sets. One of them was running just fine.
The other was running well at light loads, but had horrible problems
as the load went up. It appeared that some fixed resource was being
consumed, and once we went over a certain load threshold, the number
of processes grew exponentially while the amount of work being done
remained constant. Eventually, these extra processes exhausted main
memory and the machine began to thrash.
Solution:
Since we don't have access to Sun source, we can't be 100% certain of
what was happening, but this is the best information we have. We think
what we describe here is exactly what's going on, but there might be
some minor variations. If someone here knows Sun internals, maybe they can
fill in the gaps.
Basically, we were having problems with the Directory Name Lookup
Cache (DNLC) in Solaris. It seems that between Solaris 2.5.1 and
Solaris 2.6 the formula for sizing this cache changed. Here's the
math, as best I know it:
2.5.1 DNLC size = (max_nprocs + 16 + maxusers) + 64
2.6 DNLC size = 4 * (max_nprocs + maxusers) + 320
According to Cockcroft's Sun Performance and Tuning book,
max_nprocs = 10 + 16 * maxusers
We don't know if this calculation changed between 2.5.1 and 2.6.
Wanting large numbers of processes and large buffer sizes, we set
maxusers=2048. This means that for Solaris 2.5.1, the DNLC size would
be 34906 (or so) and for Solaris 2.6, the same maxusers variable would
yield a size of 139624.
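For the curious, the arithmetic, assuming the formulas above are right,
works out like this:

    max_nprocs      = 10 + 16 * 2048           = 32778
    2.5.1 DNLC size = (32778 + 16 + 2048) + 64 = 34906
    2.6 DNLC size   = 4 * (32778 + 2048) + 320 = 139624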
Now, as we understand it, this is a LINEAR lookup table. Further, when
deleting a file, an entry in this table must be looked up and locked.
This was the distinction between our two applications: one does a single
delete before completion, the other does three deletes. With a table
this size, there seems to be a hard limit on the number of deletes per
second one can perform over NFS, and we hit it. We put "set ncsize = 8192"
in /etc/system, rebooted, and the problem went away. We played with
sizes ranging from 4096 to 32768 and saw no huge performance difference
(8192 SEEMED best, 32768 SEEMED worst, but that was a VERY subjective
evaluation), and we saw no significant difference in our DNLC cache hit
rate as measured by "vmstat -s".
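For reference, the /etc/system change is just the one line (lines starting
with an asterisk are comments):

    * Cap the DNLC at 8192 entries instead of the value derived from
    * maxusers (ncsize is the kernel variable behind the DNLC size)
    set ncsize = 8192

If you want to confirm the running kernel picked the value up after the
reboot, poking at ncsize with adb is one way we know of to do it (it
prints the variable in decimal):

    echo "ncsize/D" | adb -k /dev/ksyms /dev/mem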
Additional Information:
One way to check whether you're experiencing this problem is to run
"/usr/ucb/ps -uaxl" and look at the WCHAN column. To the best of
our knowledge, Sun doesn't publish a translation guide to these event
names, but we have it on good authority that if you see one called
"nc_rele_" (which is a truncation of "nc_rele_lock") the process is
waiting for DNLC entries to become unlocked. Note: On healthy machines
processes will sometimes show up in this state, but if a significant
percentage of them are in this state, that may indicate this problem.
No, I can't accurately define "significant percentage". I doubt 1% is a
problem; I know that 30% or higher is a problem.
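If you'd rather have a rough number than eyeball the ps output, something
like the following works; it just pattern-matches the WCHAN string, so
treat the result as an estimate rather than a precise test:

    /usr/ucb/ps -uaxl | awk '
        NR > 1     { total++ }
        /nc_rele_/ { stuck++ }
        END { printf "%d of %d processes waiting on nc_rele_\n", stuck, total }'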
Also, Sun has a BugId for this problem, 4212925. As of August 2, they
have a patch for 2.6, numbered 105720-08. It seems to do a lot
of things, but the explanation of it isn't as revealing as we'd like,
so we're a little leery of putting it into production without extensive
testing. We're playing with it, but can't comment on its efficacy at
this time. It might hash the DNLC table (that would be the right solution,
and word is this will happen in 2.8), but for all I know it may just revert
to 2.5.1's DNLC math.
Summary:
If you're running Solaris 2.6 or 2.7 in a high performance environment where
a lot of files are being deleted over NFS, make sure your DNLC is not
too large or you'll have HUGE problems. Trust me.
Hope this helps someone.
--
Nick Christenson
npc(a)sendmail.com