Recently, we encountered a serious problem in a high performance NFS environment using large Solaris servers as NFS clients. Due to the fantastic assistance of several people who have elected to not be named, we have identified the problem and a resolution for it. Since others on this list are likely to use Solaris servers as NFS clients in high performance environments, it was thought that this information might be generally useful.
Situation and Symptoms:
We were running on E-class Suns using Solaris 2.6 with what we thought were well considered kernel tuning parameters (cranked up nfs:nfs_max_threads, maxusers, and rlim_fd_{max,cur}). The servers connected to NetApp F760s over Sun's Gigabit Ethernet interfaces through a high quality Gb Ethernet switch. When we turned the application on, we got stunningly low throughput numbers and couldn't figure out why.
Client CPU load was light: most of the processing power was idle (and not in wait state), memory utilization was low, the filers were running at low utilization, and there were no network or I/O bottlenecks in evidence. We were running two applications that do reads, writes, appends, creates, and deletes on relatively small files over NFS.
One of these applications (both have fairly similar characteristics and perform computations on the same file sets) was running just fine. The other ran well at light loads but had horrible problems as the load went up. It appeared that some fixed resource was being consumed: once we went over a certain load threshold, the number of processes grew rapidly while the amount of work being done remained constant. Eventually, these extra processes exhausted main memory and the machine began to thrash.
Solution:
Since we don't have access to Sun source, we can't be 100% certain of what was happening, but this is the best information we have. We think what we describe here is exactly what's going on, but there might be some minor variations. If someone here knows Sun internals, maybe they can fill in the gaps.
Basically, it seems that we were having problems with the Directory Name Lookup Cache (DNLC) in Solaris. Between Solaris 2.5.1 and Solaris 2.6, the formula for its default size changed. Here's the math, as best I know it:
2.5.1 DNLC size = (max_nprocs + 16 + maxusers) + 64
2.6   DNLC size = 4 * (max_nprocs + maxusers) + 320
According to Cockcroft's "Sun Performance and Tuning" book,
max_nprocs = 10 + 16 * maxusers
We don't know if this calculation changed between 2.5.1 and 2.6.
Wanting large numbers of processes and large buffer sizes, we set maxusers=2048. This means that for Solaris 2.5.1, the DNLC size would be 34906 (or so) and for Solaris 2.6, the same maxusers variable would yield a size of 139624.
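If you want to check the arithmetic against your own maxusers setting, here's a quick sketch of the calculation. It assumes the two ncsize formulas above and Cockcroft's max_nprocs rule hold for your release, so treat it as a back-of-the-envelope check rather than gospel, and verify the real value on a live system if you can:

    # Back-of-the-envelope DNLC sizing, using the formulas quoted above.
    # Assumes max_nprocs = 10 + 16 * maxusers and the 2.5.1/2.6 ncsize
    # math as given.

    def dnlc_sizes(maxusers):
        max_nprocs = 10 + 16 * maxusers
        size_251 = (max_nprocs + 16 + maxusers) + 64    # Solaris 2.5.1
        size_26 = 4 * (max_nprocs + maxusers) + 320     # Solaris 2.6
        return max_nprocs, size_251, size_26

    nproc, s251, s26 = dnlc_sizes(2048)
    print("max_nprocs   =", nproc)    # 32778
    print("2.5.1 ncsize =", s251)     # 34906
    print("2.6   ncsize =", s26)      # 139624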
Now, as we understand it, this is a LINEAR lookup table. Further, when deleting a file, an entry in this table must be looked up and locked. This was the distinction between our two applications: one does a single delete before completion, the other does three. With a table this size, there seems to be a hard limit on the number of deletes per second one can perform over NFS, and we hit that limit. We put "set ncsize = 8192" in /etc/system, rebooted, and the problem went away. We played with sizes ranging from 4096 to 32768 and saw no huge performance difference (8192 SEEMED best, 32768 SEEMED worst, but that was a VERY subjective evaluation), and we saw no significant difference in our DNLC cache hit rate as measured by "vmstat -s".
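For keeping an eye on the DNLC hit rate while experimenting with ncsize, here's the kind of check we mean, written as a small Python sketch. It assumes the Solaris "vmstat -s" output includes a "total name lookups (cache hits NN%)" line, which is what our systems print; adjust the pattern if your release formats it differently:

    # Pull the DNLC hit rate out of `vmstat -s`.
    # Assumes a line of the form "... total name lookups (cache hits NN%)";
    # tweak the regular expression if yours looks different.
    import re
    import subprocess

    out = subprocess.run(["vmstat", "-s"], capture_output=True, text=True).stdout
    m = re.search(r"total name lookups \(cache hits (\d+)%\)", out)
    if m:
        print("DNLC hit rate: " + m.group(1) + "%")
    else:
        print("couldn't find the name lookup line; check vmstat -s by hand")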
Additional Information:
One way to check whether you're hitting this problem is to run "/usr/ucb/ps -uaxl" and look at the WCHAN column. To the best of our knowledge, Sun doesn't publish a translation guide to these event names, but we have it on good authority that if you see one called "nc_rele_" (a truncation of "nc_rele_lock"), the process is waiting for DNLC entries to become unlocked. Note: on healthy machines processes will sometimes show up in this state, but if a significant percentage of them are in this state, that may indicate this problem. No, I can't accurately define "significant percentage". I doubt 1% is a problem; I know that 30% or higher is a problem.
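To put a rough number on that percentage, something like the following will do. It's deliberately crude: it just counts ps output lines that mention nc_rele rather than parsing the WCHAN column properly, which is close enough for spotting 30%-style trouble:

    # Rough check: what fraction of processes are waiting on nc_rele_?
    # Crude by design -- it greps whole ps lines instead of parsing the
    # WCHAN column, but that's plenty to spot a machine in trouble.
    import subprocess

    lines = subprocess.run(["/usr/ucb/ps", "-uaxl"],
                           capture_output=True, text=True).stdout.splitlines()
    procs = lines[1:]                 # skip the header line
    stuck = [l for l in procs if "nc_rele" in l]
    if procs:
        pct = 100.0 * len(stuck) / len(procs)
        print("%d of %d processes (%.1f%%) waiting on nc_rele_"
              % (len(stuck), len(procs), pct))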
Also, Sun has a BugId for this problem, 4212925. As of August 2, they have a patch for 2.6, numbered 105720-08. It seems to do a lot of things, but the explanation of it isn't as revealing as we'd like, so we're a little leery of putting it into production without extensive testing. We're playing with it, but can't comment on its efficacy at this time. It might hash the DNLC table (that would be the right solution, and word is this will happen in 2.8), but for all I know it may just revert to 2.5.1's DNLC math.
Summary:
If you're running Solaris 2.6 or 2.7 in a high performance environment where a lot of files are being deleted over NFS, make sure your DNLC is not too large or you'll have HUGE problems. Trust me.
Hope this helps someone.
On Tue, 10 Aug 1999, Nick Christenson wrote:
> Also, Sun has a BugId for this problem, 4212925. As of August 2, they have
> a patch for 2.6, numbered 105720-08. [...] It might hash the DNLC table
> (that would be the right solution, and word is this will happen in 2.8),
> but for all I know it may just revert to 2.5.1's DNLC math.
We went through this a few months ago, trying to migrate from Solaris 2.5.1 to 2.6. The larger ncsize in 2.6 is, in fact, the heart of the problem. In Solaris 2.5.1, ncsize maxes out at 17498. In Solaris 2.6, ncsize maxes out above 60000.
The Solaris NFS client in both of these versions does a linear scan of the cache every time you delete a file, looking for other entries which may have been links to the deleted file. The DNLC is hashed by pathname, not by vnode.
As of 105720-08, the NFS client checks the link count on the file being deleted. If the link count is 1, the linear scan is skipped. If the link count is greater than 1, the linear scan is still performed.
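To make the difference concrete, here's a toy model of the two delete paths in Python. This is our reading of the behavior, not Sun's actual code; the dnlc dict and the function names are made up for illustration:

    # Toy model of the DNLC delete path. Entries are keyed by (directory,
    # name), standing in for the pathname hash; the expensive part is the
    # walk over every entry on unlink, hunting for other hard links to the
    # same vnode.

    dnlc = {}   # (directory, name) -> vnode id; imagine ncsize = 139624 entries

    def unlink_old(directory, name):
        # Pre-105720-08 behavior (as we understand it): drop this name, then
        # scan the entire cache for other names referencing the same vnode,
        # so every delete costs O(ncsize) whether or not hard links exist.
        vnode = dnlc.pop((directory, name), None)
        if vnode is not None:
            for key in [k for k, v in dnlc.items() if v == vnode]:
                del dnlc[key]

    def unlink_patched(directory, name, link_count):
        # 105720-08 behavior: if the link count was 1, no other name can
        # refer to this vnode, so the O(ncsize) scan is skipped entirely.
        vnode = dnlc.pop((directory, name), None)
        if vnode is not None and link_count > 1:
            for key in [k for k, v in dnlc.items() if v == vnode]:
                del dnlc[key]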
We have been using a patch previous to 105720-08 (maybe it was 105720-07, I don't know -- they just mailed me a file called "nfs" that I stuck in /kernel/fs :) in production for several months, and there is a definite performance boost in our environment using this workaround, with no visible side effects. We also reduced the ncsize without seeing a reduction in the DNLC hit rate, and saw an additional performance boost there. That's evidence suggesting that there is a performance hit for having a large DNLC, even after getting rid of that pesky linear scan.
--
Alf Mikula, IT Systems Administrator          It's Your Internet!
Earthlink Network                             Alf.Mikula@corp.earthlink.net
(626) 296-5515
I am getting ready to convert from an F220 and an F230 to an F740. Does anyone have any recommendations on how to copy all of the data from the two old filers to the new one? I know I can use dump and restore, but I was wondering if maybe "vol copy" would be more efficient. Can I use "vol copy" to copy the entire contents of two existing filers to a single new one? If so, can it merge the contents into a single volume, or should I create multiple volumes?
Vol copy will copy an entire volume to another volume that is the same size or larger. It will not merge two volumes. You can do things like vol copy the larger volume and dump/restore the smaller one if you want all the data in the same place.
Mike Federwisch
"David H. Brierley" wrote:
> I know I can use dump and restore, but I was wondering if maybe "vol copy"
> would be more efficient. Can I use "vol copy" to copy the entire contents
> of two existing filers to a single new one? [...]
I've used vol copy on one migration. It worked well and was very fast. I've used ndmpcopy for the migration of a dozen or so filers and I like it a bit better, mainly because it is much more flexible as far as where the data is placed on the destination filer. ndmpcopy is slightly slower than vol copy, but not by enough to sway my opinion that it is the way to go.
Graham