Executive summary: Our core problem appears to be that an NFS load consisting of many file creation/removal operations eventually causes our Solaris 2.6 and Solaris 8 servers to perform very poorly. The culprit seems to be kernel R/W lock contention on the lock 'nfs_acache_lock' from the caller 'nfs_access_check'. What is unclear is what causes the lock contention; it seems to build over time, eventually killing the performance of the server. Our current workaround is to reboot the server every morning, or whenever the problem occurs. See the details below for further information.
We recently moved /var/mail from local storage on a Solaris 2.6 server to our F760, and ever since we have been suffering serious performance problems on the Solaris server. We have about 16,000 users and deliver about 100,000 - 150,000 messages a day. It is important to note that we have a few other Solaris servers that communicate with our filer, and they do not experience any problems. It really seems to be related to the high volume of traffic generated by moving /var/mail over to the filer. Note that we use dot-locking only and therefore are not troubled by the well-known problems with fcntl-based locking over NFS.
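For context, the dot-locking pattern looks roughly like the sketch below (the mailbox path and retry policy are illustrative, not our actual delivery code); the key point is that every delivery does an exclusive 0-byte file create and a remove over NFS:

    #!/bin/ksh
    # Illustrative dot-lock sequence only; MAILBOX is a made-up example path.
    MAILBOX=/var/mail/jdoe
    LOCK=$MAILBOX.lock

    # Exclusive create: with noclobber set, the redirect fails if the lock
    # file already exists, i.e. an O_CREAT|O_EXCL style create over NFS.
    until ( set -o noclobber; : > "$LOCK" ) 2>/dev/null; do
        sleep 1                        # someone else holds the lock; retry
    done

    cat /tmp/message >> "$MAILBOX"     # append the message while locked

    rm -f "$LOCK"                      # release: remove the 0-byte lock file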
It is worth noting that we experienced a similar problem a few months ago with a web-based email package we wrote. The email package uses form-based authentication (as opposed to browser-based) and kept the relevant session data in a flat file (about 200 bytes). Originally we had all 18,000 files in a single directory. We tried hashing the directory so that we had about 300 files per directory, but it did not make a difference. We then moved to using a MySQL database to store the session data, and the problem went away. Before moving to the database implementation we tried two different physical machines (both with the characteristics described below) as well as Solaris 2.6 and Solaris 8, none of which made a difference.
It seems very clear that the symptom is brought on by a heavy load of file creation and removal. We are not sure if file size is an issue, but it could be given the size of our session files and the fact that dot-lock files are 0 bytes.
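To take the mailer out of the picture, a crude load generator along these lines should reproduce the create/remove pattern against a scratch directory on the filer (the mount point and iteration count are placeholders for illustration):

    #!/bin/ksh
    # Crude NFS create/remove load generator; /mnt/filer/scratch is a
    # hypothetical test directory on a filer-backed mount.
    DIR=/mnt/filer/scratch
    i=0
    while [ $i -lt 100000 ]; do
        : > "$DIR/lock.$$.$i"        # create a 0-byte file, like a dot-lock
        rm -f "$DIR/lock.$$.$i"      # remove it again immediately
        i=`expr $i + 1`
    done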
We do have an open ticket with Network Appliance case# 176124 but so far we haven't made much progress with it.
On to the details ...
Filer: F760, Release 5.3.6R1

Solaris 2.6 box: AXmp hardware, 4 x 360MHz UltraSPARC-II CPUs w/ 4MB cache, 2GB RAM

Mount options: vers=2/proto=udp/rsize=32768/wsize=32768/noac/remote
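For completeness, those options correspond to a mount invocation roughly like the one below (the filer name and paths are placeholders; as far as I can tell the 'remote' flag is just something the system records in mnttab, not an option we pass). Worth noting: noac disables attribute caching, which is the usual advice for /var/mail consistency but also forces far more attribute/access traffic to the filer.

    # Illustrative only; 'filer' and the paths below are placeholders.
    mount -F nfs -o vers=2,proto=udp,rsize=32768,wsize=32768,noac \
        filer:/vol/vol0/mail /var/mail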
Our server tends to perform well for about 24-30 hours, depending on load. At some point after that we start seeing bad performance. We have gathered the following statistics:
From the Solaris box: iostat, mpstat, netstat -k hme0, netstat -k, netstat -s, nfsstat -m, nfsstat -cnr, vmstat, vmstat -i, vmstat -s, top -n -dinfinity -s60 0

From the filer: netstat -m, netstat -s, nfsstat -d, nfsstat -l, statit, sysstat
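For anyone who wants the same data, we gather the client-side numbers with a small wrapper along these lines (the script and the output directory are just an illustration of how we collect them, not anything standard):

    #!/bin/ksh
    # Illustrative stats collector; /var/tmp/server_stats is an arbitrary spot.
    OUT=/var/tmp/server_stats/`date +%Y%m%d.%H%M`
    mkdir -p "$OUT"

    iostat -x 60 5   > "$OUT/iostat"  &      # 5 samples, 60s apart
    mpstat 60 5      > "$OUT/mpstat"  &
    vmstat 60 5      > "$OUT/vmstat"  &

    netstat -k hme0  > "$OUT/netstat-k-hme0"
    netstat -s       > "$OUT/netstat-s"
    nfsstat -m       > "$OUT/nfsstat-m"
    nfsstat -cnr     > "$OUT/nfsstat-cnr"
    wait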
The only place I saw anything out of the ordinary in the above stats, comparing the system working properly vs. having problems, was the output of top and mpstat. In case anyone wants to see them, I have made all of the stats available at

http://static-content-01.communitytelephone.net/~hill/server_stats
Good performance:

last pid:   505;  load averages:  2.26,  2.24,  2.29                 14:30:04
407 processes: 393 sleeping, 3 running, 9 zombie, 2 on cpu
CPU states: 32.1% idle, 40.0% user, 27.9% kernel,  0.0% iowait,  0.0% swap
Memory: 2048M real, 444M free, 1000M swap free
CPU minf mjf  xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
  0  516   0   505   281  159  615   23   55   17    1  1970   17  12   4  67
  1  298   0  8056   125    0  810   21   63   31    1  1974   13  21   3  62
  2  579   0  2056   313  100  717   15   61   33    0  6216   12  12   2  74
  3  991   0   832   871  721  547   31   64   40    1  4804   18  24   5  53
Bad performance:

last pid:  8130;  load averages: 21.64, 18.25, 13.56                 18:13:05
315 processes: 301 sleeping, 6 running, 6 zombie, 2 on cpu
CPU states:  1.4% idle, 28.3% user, 69.9% kernel,  0.4% iowait,  0.0% swap
Memory: 2048M real, 93M free, 18M swap in use, 982M swap free
CPU minf mjf  xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
  0  608   0  1943   703  366  898  180  131   24  309  2861   24  75   0   2
  1  512   0  3161   336    0  936  172  140   18  330  3033   26  72   0   2
  2  633   0 10496   515  100  963  168  125   34  330  2712   23  76   0   2
  3  607   0  1442  1262  953  883  164  139   31  291  5426   30  68   0   2
You can see from the mpstat output that icsw, srw, usr, and sys have quite different characteristics between the two machine states. I found the high srw numbers particularly interesting, and I believe they are responsible for the change in the icsw, usr, and sys numbers (i.e. when there is a lot of lock contention the sys CPU utilization goes up, which in turn could cause the involuntary context switches (icsw) to increase).
I ran 'lockstat sleep 30' during a bad performance period to try and determine the cause of the high srw numbers. Here is what I saw
R/W writer blocked by writer: 135 events
Count indv cuml rcnt      nsec Lock             Caller
-------------------------------------------------------------------------
  103  76%  76% 1.00 271500476 nfs_acache_lock  nfs_access_cache+0x1c
   23  17%  93% 1.00 311148519 nfs_acache_lock  nfs_access_purge_rp+0x24
    3   2%  96% 1.00   7275646 0x61a87758       ufs_remove+0x11c
    2   1%  97% 1.00  17863392 0x61a87758       ufs_rename+0x2c4
    2   1%  99% 1.00  19876013 0x61a87758       ufs_create+0x1a8
    1   1%  99% 1.00     46461 tim_list_rwlock  tim_addlink+0x64
    1   1% 100% 1.00   8152217 0x61a87758       ufs_link+0x13c
-------------------------------------------------------------------------
R/W writer blocked by readers: 1356 events
Count indv cuml  rcnt      nsec Lock             Caller
-------------------------------------------------------------------------
 1135  84%  84% 13.47 142070011 nfs_acache_lock  nfs_access_cache+0x1c
  218  16% 100% 10.52 100639598 nfs_acache_lock  nfs_access_purge_rp+0x24
    2   0% 100%  1.00    603387 0x61a87758       ufs_rename+0x2c4
    1   0% 100%  1.00    121137 0x61a87758       ufs_rename+0x25c
-------------------------------------------------------------------------
R/W reader blocked by writer: 1608 events
Count indv cuml rcnt      nsec Lock             Caller
----------------------------------------------------------------------
 1600 100% 100% 1.00 103571147 nfs_acache_lock  nfs_access_check+0x18
    7   0% 100% 1.00  13694146 0x61a87758       ufs_dirlook+0xc0
    1   0% 100% 1.00   5391663 0x61a86e28       ufs_dirlook+0xc0
----------------------------------------------------------------------
R/W reader blocked by write wanted: 36530 events
Count indv cuml  rcnt     nsec Lock             Caller
----------------------------------------------------------------------
36530 100% 100% 17.19 55578316 nfs_acache_lock  nfs_access_check+0x18
----------------------------------------------------------------------
For grins I did the same while the machine was performing okay.

R/W writer blocked by writer: 11 events
Count indv cuml rcnt    nsec Lock             Caller
----------------------------------------------------------------------
    4  36%  36% 1.00   34507 nfs_acache_lock  nfs_access_cache+0x1c
    3  27%  64% 1.00   41125 kas+0x1c         as_map+0x6c
    2  18%  82% 1.00   94435 kas+0x1c         as_unmap+0x48
    1   9%  91% 1.00  238723 0x61b7f720       write+0x84
    1   9% 100% 1.00 8324704 0x618283a0       ufs_create+0x1a8
----------------------------------------------------------------------
R/W writer blocked by readers: 37 events
Count indv cuml rcnt   nsec Lock             Caller
----------------------------------------------------------------------
   24  65%  65% 1.04 114407 nfs_acache_lock  nfs_access_cache+0x1c
   10  27%  92% 1.00  93271 kas+0x1c         as_unmap+0x48
    2   5%  97% 1.00 101398 kas+0x1c         as_map+0x6c
    1   3% 100% 1.00 171407 0x618283a0       ufs_rename+0x2c4
----------------------------------------------------------------------
R/W reader blocked by writer: 61 events
Count indv cuml rcnt     nsec Lock             Caller
----------------------------------------------------------------------
   53  87%  87% 1.00   295492 nfs_acache_lock  nfs_access_check+0x18
    4   7%  93% 1.00    78438 kas+0x1c         as_fault+0x300
    2   3%  97% 1.00  8945909 0x61a45214       ufs_getattr+0x50
    1   2%  98% 1.00 35317881 0x61a45208       ufs_dirlook+0xc0
    1   2% 100% 1.00  3329355 0x618283a0       ufs_dirlook+0xc0
----------------------------------------------------------------------
R/W reader blocked by write wanted: 3 events
Count indv cuml rcnt   nsec Lock             Caller
----------------------------------------------------------------------
    3 100% 100% 1.00 164763 nfs_acache_lock  nfs_access_check+0x18
----------------------------------------------------------------------
From the percentages it seems that nfs_access_check/nfs_acache_lock is the dominant player in the srw lock contention.
What I don't understand is what causes the drastic change in lock contention over time. I've spent a few hours on sunsolve.sun.com and docs.sun.com looking for information on nfs_access_check/nfs_acache_lock and have come up empty. I have also gone through Sun Performance and Tuning, Second Edition, and the toasters mailing list with no luck.
At this point I'm not sure if there is a NetApp incompatibility with Solaris, or if there is a bug in the Solaris NFS client implementation. If anyone out there has any ideas I would be extremely grateful.
btw: we have made the following changes to /etc/system with no noticeable change in system behaviour.
* A few kernel params are related to maxusers, which by default is a few
* less than the amount of RAM in the system in MB, with a default limit of
* 1024. By adjusting this up, certain other values are also adjusted, which
* is the goal here. pp. 359, Sun Performance Tuning; but the decision to
* increase it came from the Sun System Tuning white paper, pp. 69.
set maxusers=1536

* These settings are to help tune NFS performance to the NetApp. -JPH
* 12.19.00, per toasters article subject: NFS client tuning, problem and
* resolution / Date: Tue, 10 Aug 1999 / From: Nick Christenson
* npc@sendmail.com
* set ncsize=8192
set ncsize=6144

* It appears that the increase in this number has resulted in mutex
* contention viewable by mpstat (smtx) and lockstat
*set nfs:nfs_nra=10
*set nfs:nfs3_nra=10

* Tweaks from SE and Sun Performance Tuning. JPH 12.25.00
* http://www.sun.com/sun-on-net/performance/priority_paging.html
set priority_paging=1

* Connection hash table size, default 256, increased to make hash lookups
* more efficient
set tcp:tcp_conn_hash_size=8192

* inode cache size - can view with 'netstat -k inode_cache'
* ideally this should be 4/3 the size of ncsize
set ufs_ninode=8192

* Things tweaked by virtual_adrian from the SE toolkit
* There is also precedent for this in Sun Performance Tuning and
* Solaris Internals, pp. 599
set slowscan=500
set maxpgio=25468
set fastscan=64306
set handspreadpages=30960
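For what it's worth, this is roughly how we check that the settings actually took effect after a reboot (the adb incantation is the one described in Sun Performance and Tuning; treat the exact commands as a sketch, and on Solaris 8 mdb -k with the same expressions should work):

    #!/bin/ksh
    # Read the live kernel values of a few of the tunables set above.
    for v in maxusers ncsize ufs_ninode slowscan fastscan maxpgio; do
        echo "$v/D" | adb -k /dev/ksyms /dev/mem 2>/dev/null | tail -1
    done

    # Inode cache statistics, as mentioned in the comment above:
    netstat -k inode_cache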
--Jamie Hill
----- Original Message -----
From: hill@communitytelephone.com
To: toasters@mathworks.com
Cc: access-systems@communitytelephone.net
Sent: Tuesday, January 02, 2001 3:00 PM
Subject: Bad Solaris performance over time
> It really seems to be related to the high volume of traffic generated by
> moving /var/mail over to the filer. Note that we use dot-locking only and
> therefore are not troubled by the well-known problems with fcntl-based
> locking over NFS.
What problems, other than buggy lock managers that no longer exist in Solaris 2.6?
Bruce
>> It really seems to be related to the high volume of traffic generated by
>> moving /var/mail over to the filer. Note that we use dot-locking only and
>> therefore are not troubled by the well-known problems with fcntl-based
>> locking over NFS.
>
> What problems, other than buggy lock managers that no longer exist in
> Solaris 2.6?
Umm... unless I'm waaaaay off base, these numbers seem outrageously low. I'm kinda boggled. I was going to let this pass, but no, I must comment.
In 1995-6, even with buggy lock managers, we ran the mail for a certain large ISP (~25,000 users) here in town. We pumped an average of 450,000 mail messages a day through a pair of dual-85Mhz Sparc20's running SunOS 4.1.4, connected via FDDI to a single F330. <Pause. Allow that to sink in. 85Mhz processors. *SunOS 4.1.4*. FDDI. A single F330 w/8MB of NVRAM. Those are museum pieces nowadays.>
One box handled incoming SMTP and all local delivery (sendmail+procmail) and POP, the other handled some unholy number of Majordomo lists. I mean, this was with hardware that nowadays people scoff at, running ancient software releases and NFS V2 over a single network interface. I find it impossible to believe that a 4-way UltraII and an F760 can't handle the load. (Yeah, the average mail message with all the crap "modern" mailers add on is probably several Kb larger, but that's almost never the real bottleneck.)
Some questions that may be a bit obvious:

- Is your 2.6 box patched to current levels?
- Why NFS V2?
- How much NVRAM in the filer?
- Is the network bogged down, or the switch misconfigured?
- How many network interfaces, of what type/speed?
- What does snoop/tcpdump/your network sniffer say?
- How many disks in the volume that contains /var/mail?
- Is this filer running CIFS & NFS, or is it purely an NFS box?
- If /var/mail is on the filer, what about home directories (i.e., is your mailer bogging down doing ".forward" lookups)?
- Are you using NIS and/or DNS on the filer?
- What version of sendmail/qmail/whatever?
- Are you using procmail for delivery? Why not? :-)
- Is disk scrubbing kicking in? Are you taking snapshots too frequently?
- Do you have cron jobs walking the filesystem, or other network clients hitting the filer or the network, when these slowdowns are observed on the mail server?
- And really, why NFS V2?
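Several of those you can answer straight from a shell; something like the following is where I'd start (the interface and filer names below are placeholders, and the filer commands are from memory, so double-check them):

    # On the Solaris client:
    showrev -p | tail -5               # patch level
    nfsstat -m                         # confirms NFS version, proto, timers
    netstat -i                         # interface errors / collisions
    snoop -d hme0 -c 1000 host filer   # quick look at the wire traffic

    # On the filer:
    sysconfig -a                       # hardware config, incl. NVRAM
    sysconfig -r                       # disks and RAID layout of the volume
    snap sched                         # snapshot schedule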
(I'm not really a smartass, I just play one on the Internet. :-) These are just some of the questions I'd look to answer before poking at /etc/system. Given that I've seen Sun+Netapp equipment two generations older doing three times the volume you're seeing, I can't believe there isn't some other underlying problem plaguing your setup. It's precisely because we pounded the snot out of that old gear and it handled the load that I still recommend and use it to this day. (Well, not SunOS 4.1.4, R.I.P., nor is there enough money in the world to convince me to ever, ever work for an ISP again! :-) I'm very curious to know what the solution turns out to be.
-- Chris
--
Chris Lamb, Unix Guy
MeasureCast, Inc.
503-241-1469 x247
skeezics@measurecast.com
----- Original Message -----
From: hill@communitytelephone.com
To: toasters@mathworks.com
Cc: access-systems@communitytelephone.net
Sent: Tuesday, January 02, 2001 3:00 PM
Subject: Bad Solaris performance over time
The only thing I note as interesting is:
> * It appears that the increase in this number has resulted in mutex
> * contention viewable by mpstat (smtx) and lockstat
> *set nfs:nfs_nra=10
> *set nfs:nfs3_nra=10
Not sure what happens with them commented out completely, but that value seems excessive. I'm not sure what the default is (I have heard both 4 and 8). I used to run servers with the following settings:
set nfs:nfs_nra = 5
set nfs:nfs3_nra = 5
set nfs:nfs_max_threads = 12
set nfs:nfs3_max_threads = 12
However, the "state of the art" thinking on these parameters may be different today.
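One way to settle what the default actually is on a given box is to read the live values out of the running kernel once the nfs module is loaded; a sketch (I have not re-verified the exact incantations recently):

    # Solaris 2.6:
    echo "nfs_nra/D"  | adb -k /dev/ksyms /dev/mem
    echo "nfs3_nra/D" | adb -k /dev/ksyms /dev/mem

    # Solaris 8:
    echo "nfs_nra/D"  | mdb -k
    echo "nfs3_nra/D" | mdb -k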
Bruce