Apache 1.2.5 on a Solaris 2.5.1 Sparc box connected to a Netapp 630 filer.
Problem symptom: Web server established a connection, but would not respond to requests.
Looking at the parent httpd process we see that it is in a normal "wait for something to do" loop:
0.000000 sigsuspend([] <unfinished ...>
0.415977 --- SIGALRM (Alarm Clock) ---
0.000214 <... sigsuspend resumed> ) = -1 EINTR (Interrupted system call) <0.415636>
0.000272 setcontext({uc_sigmask=[ALRM], ...}) = ? <0.000113>
0.000525 alarm(0) = 0 <0.000083>
0.000389 sigprocmask(SIG_UNBLOCK, [ALRM], NULL) = 0 <0.000087>
0.000475 sigaction(SIGALRM, {SIG_DFL}, NULL) = 0 <0.000086>
0.000464 waitid(P_ALL, 0, {si_signo=0, si_code=SI_USER, si_pid=0, si_uid=0, ...}, WNOHANG|WEXITED|WTRAPPED) = 0 <0.000091>
0.000499 alarm(0) = 0 <0.000084>
0.000349 sigaction(SIGALRM, {0xef5b8cfc, [], 0}, {SIG_DFL}) = 0 <0.000094>
0.000570 sigprocmask(SIG_BLOCK, [ALRM], []) = 0 <0.000090>
0.000577 alarm(1) = 0 <0.000086>
0.000351 sigsuspend([] <unfinished ...>
0.995342 --- SIGALRM (Alarm Clock) ---
0.000184 <... sigsuspend resumed> ) = -1 EINTR (Interrupted system call) <0.995330>
0.000248 setcontext({uc_sigmask=[ALRM], ...}) = ? <0.000107>
0.000517 alarm(0) = 0 <0.000083>
Looking at the spawned httpd children we see:
0.000000 fcntl(21, F_SETLKW, {type=F_WRLCK, whence=SEEK_SET, start=0, len=0}
And they got there by (the stack dump prints arguments in hex, so fd 15 here is the fd 21 above):
ef5b6fd8 fcntl (15, 7, 5cfcc)
ef5b6fd8 _libc_fcntl (15, 7, 5cfcc, 0, 0, 0) + 8
ef78aba8 s_fcntl (15, 7, 5cfcc, ef611938, 34, effff944) + 164
00019138 accept_mutex_on (62d08, 1, 0, effffa78, effffa88, 2) + 18
0001a874 child_main (6c170, 5d000, 5d000, 5cc00, 5d000, 5b000) + 1f8
0001ad1c make_child (61880, 2, 61400, 11, 0, 5cfc9) + d4
0001b3a0 standalone_main (effffb78, 5cf48, 5f400, 5cc00, 75, 5f7b0) + 2a4
0001b920 main (3, effffc94, effffca4, 5f3f4, 1, 0) + 2a0
00017b48 _start (0, 0, 0, 0, 0, 0) + 5c
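For anyone who has not looked at this part of Apache: the frames above show each child going through accept_mutex_on and ending up in a blocking fcntl() write lock on the lock file. Here is a minimal Python sketch of that serialization pattern. It is not Apache's actual C code; the function names simply mirror the stack trace, the helper try_lock_in_child and the demo lock path are made up for illustration, and a POSIX system is assumed.

```python
import fcntl
import os
import tempfile

def accept_mutex_on(fd):
    # Blocking exclusive lock over the whole file. On POSIX systems
    # lockf() is implemented with fcntl() record locks, so this waits
    # in the same way the F_SETLKW call in the trace does.
    fcntl.lockf(fd, fcntl.LOCK_EX)

def accept_mutex_off(fd):
    # Release the lock so the next child can go on to accept().
    fcntl.lockf(fd, fcntl.LOCK_UN)

def try_lock_in_child(path):
    # Hypothetical helper: fork a child that attempts a NON-blocking
    # lock and report whether it got it. fcntl locks are per-process,
    # so a child contends with its parent like sibling httpd children.
    pid = os.fork()
    if pid == 0:
        fd = os.open(path, os.O_WRONLY)
        try:
            fcntl.lockf(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
            os._exit(0)   # got the lock
        except OSError:
            os._exit(1)   # lock is held by another process
    _, status = os.waitpid(pid, 0)
    return os.WEXITSTATUS(status) == 0

# Hypothetical demo lock file in a local temporary directory.
lock_path = os.path.join(tempfile.mkdtemp(), "accept.lock")
lock_fd = os.open(lock_path, os.O_WRONLY | os.O_CREAT, 0o644)

accept_mutex_on(lock_fd)
held_result = try_lock_in_child(lock_path)      # child is refused
accept_mutex_off(lock_fd)
released_result = try_lock_in_child(lock_path)  # child gets the lock
os.close(lock_fd)
```

In the hang described here, every child is stuck in the equivalent of accept_mutex_on: the blocking request never returns, so no child ever reaches accept().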
Ok, so what are we locked on?
Parent process says:
Current rlimit: 92 file descriptors
   0: S_IFCHR mode:0620 dev:32,0 ino:162863 uid:100 gid:7 rdev:24,2
      O_RDWR
   1: S_IFCHR mode:0620 dev:32,0 ino:162863 uid:100 gid:7 rdev:24,2
      O_RDWR
   2: S_IFCHR mode:0620 dev:32,0 ino:162863 uid:100 gid:7 rdev:24,2
      O_RDWR
   4: 0xd000 mode:0444 dev:164,0 ino:8952 uid:0 gid:0 size:0
      O_RDONLY close-on-exec
  15: S_IFCHR mode:0000 dev:32,0 ino:26208 uid:0 gid:0 rdev:42,8491
      O_RDWR
  16: S_IFREG mode:0644 dev:162,1 ino:2924135 uid:0 gid:1 size:13387
      O_WRONLY|O_APPEND
  17: S_IFCHR mode:0000 dev:32,0 ino:6664 uid:0 gid:0 rdev:42,8911
      O_RDWR
  18: S_IFREG mode:0644 dev:162,1 ino:4842016 uid:0 gid:1 size:61789
      O_WRONLY|O_APPEND
  19: S_IFREG mode:0644 dev:162,1 ino:4842022 uid:0 gid:1 size:4551030
      O_WRONLY|O_APPEND
  20: S_IFREG mode:0644 dev:162,1 ino:4842019 uid:0 gid:1 size:29387
      O_WRONLY|O_APPEND
  21: S_IFREG mode:0644 dev:162,1 ino:4842021 uid:0 gid:102 size:0
      O_WRONLY
      advisory write lock set by system 0x7FFF process 2135
Inode 4842021, the file behind fd 21, is:
4842021 -rw-r--r-- 1 root devel 0 Feb 3 15:34 logs/.nfsCAC
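The listing above looks like ls -li style output; the exact command used to map the inode back to a name is not given. As a rough illustration of that lookup, here is a small Python sketch that walks a directory tree and returns the paths with a given inode number. The function name find_by_inode and the demo file are hypothetical, and since inode numbers are only unique within one filesystem, the search should be pointed at the right mount.

```python
import os
import tempfile

def find_by_inode(root, inode):
    # Walk the tree under root and collect paths whose inode number
    # matches. lstat() is used so symlinks are not followed.
    matches = []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                if os.lstat(path).st_ino == inode:
                    matches.append(path)
            except OSError:
                # File vanished or is unreadable; skip it.
                continue
    return matches

# Demo on a throwaway directory with one hypothetical file.
demo_dir = tempfile.mkdtemp()
demo_file = os.path.join(demo_dir, "demo.lock")
open(demo_file, "w").close()
demo_inode = os.stat(demo_file).st_ino
found = find_by_inode(demo_dir, demo_inode)
```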
So, for some reason the accept mutex mechanism is no longer working: the children block waiting for a write lock on a file that lives on the filer, and the lock is never granted. There are no lock manager error messages from the NetApp, but this server, which has been running fine for months, will suddenly no longer lock. Other servers sitting right next to it with the same configuration lock fine. It would be nice to determine why it stopped working, but the solution is to add the following line to the Apache httpd.conf file:
LockFile /tmp/accept.lock
That will keep the lockfile on the local disk. This is probably a good idea for anybody running Apache off of a NetApp. It may save you some future headaches.
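If you have a number of servers to check, something like the following Python sketch can report whether a config sets LockFile at all. It is only an illustration: lockfile_setting is a hypothetical helper, the sample config text is made up, and it assumes that a later LockFile line overrides an earlier one.

```python
def lockfile_setting(conf_text):
    # Return the argument of the last LockFile directive, or None if
    # the directive is absent. Directive names are matched without
    # regard to case, and lines starting with '#' are comments.
    value = None
    for line in conf_text.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue
        parts = stripped.split(None, 1)
        if parts[0].lower() == "lockfile" and len(parts) == 2:
            value = parts[1].strip().strip('"')
    return value

# Hypothetical httpd.conf excerpt; the ServerRoot path is invented.
sample_conf = """\
ServerRoot /usr/local/apache
# LockFile logs/accept.lock
LockFile /tmp/accept.lock
"""

configured = lockfile_setting(sample_conf)
missing = lockfile_setting("ServerRoot /usr/local/apache\n")
```

A result of None means the server is using its built-in default location, which is the case worth checking when the server root lives on the filer.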
-Rasmus