Apache/Netapp problem + solution (fwd) - toasters

5 Feb 1998


      What version of ONTAP are you using on the NetApp?
----------------------------------------------------------------
     David R. Van Sandt, dv@corp.earthlink.net, 626/296-5137
       Earthlink Network, Lead System Administrator
    	    "It's Your Internet"
----------------------------------------------------------------
---------- Forwarded message ----------
Date: Wed, 4 Feb 1998 13:39:14 -0500 (EST)
From: rasmus@bellglobal.com
To: toasters@mathworks.com
Subject: Apache/Netapp problem + solution
Setup: Apache 1.2.5 on a Solaris 2.5.1 SPARC box connected to a NetApp 630
filer.
Problem symptom: Web server established a connection, but would not
respond to requests.
Looking at the parent httpd process we see that it is in a normal
"wait for something to do" loop:
     0.000000 sigsuspend([] <unfinished ...>
     0.415977 --- SIGALRM (Alarm Clock) ---
     0.000214 <... sigsuspend resumed> ) = -1 EINTR (Interrupted system call) <0.415636>
     0.000272 setcontext({uc_sigmask=[ALRM], ...}) = ? <0.000113>
     0.000525 alarm(0)                  = 0 <0.000083>
     0.000389 sigprocmask(SIG_UNBLOCK, [ALRM], NULL) = 0 <0.000087>
     0.000475 sigaction(SIGALRM, {SIG_DFL}, NULL) = 0 <0.000086>
     0.000464 waitid(P_ALL, 0, {si_signo=0, si_code=SI_USER, si_pid=0, si_uid=0, ...}, WNOHANG|WEXITED|WTRAPPED) = 0 <0.000091>
     0.000499 alarm(0)                  = 0 <0.000084>
     0.000349 sigaction(SIGALRM, {0xef5b8cfc, [], 0}, {SIG_DFL}) = 0 <0.000094>
     0.000570 sigprocmask(SIG_BLOCK, [ALRM], []) = 0 <0.000090>
     0.000577 alarm(1)                  = 0 <0.000086>
     0.000351 sigsuspend([] <unfinished ...>
     0.995342 --- SIGALRM (Alarm Clock) ---
     0.000184 <... sigsuspend resumed> ) = -1 EINTR (Interrupted system call) <0.995330>
     0.000248 setcontext({uc_sigmask=[ALRM], ...}) = ? <0.000107>
     0.000517 alarm(0)                  = 0 <0.000083>
Looking at the spawned httpd children, we see each one waiting in a
blocking lock call:
0.000000 fcntl(21, F_SETLKW, {type=F_WRLCK, whence=SEEK_SET, start=0, len=0}
And they got there by:
 ef5b6fd8 fcntl    (15, 7, 5cfcc)
 ef5b6fd8 _libc_fcntl (15, 7, 5cfcc, 0, 0, 0) + 8
 ef78aba8 s_fcntl  (15, 7, 5cfcc, ef611938, 34, effff944) + 164
 00019138 accept_mutex_on (62d08, 1, 0, effffa78, effffa88, 2) + 18
 0001a874 child_main (6c170, 5d000, 5d000, 5cc00, 5d000, 5b000) + 1f8
 0001ad1c make_child (61880, 2, 61400, 11, 0, 5cfc9) + d4
 0001b3a0 standalone_main (effffb78, 5cf48, 5f400, 5cc00, 75, 5f7b0) + 2a4
 0001b920 main     (3, effffc94, effffca4, 5f3f4, 1, 0) + 2a0
 00017b48 _start   (0, 0, 0, 0, 0, 0) + 5c
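For anyone unfamiliar with what that stack is doing: accept_mutex_on ends in
a blocking fcntl write lock over the whole lock file, so only one child at a
time gets to go on to accept(). A minimal sketch of that pattern follows. It
is not Apache's C source; the class name AcceptMutex and its method names are
hypothetical, and it assumes a POSIX system.

```python
import fcntl
import os


class AcceptMutex:
    """Hypothetical illustration of a whole-file advisory lock used as a mutex."""

    def __init__(self, path):
        # Each child opens the lock file for writing; a write lock needs a
        # descriptor opened with write access.
        self.fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o644)

    def on(self):
        # Blocking exclusive lock from offset 0, length 0 (the whole file).
        # Without LOCK_NB this waits until the lock is granted, like the
        # F_SETLKW / F_WRLCK request seen in the trace.
        fcntl.lockf(self.fd, fcntl.LOCK_EX, 0, 0, os.SEEK_SET)

    def off(self):
        # Release the lock so another waiting process can proceed.
        fcntl.lockf(self.fd, fcntl.LOCK_UN, 0, 0, os.SEEK_SET)

    def close(self):
        os.close(self.fd)
```

A child would call on() before accept() and off() right after it. If the lock
request is never granted, every child sits in on() forever, which matches the
symptom above.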
Ok, so what are we locked on?
The parent process's open file descriptors:
Current rlimit: 92 file descriptors
   0: S_IFCHR mode:0620 dev:32,0 ino:162863 uid:100 gid:7 rdev:24,2
      O_RDWR
   1: S_IFCHR mode:0620 dev:32,0 ino:162863 uid:100 gid:7 rdev:24,2
      O_RDWR
   2: S_IFCHR mode:0620 dev:32,0 ino:162863 uid:100 gid:7 rdev:24,2
      O_RDWR
   4: 0xd000  mode:0444 dev:164,0 ino:8952 uid:0 gid:0 size:0
      O_RDONLY close-on-exec
  15: S_IFCHR mode:0000 dev:32,0 ino:26208 uid:0 gid:0 rdev:42,8491
      O_RDWR
  16: S_IFREG mode:0644 dev:162,1 ino:2924135 uid:0 gid:1 size:13387
      O_WRONLY|O_APPEND
  17: S_IFCHR mode:0000 dev:32,0 ino:6664 uid:0 gid:0 rdev:42,8911
      O_RDWR
  18: S_IFREG mode:0644 dev:162,1 ino:4842016 uid:0 gid:1 size:61789
      O_WRONLY|O_APPEND
  19: S_IFREG mode:0644 dev:162,1 ino:4842022 uid:0 gid:1 size:4551030
      O_WRONLY|O_APPEND
  20: S_IFREG mode:0644 dev:162,1 ino:4842019 uid:0 gid:1 size:29387
      O_WRONLY|O_APPEND
  21: S_IFREG mode:0644 dev:162,1 ino:4842021 uid:0 gid:102 size:0
      O_WRONLY
      advisory write lock set by system 0x7FFF process 2135
Inode 4842021 is:
4842021 -rw-r--r--   1 root     devel          0 Feb  3 15:34 logs/.nfsCAC
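Mapping an inode number from the descriptor listing back to a path is a
plain directory walk. A minimal sketch, where find_by_inode is a hypothetical
helper; inode numbers are only unique within one filesystem, so point it at a
directory on the right device.

```python
import os


def find_by_inode(root, inode):
    """Return paths under root whose inode number equals inode."""
    matches = []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            full = os.path.join(dirpath, name)
            try:
                # lstat so that symlinks are reported as themselves
                if os.lstat(full).st_ino == inode:
                    matches.append(full)
            except OSError:
                # Entry vanished or is unreadable; skip it
                continue
    return matches
```

Calling find_by_inode("logs", 4842021) would be the equivalent of the lookup
shown above.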
So the children are all queued on a write lock for a zero-length file in the
logs directory on the filer, and for some reason the accept queue mutex
mechanism is no longer working.  There are no lock manager error messages
from the NetApp, but this server, which had been running fine for months,
suddenly will no longer lock.  Other servers sitting right next to it with
the same configuration lock fine.  It would be nice to determine why it
stopped working, but the solution is to add the following line to the Apache
httpd.conf file:
LockFile /tmp/accept.lock
That keeps the lock file on the local disk.  It is probably a good idea for
anybody running Apache off of a NetApp, and it may save you some future
headaches.
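When a server stops locking like this, or before settling on a lock file
location, it can help to check whether a given path grants fcntl locks at
all. A minimal sketch, assuming a POSIX system; can_lock is a hypothetical
helper, not part of Apache.

```python
import errno
import fcntl
import os


def can_lock(path):
    """Try a non-blocking exclusive lock on path; True if it was granted."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o644)
    try:
        try:
            # LOCK_NB asks for an immediate answer instead of waiting
            fcntl.lockf(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except OSError as exc:
            if exc.errno in (errno.EACCES, errno.EAGAIN):
                # Another process already holds a conflicting lock
                return False
            raise
        fcntl.lockf(fd, fcntl.LOCK_UN)
        return True
    finally:
        os.close(fd)
```

For example, can_lock("/tmp/accept.lock") checks the path from the LockFile
line. A False result only means some other process holds the lock right now.
If the lock manager is not answering, even the non-blocking request may
stall, so it is worth running the probe under a timeout.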
-Rasmus