Has anyone seen a Solaris 2.5.1 (103640-29, in this case) NFS client
block on a stat(), read() or fcntl() call, but only for one particular
file (or a small number of files)? We've seen this happen on rare
occasions in the past. For no apparent reason, though, a plague has
hit our web server farm in the past week or so. We've had five or six
instances recently where the only solution was to reboot the server.

What happens is that a seemingly random file sitting on the Netapp
can no longer be read, stat'd or locked by the Solaris client. There
is no suspicious lock activity. Even an innocent "cat" or "ls -l" on
the file (triggering one of the above syscalls) will hang. A truss
shows that the syscall never returns. Other Solaris clients mounting
the same filesystems have no problems accessing those files.
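
For anyone who wants to check a suspect file without wedging their
shell, here is a minimal sketch of a probe. It assumes Python 3 is
available on the client (nothing above says it is), and the names
probe and _PROBE are made up for illustration. It runs stat() and a
one-byte read() in a throwaway child process and gives up after a
timeout, so the hang lands on the child rather than on you.

```python
import subprocess
import sys

# Program run in the child: stat() the file, then read one byte.
# These are two of the calls reported to hang; fcntl() is not covered.
_PROBE = (
    "import os, sys\n"
    "p = sys.argv[1]\n"
    "os.stat(p)\n"
    "with open(p, 'rb') as f:\n"
    "    f.read(1)\n"
)


def probe(path, timeout=5.0):
    """Return 'ok', 'error' or 'hung' for path (hypothetical helper)."""
    child = subprocess.Popen(
        [sys.executable, "-c", _PROBE, path],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    try:
        rc = child.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        # Best-effort kill. Do not wait() afterwards: a child stuck
        # inside the kernel may never exit, and waiting would hang us too.
        child.kill()
        return "hung"
    return "ok" if rc == 0 else "error"
```

A result of "hung" only says the calls did not return within the
timeout; it does not say why. The parent deliberately does not wait
for the child after a timeout, since a process stuck in the kernel
may ignore the kill, as the httpd processes described below do.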

The reboot is necessary because there is always at least one
Apache httpd process (the first one to see this problem?) that becomes
unkillable, so port 80 is never released and a new parent cannot be
started. After a reboot, everything returns to normal, but not for
long: one of our servers had to be rebooted this morning, and eight
hours later it was wedged again trying to access that same file, plus
another one on the same NFS filesystem.

At this point, I'm blaming Solaris (easy target ;-)), but I have
no hard evidence that would incriminate the OS, the application, the
network, or the filer. Rebooting the server seems to fix the problem
(at least temporarily), suggesting an OS bug. We've never seen this
happen with anything besides Apache, suggesting an application bug.
However, the three most recent "problem" files have all been on the
same exported filesystem on the same Netapp. Filer bug?
--
Brian Tao (BT300, taob(a)risc.org)
"Though this be madness, yet there is method in't"