Has anyone experienced a Solaris 2.5.1 (103640-29, in this case) NFS client block on a stat(), read() or fcntl() call only to a particular file (or small number of files)? We've seen this happen on rare occasions in the past. For no apparent reason though, a plague has hit our web server farm in the past week or so. We've had five or six instances recently where the only solution was to reboot the server.
What happens is that a seemingly random file sitting on the Netapp can no longer be read, stat'd or locked by the Solaris client. There is no suspicious lock activity. Even an innocent "cat" or "ls -l" on the file (triggering one of the above syscalls) will hang. A truss shows that the syscall never returns. Other Solaris clients mounting the same filesystems have no problems accessing those files.
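For anyone wanting to check a suspect file without wedging their login shell, here is a rough sketch of a probe that runs the command in the background and gives up after a timeout. The function name `probe` and its arguments are made up for illustration, not an existing tool. Note that if the client really is stuck in the kernel (as with the unkillable httpd), the `kill` will probably not remove the blocked process either; the point is only that the calling shell stays usable.

```shell
#!/bin/sh
# probe TIMEOUT COMMAND [ARGS...]   (hypothetical helper, not a standard tool)
# Runs COMMAND in the background and polls once a second.
# Returns 0 if it finishes within TIMEOUT seconds, 1 if it is still running.
probe() {
    timeout=$1
    shift
    # Discard the command's own output; only the verdict below is printed.
    "$@" >/dev/null 2>&1 &
    pid=$!
    elapsed=0
    # kill -0 sends no signal; it only tests whether the process still exists.
    while kill -0 "$pid" 2>/dev/null; do
        if [ "$elapsed" -ge "$timeout" ]; then
            echo "HUNG after ${timeout}s (pid $pid): $*"
            # Best effort only: a process blocked in an NFS call may ignore this.
            kill "$pid" 2>/dev/null
            return 1
        fi
        sleep 1
        # expr and backticks so this also runs under an old Bourne shell.
        elapsed=`expr "$elapsed" + 1`
    done
    echo "OK: $*"
    return 0
}
```

Something like `probe 5 ls -l /path/to/suspect/file` would then print either an OK or a HUNG line (the path is a placeholder). "OK" only means the command returned, not that it succeeded.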
The reboot is necessary because there is always at least one Apache httpd process (the first one to see this problem?) that becomes unkillable, thus port 80 is never unbound, and a new parent cannot be started. After a reboot, everything returns to normal. One of our servers had to be rebooted this morning. Eight hours later, it was wedged trying to access that same file, plus another one on the same NFS filesystem.
At this point, I'm blaming Solaris (easy target ;-)), but I have no hard evidence that would incriminate the OS, the application, the network, or the filer. Rebooting the server seems to fix the problem (at least temporarily), suggesting an OS bug. We've never seen this happen with anything besides Apache, suggesting an application bug. However, the three most recent "problem" files have all been on the same exported filesystem on the same Netapp. Filer bug?
Try killing and restarting your lock daemon, rather than rebooting the server. I realize that what you're describing shouldn't rely on that, but it's something you might try (I used to see the same thing with gopher processes).
Bruce
On Tue, 19 Sep 2000, Bruce Sterling Woodcock wrote:
> Try killing and restarting your lock daemon, rather than rebooting the server. I realize that what you're describing shouldn't rely on that, but it's something you might try (I used to see the same thing with gopher processes).
Yup, tried that already (with expected results from looking at a lock_dump on the Netapp)... no go. I haven't caught anything in the act of trying an open() on a "stuck" file yet, but I'm guessing that will hang too. The stat() that typically precedes an application opening the file does hang.
"Brian" == Brian Tao taob@risc.org writes:
> On Tue, 19 Sep 2000, Bruce Sterling Woodcock wrote:
> > Try killing and restarting your lock daemon, rather than rebooting the server. I realize that what you're describing shouldn't rely on that, but it's something you might try (I used to see the same thing with gopher processes).
>
> Yup, tried that already (with expected results from looking at a lock_dump on the Netapp)... no go. I haven't caught anything in the act of trying an open() on a "stuck" file yet, but I'm guessing that will hang too. The stat() that typically precedes an application opening the file does hang.
> -- Brian Tao (BT300, taob@risc.org) "Though this be madness, yet there is method in't"
Brian -
I typically stop and restart both statd and lockd when I see NFS problems such as you report. These two daemons are closely related.
Perhaps the best way to do this is:
% sh /etc/init.d/nfs.client stop
% sh /etc/init.d/nfs.client start
You may also need to clean out /var/statmon (statd state files) in between the stop and start. I have not had to do this recently.
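Put together, the stop / clean / start sequence might look something like the sketch below. It is a dry run by default and only prints what it would do; set RUN=1 to actually execute. The `sm` and `sm.bak` subdirectory names are an assumption about how statd lays out /var/statmon, and the `run` and `restart_nfs_client` names are made up here, so check your own system before trusting any of it.

```shell
#!/bin/sh
# Sketch only: restart the NFS client daemons and clear statd state in between.
# Dry run unless RUN=1 is set in the environment.
STATMON=${STATMON:-/var/statmon}
INIT=${INIT:-/etc/init.d/nfs.client}

# run CMD...: execute CMD when RUN=1, otherwise just print it.
run() {
    if [ "${RUN:-0}" = "1" ]; then
        "$@"
    else
        echo "would run: $*"
    fi
}

restart_nfs_client() {
    run sh "$INIT" stop
    # Assumption: statd keeps its per-host files under sm/ and sm.bak/.
    for d in "$STATMON/sm" "$STATMON/sm.bak"; do
        if [ -d "$d" ]; then
            for f in "$d"/*; do
                # Skip the unexpanded pattern when the directory is empty.
                if [ -f "$f" ]; then
                    run rm -f "$f"
                fi
            done
        fi
    done
    run sh "$INIT" start
}
```

Calling `restart_nfs_client` as root with RUN unset lists the commands first, which seems a sensible check before doing it for real on a production web server.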
-- Quentin Fennessy Quentin.Fennessy@amd.com Voice: 512.602.3873 Pager: 512.622.6316