This is similar to a problem we have been experiencing with increasing frequency. We run large batch jobs that compile our software. During these batch jobs we get "file not found" or "make: no rule to make target" type errors where we can prove the files existed on the filer when the Suns reported them missing. We don't have a "hang", just a failure to stat().
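One way to catch the failing calls in the act would be to run the build under truss; a rough sketch (the target name and output path are placeholders):

    # Log the stat-family syscalls made by make and all of its children;
    # failures show up in the output as Err#2 ENOENT.
    # On Solaris 7 and later, add stat64,lstat64 to the list.
    truss -f -t stat,lstat -o /tmp/build-truss.out make all

    # Afterwards, list the failed lookups and the paths they hit:
    grep ENOENT /tmp/build-truss.out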
The problem has manifested itself on both our Solaris 2.5.1 and Solaris 8 systems. The filer is an F760 (5.3.5R2P2). The load on the filer when this happens is typically 10,000 ops/sec or greater. NetApp has asked for packet traces, but we are talking gigabit interfaces here; the trace files are huge, and "pktt" is dropping about 80% of the packets anyway.
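If pktt can't keep up, one option would be to capture on the Solaris side instead, filtered to a single filer and truncated to headers so the files stay manageable. A sketch; "ge0" and "toaster" are placeholders for your interface and filer name:

    # Capture only traffic to/from the filer on port 2049, 256 bytes per packet
    snoop -d ge0 -s 256 -o /tmp/nfs-trace.cap host toaster and port 2049

    # Inspect the capture later, verbosely
    snoop -i /tmp/nfs-trace.cap -v | more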
We have already cut "nfs.udp.xfersize" to 8K and are running out of ideas. I am getting hauled in front of management on a regular basis to explain how I am going to make the problem go away. Now all I need is a solution.....
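For what it's worth, the transfer size (and transport) can also be forced from the client end, which at least tells you whether the filer-side xfersize setting is actually taking effect. A sketch; the filer name and paths are placeholders:

    # Mount with 8K transfers over TCP instead of UDP, as an experiment
    mount -F nfs -o vers=3,proto=tcp,rsize=8192,wsize=8192 toaster:/vol/src /src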
What type of filer do you have, and what version of code is it running? I know it's obvious, but have you checked for duplex problems? If you find a solution, I would really like to hear about it.
Graydon Dodson
(606) 232-6483
grdodson@lexmark.com
Lexmark International Inc.
On Tue, 19 Sep 2000, Graydon Dodson wrote:
> During these batch jobs we get "file not found" or "make: no rule to make target" type errors where we can prove the files existed on the filer when the Suns reported them missing. We don't have a "hang", just a failure to stat().
Yeah, that would be a different kind of "bad". ;-) In our case, an Apache httpd instance that hangs on a stat() is unkillable, and thus takes up one process slot. Eventually all slots are filled and Apache is unable to service more requests. You can't kill off the parent process and restart, since port 80 cannot be unbound. The only thing to do is reboot.
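When this happens, the Solaris proc tools can at least confirm where the process is wedged and why the port stays bound. A sketch, with 1234 standing in for the stuck httpd's pid:

    # User-level stack; a stat() frame confirms the hang.
    # Note pstack may itself block if the process can't be stopped.
    /usr/proc/bin/pstack 1234

    # Open files and sockets, including the one still bound to port 80
    /usr/proc/bin/pfiles 1234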
> The problem has manifested itself on both our Solaris 2.5.1 and Solaris 8 systems.
Even Solaris 8, eh? Ugh. :(
> We have already cut "nfs.udp.xfersize" to 8K and are running out of ideas. I am getting hauled in front of management on a regular basis to explain how I am going to make the problem go away. Now all I need is a solution.....
It has become a daily occurrence here, starting about two weeks ago, and we have not been able to correlate any change in that timeframe that would explain the increase in frequency. Same Solaris patch level, same Sparc hardware, same NetApps, same Data ONTAP, same network switches. We even lessened the load on the servers by adding more of them and spreading customers around (this is in a shared web and mail hosting facility). I have also tried changing our mounts to NFSv2 to see if that makes a difference.
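For reference, a vfstab line to pin a mount at NFSv2 would look roughly like this (server, volume, and mount point are placeholders):

    #device           fsck  mount  FS   pass  boot  options
    toaster:/vol/web  -     /web   nfs  -     yes   vers=2,proto=udp,rsize=8192,wsize=8192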
> What type of filer do you have, and what version of code is it running? I know it's obvious, but have you checked for duplex problems? If you find a solution, I would really like to hear about it.
The network is fine... duplex and speed match, and there are no errors on the filer or Sparc interfaces. We are running two F740s (clustered) on release 5.3.4.
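In case anyone else wants to double-check the same thing: on the Solaris side, ndd reports what the interface negotiated (hme shown here; the gigabit drivers use somewhat different parameter names), and on the filer, ifstat shows per-interface error counters:

    # Solaris: select the interface instance, then read the link state
    ndd -set /dev/hme instance 0
    ndd -get /dev/hme link_speed   # 1 = 100Mbit, 0 = 10Mbit
    ndd -get /dev/hme link_mode    # 1 = full duplex, 0 = half

    # Filer: per-interface statistics, including errors
    ifstat -a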