guy@netapp.com (Guy Harris) writes:
You're not describing the full scenario here - the only operations you mention are a write from one client and an unspecified operation from a different client that would get an error. What's the rest of the scenario?
The most common scenario where this occurs here is for ELF executables. It may happen in other places but escape notice.
It may be generally useful for NetApp to publish all of the known cases where different (legitimate to the NFS specification) errors can be returned from Network Appliance servers. For example, what set of circumstances (or client bugs) can lead to ESTALE.
Another time, we had a problem running "ls -l" from client "A" on a binary that had been overwritten from client "B" while one copy of the older version was already running on "A". It was still in the directory cache, but half of the stat would fail (the directory cache lookup would give a nonexistent inode number, and the getattr would return an I/O error).
If a file that one client has open is removed while that client still has it open, and a new file is created on the server with the same inode number as the file that was removed, then any operation the first client performs on it via the file descriptor it has open will get ESTALE - provided the operation actually goes over the wire to the server, which e.g. a "read()" or "stat()" might not if it can be satisfied from a cache on the client. This happens on most if not all UNIX NFS servers, as well as on the filer.
Yes.
The line
fatal_error("Execv failed", strerror(errno));
should probably be changed to something such as
fatal_error("Execv failed: ", strerror(errno));
or "fatal_error()" should be changed to add the ": ", to make the error message look more reasonable.
I agree. It's not my code, but I said about the same thing to the author. :-)
The Linux kernel on that host also logged the error, showing:
Apr 11 00:00:49 kernel: nfs_revalidate_inode: bin/xx getattr failed, ino=4072292, error=-116
It'd be interesting to see a network trace of the NFS traffic between the client and the server, to see:

1) what file handle was used to refer to the file on a successful call;

2) what file handle was used to refer to the file on the unsuccessful call, if the latter call went over the wire.
Newer versions of the Linux kernel log this information. Unfortunately, the sheer volume of NFS traffic and the relative infrequency of errors make capturing it infeasible.
It may just be that the client is sending a bad file handle over the wire.
A possibility.
Thanks for all the info.
Dan
You're not describing the full scenario here - the only operations you mention are a write from one client and an unspecified operation from a different client that would get an error. What's the rest of the scenario?
The most common scenario where this occurs here is for ELF executables. It may happen in other places but escape notice.
That still doesn't describe the full scenario. If somebody merely writes to a file - in the sense of "does a 'write()' to the file" - that will *NOT* cause ESTALE to occur on other clients - there's no reason for it to occur.
If the full scenario has "writes to a file" meaning "unlinks the file, creates a new file with the same name, and writes to *that* file", that's the scenario I described in earlier mail.
It may be generally useful for NetApp to publish all of the known cases where different (legitimate to the NFS specification) errors can be returned from Network Appliance servers.
"Different" in the sense of "different from, say, a Sun server"?
For example, what set of circumstances (or client bugs) can lead to ESTALE.
Unfortunately, enumerating the client bugs would be difficult - the creativity of NFS client writers in finding new ways to screw up an implementation is, from my experience, unbounded above. :-)
Another time, we had a problem running "ls -l" from client "A" on a binary that had been overwritten from client "B" while one copy of the older version was already running on "A".
I presume "overwritten" here means "removed and rewritten", as per the above.
It was still in the directory cache, but half of the stat would fail (the directory cache lookup would give a nonexistent inode number, and the getattr would return an I/O error).
If the client is assuming that a file with inode number XXXX (as returned by an NFS "getattr" call) is the same file as a file that had inode number XXXX on the same mount from the same client at some point in the past, the client is buggy - it cannot validly assume that, given the scenario I outlined in earlier mail.
The file's inode number won't change, but the generation number will change (and the generation number is *NOT* a value exported by the NFS protocol, it's a server implementation detail, so it's not available to the client unless it "cheats" and makes assumptions about the layout of the file handle - assumptions subject to change without notice if you either
1) use a different NFS server
or
2) upgrade the software on the same NFS server, as per my note on the change in Data ONTAP 5.0).
The file that had the inode number before the remove-and-recreate is not the same file as the one that has that inode number after the remove-and-recreate - and attempts to use the old file's file handle will get ESTALE.