guy@netapp.com (Guy Harris) writes:
You're not describing the full scenario here - the only operations you mention are a write from one client and an unspecified operation from a different client that would get an error. What's the rest of the scenario?
The most common scenario where this occurs here is for ELF executables. It may happen in other places but escape notice.
It may be generally useful for NetApp to publish all of the known cases where different errors (legitimate under the NFS specification) can be returned from Network Appliance servers. For example, what set of circumstances (or client bugs) can lead to ESTALE?
Another time, we had a problem running "ls -l" from client "A" on a binary that had been overwritten from client "B" while one copy of the older version was already running on "A". It was still in the directory cache, but half of the stat would fail (the directory cache lookup would give a nonexistent inode number, and the getattr would return an I/O error).
Suppose a client has a file open, the file is removed while that client still has it open, and a new file is created on the server with the same inode number as the file that was removed. If the client then tries to perform some operation on the file via the file descriptor it has open, it will get ESTALE on most if not all UNIX NFS servers, as well as on the filer. That assumes the operation goes over the wire to the server, which e.g. a "read()" or "stat()" might not if it can be satisfied from a cache on the client.
Yes.
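For what it's worth, here is a rough sketch of how a program might cope with that case on the client side. It is only an illustration: fstat_reopen_on_stale() is a made-up helper name, and it assumes the program still knows the path it opened and is willing to accept whatever file now lives at that path.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper (not from any library): fstat() an open descriptor;
 * if the server reports the file handle as stale, reopen the file by name
 * and retry once. Returns 0 on success, -1 on failure with errno set. */
int fstat_reopen_on_stale(int *fd, const char *path, struct stat *st)
{
    if (fstat(*fd, st) == 0)
        return 0;
    if (errno != ESTALE)
        return -1;              /* some other failure; leave errno alone */

    /* The old handle no longer names the file we opened, so look it up again. */
    int fresh = open(path, O_RDONLY);
    if (fresh < 0)
        return -1;
    close(*fd);
    *fd = fresh;
    return fstat(*fd, st);
}
```

Note that the reopened descriptor may refer to a different file than the one originally opened, so this only makes sense where "whatever is at that path now" is an acceptable answer.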
The line
fatal_error("Execv failed", strerror(errno));
should probably be changed to something such as
fatal_error("Execv failed: ", strerror(errno));
or "fatal_error()" should be changed to add the ": ", to make the error message look more reasonable.
I agree. It's not my code, but I said about the same thing to the author. :-)
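A minimal sketch of the second option, where the helper supplies the separator itself. The real fatal_error() is not shown in this thread, so the two-argument signature is an assumption, and format_fatal() is a hypothetical helper added only so the formatting can be checked separately.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: write "msg: detail" into buf.
 * Returns what snprintf() returns (the length the full string needs). */
int format_fatal(char *buf, size_t size, const char *msg, const char *detail)
{
    return snprintf(buf, size, "%s: %s", msg, detail);
}

/* Assumed shape of fatal_error(); the original is not shown in the thread. */
void fatal_error(const char *msg, const char *detail)
{
    char line[512];

    format_fatal(line, sizeof line, msg, detail);
    fprintf(stderr, "%s\n", line);
    exit(EXIT_FAILURE);
}
```

With that change the call site can stay as fatal_error("Execv failed", strerror(errno)); and the message comes out as "Execv failed: " followed by the strerror() text.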
The Linux kernel on that host also logged the error, showing:
Apr 11 00:00:49 kernel: nfs_revalidate_inode: bin/xx getattr failed, ino=4072292, error=-116
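The kernel logs errors as negated errno values, so the error=-116 above can be decoded in user space. On Linux for most architectures 116 is ESTALE, but the number is not the same everywhere, so the sketch below compares against the ESTALE macro rather than the literal. The helper names are made up for illustration.

```c
#include <errno.h>
#include <string.h>

/* Hypothetical helper: the kernel logs errors as negative errno values,
 * so negate the value before looking up its message text. */
const char *describe_kernel_error(int kernel_err)
{
    return strerror(-kernel_err);
}

/* Returns 1 if a logged kernel error value means "stale file handle". */
int is_stale_handle(int kernel_err)
{
    return -kernel_err == ESTALE;
}
```

On a client where ESTALE is 116, is_stale_handle(-116) returns 1, which matches the ESTALE reported by the failed exec.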
It'd be interesting to see a network trace of the NFS traffic between the client and the server, to see:
  - what file handle was used to refer to the file on a successful call;
  - what file handle was used to refer to the file on the unsuccessful call, if the latter call went over the wire.
Newer versions of the Linux kernel log this information. Unfortunately, the sheer volume of NFS traffic and the relative infrequency of errors make capturing a network trace infeasible.
It may just be that the client is sending a bad file handle over the wire.
A possibility.
Thanks for all the info.
Dan