hitz@netapp.com (Dave Hitz) writes:
The ESTALE error simply means that the server couldn't find any file for a given file handle.
If a client writes to a file, but the inode number does not change, would a different client receive ESTALE or a different error?
Anyway, the reason I'm asking all this is because I'm at a loss to explain the following sequence of events. First, a daemon "xx" failed as follows:
Sat Apr 11 00:00:48 1998 xx: (INFO) Removing old directories
Sat Apr 11 00:00:49 1998 xx: (INFO) Done removing old directories
Sat Apr 11 00:00:49 1998 xx: (INFO) Daemon is using 7024 Kb (limit is 7000) - restarting
Sat Apr 11 00:00:49 1998 xx: (INFO) next instruction is execv(...self...)
Sat Apr 11 00:00:49 1998 xx: (SEVERE ERROR) Execv failedStale NFS file handle
Sat Apr 11 00:00:49 1998 xx: (INFO) Dropping core in directory /xxx/yyy
The Linux kernel on that host also logged the error, showing:
Apr 11 00:00:49 kernel: nfs_revalidate_inode: bin/xx getattr failed, ino=4072292, error=-116
Unfortunately, this version of the kernel doesn't show the "before" and "after" NFS filehandles, but the server did return ESTALE (error=-116 in Linux) according to the kernel. (The NetApp server logs don't show anything interesting around this time.)
The weird part is that all three timestamps for "xx" precede April 11 00:00:49 by a wide margin.
$ ls -ali /foo/bar/bin/xx*
4072292 -rwx------ [...] Apr 10 13:11 /foo/bar/bin/xx
1161604 -rwx------ [...] Apr 10 11:22 /foo/bar/bin/xx.old
Here is the code sequence that was running on the "xx" daemon when it received ESTALE when it tried to re-execute itself (it had been running since April 8).
------- start of cut text --------------
void re_exec_daemon(void)
{
    char buf[TMP_BUF_SIZE];

    disconnect();

    if (close(Admin_sock) == -1) {
        sprintf(buf, "Can't close admin socket (%s)", strerror(errno));
        log_mesg(ERROR, buf);
    }

    if (close(User_sock) == -1) {
        sprintf(buf, "Can't close user socket (%s)", strerror(errno));
        log_mesg(ERROR, buf);
    }

    /* jump to directory we started from */
    if (chdir(Invoke_dir) == -1) {
        sprintf(buf, "Can't change to directory %s", Invoke_dir);
        fatal_error(buf, strerror(errno));
    }

    /* make sure sockets are closed */
    (void) close(Admin_sock);
    (void) close(User_sock);

    log_mesg(INFO, "next instruction is execv(...self...)");

    /* Re-execute us */
    execv(Invoke_command[0], Invoke_command);

    /* should not get here */
    fatal_error("Execv failed", strerror(errno));
}
------- end ----------------------------
Dan
The ESTALE error simply means that the server couldn't find any file for a given file handle.
If a client writes to a file, but the inode number does not change, would a different client receive ESTALE or a different error?
You're not describing the full scenario here - the only operations you mention are a write from one client and an unspecified operation from a different client that would get an error. What's the rest of the scenario?
Suppose one client has a file open, the file is removed while that client still has it open, and a new file is then created on the server with the same inode number as the file that was removed. If the client then tries to perform some operation on the file via the file descriptor it has open for it, it will get ESTALE on most if not all UNIX NFS servers, as well as on the filer, provided the operation actually goes over the wire to the server (a "read()" or "stat()", for example, might not if it can be satisfied from a cache on the client).
That's because files in most UNIXes, these days, have a "generation count" stored in the inode; if a file with a given inode is removed and some new file later gets that inode, the new file will be given a different generation number. The file handle includes both inode number and generation number, and if the file with that inode number doesn't have exactly that generation number, an NFS request that tries to refer to that file gets ESTALE.
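The generation check described above can be sketched roughly as follows. This is only an illustration: the "struct fhandle" and "struct inode" layouts and the "fh_check()" and "inode_reuse()" names are hypothetical, and real file handles are opaque to the client and carry more than these two fields.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical, simplified file handle; real handles are opaque and larger. */
struct fhandle {
    uint32_t inum;        /* inode number the handle refers to */
    uint32_t generation;  /* generation count when the handle was issued */
};

/* Hypothetical server-side inode, reduced to the fields that matter here. */
struct inode {
    uint32_t inum;        /* inode number */
    uint32_t generation;  /* bumped each time the inode is reused */
    int      in_use;      /* nonzero if a file currently occupies the inode */
};

/*
 * Return 0 if the handle still names the file living in this inode,
 * or ESTALE if the inode is free or has been reused by a newer file.
 */
int fh_check(const struct fhandle *fh, const struct inode *ip)
{
    if (!ip->in_use || ip->inum != fh->inum)
        return ESTALE;
    if (ip->generation != fh->generation)
        return ESTALE;
    return 0;
}

/* Give a freed inode to a new file: same inode number, new generation. */
void inode_reuse(struct inode *ip)
{
    ip->generation++;
    ip->in_use = 1;
}
```

In this sketch, a handle issued before "inode_reuse()" fails the check afterwards even though the inode number is unchanged, which is the situation in which a client holding the old handle would see ESTALE.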
Sat Apr 11 00:00:49 1998 xx: (SEVERE ERROR) Execv failedStale NFS file handle
The line
fatal_error("Execv failed", strerror(errno));
should probably be changed to something such as
fatal_error("Execv failed: ", strerror(errno));
or "fatal_error()" should be changed to add the ": ", to make the error message look more reasonable.
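As a sketch of the second option, the separator could be added in one place when the message is built. The body of "fatal_error()" wasn't posted, so the "format_error()" helper below is hypothetical and only shows the formatting step, not the logging or the core dump.

```c
#include <stddef.h>
#include <stdio.h>

/*
 * Hypothetical helper: join a description and an error string with ": ".
 * Returns what snprintf() returns, i.e. the length the full message needs,
 * so the caller can detect truncation by comparing against outlen.
 */
int format_error(char *out, size_t outlen, const char *what, const char *why)
{
    return snprintf(out, outlen, "%s: %s", what, why);
}
```

With this, a caller passing "Execv failed" and the strerror() text would get "Execv failed: Stale NFS file handle" rather than the run-together form in the log.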
The Linux kernel on that host also logged the error, showing:
Apr 11 00:00:49 kernel: nfs_revalidate_inode: bin/xx getattr failed, ino=4072292, error=-116
It'd be interesting to see a network trace of the NFS traffic between the client and the server, to see
1) what file handle was used to refer to the file on a successful call;
2) what file handle was used to refer to the file on the unsuccessful call if the latter call went over the wire.
It may just be that the client is sending a bad file handle over the wire.