hitz@netapp.com (Dave Hitz) writes:
The ESTALE error simply means that the server couldn't find any file for a given file handle.
If a client writes to a file, but the inode number does not change, would a different client receive ESTALE or a different error?
Anyway, I'm asking because I'm at a loss to explain the following sequence of events. First, a daemon "xx" failed as follows:
Sat Apr 11 00:00:48 1998 xx: (INFO) Removing old directories
Sat Apr 11 00:00:49 1998 xx: (INFO) Done removing old directories
Sat Apr 11 00:00:49 1998 xx: (INFO) Daemon is using 7024 Kb (limit is 7000) - restarting
Sat Apr 11 00:00:49 1998 xx: (INFO) next instruction is execv(...self...)
Sat Apr 11 00:00:49 1998 xx: (SEVERE ERROR) Execv failedStale NFS file handle
Sat Apr 11 00:00:49 1998 xx: (INFO) Dropping core in directory /xxx/yyy
The Linux kernel on that host also logged the error, showing:
Apr 11 00:00:49 kernel: nfs_revalidate_inode: bin/xx getattr failed, ino=4072292, error=-116
Unfortunately, this version of the kernel doesn't show the "before" and "after" NFS filehandles, but according to the kernel the server did return ESTALE (error=-116 in Linux). (The NetApp server logs don't show anything interesting around this time.)
The weird part is that all three timestamps for "xx" precede April 11 00:00:49 by a wide margin.
$ ls -ali /foo/bar/bin/xx*
4072292 -rwx------ [...] Apr 10 13:11 /foo/bar/bin/xx
1161604 -rwx------ [...] Apr 10 11:22 /foo/bar/bin/xx.old
Here is the code the "xx" daemon was running when it tried to re-execute itself and received ESTALE (it had been running since April 8).
------- start of cut text --------------
void re_exec_daemon(void)
{
    char buf[TMP_BUF_SIZE];

    disconnect();

    if (close(Admin_sock) == -1) {
        sprintf(buf, "Can't close admin socket (%s)", strerror(errno));
        log_mesg(ERROR, buf);
    }

    if (close(User_sock) == -1) {
        sprintf(buf, "Can't close user socket (%s)", strerror(errno));
        log_mesg(ERROR, buf);
    }

    /* jump to directory we started from */
    if (chdir(Invoke_dir) == -1) {
        sprintf(buf, "Can't change to directory %s", Invoke_dir);
        fatal_error(buf, strerror(errno));
    }

    /* make sure sockets are closed */
    (void) close(Admin_sock);
    (void) close(User_sock);

    log_mesg(INFO, "next instruction is execv(...self...)");

    /* Re-execute us */
    execv(Invoke_command[0], Invoke_command);

    /* should not get here */
    fatal_error("Execv failed", strerror(errno));
}
------- end ----------------------------
Dan