AARRRGGHH!!!
Ok, so I queried the list about this a year ago (early May '99) but am dealing with it again. Guy Harris was kind enough to help, and if you're out there I could use your help again.
We had a power outage a week ago that shut down all our desktop systems - but not our filers and servers. Now, at random times during the day, a large majority of the systems gripe:
NFS write error on host hardrock: Stale NFS file handle. (file handle: 2ae0500 51356400 20000000 aa5c8 54417810 5c140000 2ae0500 51356400)
They all report the same file handle, and the toaster has no error messages in its etc/messages file. I tried rebooting a few systems, but they still generate these error messages. I was trying to figure out which file it is by extracting an inode number, but what Guy said before doesn't match:
In 5.0, we had to add a file system ID; file handles in 5.0 and later consist of:
32-bit file ID for mount point
32-bit generation count for mount point
8-bit snapshot ID in which file resides
8-bit unused byte
32-bit file ID for file
32-bit generation count for file
32-bit volume ID for file
32-bit file ID for export point
32-bit combined snapshot ID and generation number for export point; the upper 8 bits of that 32-bit quantity are the snapshot ID and the lower 24 bits are the generation number.
The file's file ID (but not the mount point's or export point's file ID) is *big-endian*;
So, any ideas as to how to figure this out?
Thanks!
----------- Jay Orr Systems Administrator Fujitsu Nexion Inc. St. Louis, MO
In 5.0, we had to add a file system ID; file handles in 5.0 and later consist of:
32-bit file ID for mount point
32-bit generation count for mount point
8-bit snapshot ID in which file resides
8-bit unused byte
That should have been
32-bit file ID for mount point
32-bit generation count for mount point
16-bit file handle flags
8-bit snapshot ID in which file resides
8-bit unused byte
so the file handle in question becomes (with leading zeroes added, to make it easier to byte-swap them, assuming your NFS clients are big-endian):
02ae0500 mount point file ID (decimal 372226)
51356400 mount point gen count (decimal 6567249)
00 snapshot ID
00 unused
2000 flags (hex 0020, or WAFL_FH_MULTIVOLUME, i.e. it's a 5.0-and-later-format file handle)
000aa5c8 file ID for file (decimal 697800 - remember, this file ID is big-endian, to keep certain NFS clients that use only certain bits of it when hashing their internal per-file-from-NFS-mounted-file-system structures from hashing lots of files to the same bucket)
54417810 gen count for file (decimal 276316500)
5c140000 volume ID for file (decimal 5212, hex 145c)
02ae0500 file ID for export point (decimal 372226)
51356400 snapID/gen for export (decimal 6567249)
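The byte-swapping in the breakdown above can be checked mechanically. What follows is only a minimal sketch in Python, assuming the client prints the handle as eight 32-bit words in memory order with leading zeroes dropped, and assuming the corrected field order given above (16-bit flags, then snapshot ID, then the unused byte). The function name and the dictionary keys are made up for illustration; they are not WAFL or Data ONTAP names.

```python
import struct


def decode_wafl_handle(printed):
    """Split a printed 32-byte file handle into the fields described above.

    Hypothetical helper: the name and the returned keys are illustrative only.
    """
    # Restore the leading zeroes the client dropped, then rebuild the raw bytes.
    raw = bytes.fromhex("".join(word.zfill(8) for word in printed.split()))
    if len(raw) != 32:
        raise ValueError("expected eight 32-bit words")

    # Mount point file ID, mount point gen count, flags, snapshot ID, unused:
    # all little-endian.
    mount_id, mount_gen, flags, snap_id, _unused = struct.unpack_from("<IIHBB", raw, 0)
    # The file's own file ID is the one big-endian field.
    (file_id,) = struct.unpack_from(">I", raw, 12)
    # File gen count, volume ID, export file ID, export snapID/gen: little-endian.
    file_gen, volume_id, export_id, export_snapgen = struct.unpack_from("<IIII", raw, 16)

    return {
        "mount_id": mount_id,
        "mount_gen": mount_gen,
        "flags": flags,
        "snap_id": snap_id,
        "file_id": file_id,
        "file_gen": file_gen,
        "volume_id": volume_id,
        "export_id": export_id,
        # Upper 8 bits are the snapshot ID, lower 24 bits the generation number.
        "export_snap_id": export_snapgen >> 24,
        "export_gen": export_snapgen & 0xFFFFFF,
    }


handle = "2ae0500 51356400 20000000 aa5c8 54417810 5c140000 2ae0500 51356400"
fields = decode_wafl_handle(handle)
for name, value in fields.items():
    print(f"{name:16} {value}")
```

If the clients were little-endian and printed each word byte-swapped, the padding step would still hold but the per-field byte orders would have to be flipped.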
As always, thanks for the good info! I put this info in a text file in my general "Useful stuff" directory. If only I had done that last time...
On Wed, 24 May 2000, Guy Harris wrote:
----------- Jay Orr Systems Administrator Fujitsu Nexion Inc. St. Louis, MO
Jay,
I've unpicked these things on occasion before, but your posting of Guy Harris' description was a useful memory jog! I don't think there's anything wrong with it except that there are three "unused bytes", not one.
NFS write error on host hardrock: Stale NFS file handle. (file handle: 2ae0500 51356400 20000000 aa5c8 54417810 5c140000 2ae0500 51356400)
The snapshot id of 32 means the active filing system.
The mount and export inode number is 0x05ae02 (read little-endian) = 372226. You should be able to identify it from that unless you have unreasonably many export/mount points.
The file inode number is 0xaa5c8 (read big-endian) = 697800. Of course it probably isn't there any longer... But the inode number might have been re-used in the same directory, or some adjacent ones may exist there (because of the way inodes are allocated based on directory), or you might find a snapshot still containing that inode number. A bit of "find ... -inum +697791 -inum -697824 -ls" (the right block of 32 inodes) may be what you need.
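The bounds in that find command come from rounding the inode number down to a multiple of 32. Here is a small sketch in Python of the arithmetic, assuming (as stated above) that inodes are allocated in aligned blocks of 32; the function name is made up, and only the find predicates are built, since the directory to search depends on your setup.

```python
def inum_block_predicates(inum, block_size=32):
    """Return (first, last, predicates) for the aligned block containing inum.

    Hypothetical helper; block_size=32 is taken from the message above.
    """
    first = inum - inum % block_size   # lowest inode number in the block
    last = first + block_size - 1      # highest inode number in the block
    # find's "-inum +N" means greater than N and "-inum -N" means less than N,
    # so step one outside the block on each side.
    predicates = f"-inum +{first - 1} -inum -{last + 1} -ls"
    return first, last, predicates


first, last, predicates = inum_block_predicates(697800)
print(first, last)
print(predicates)
```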
They all have the same File handle, and the toaster has no error messages in its etc/messages files. I tried rebooting a few systems, but they still generate these error messages.
It's theoretically possible for a system to remember an NFS file handle over a reboot, but it's such an evil thing to do that I am reluctant to believe that is your problem! It's more likely that some client has taken to regularly removing a file that is in use by others.
Your error messages look suspiciously like those from Solaris 2+, though you don't say. I have found that it is possible for Solaris to get dirty blocks into its "buffer cache" (really, all of RAM) which it then spends hours periodically trying to write to the server, generating the same "stale NFS file handle" message each time: it does *eventually* give up! I think this happens when the process that wrote the data never fsync'd, and so isn't there any longer to report the error to.
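To illustrate the fsync point: a writer that syncs before it exits gets the error itself, rather than leaving the kernel to retry the write and log on its behalf. This is only a sketch in Python under that assumption; the function name and its return strings are made up, and whether a given client reports ESTALE at fsync time rather than at close is not something this thread establishes.

```python
import errno
import os


def write_and_sync(path, data):
    """Write data and force it to the server before returning.

    Hypothetical helper: returns "ok", or "stale handle" if the write-back
    fails with ESTALE; any other error is re-raised.
    """
    try:
        with open(path, "wb") as f:
            f.write(data)
            f.flush()               # push Python's buffer to the kernel
            os.fsync(f.fileno())    # ask the kernel to write it out now
    except OSError as exc:
        # ESTALE may not be defined on every platform, hence getattr.
        if exc.errno == getattr(errno, "ESTALE", None):
            return "stale handle"
        raise
    return "ok"
```

A process that skips the fsync and exits can leave dirty pages behind with nobody left to hand the error to, which would match the repeated console messages described above.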
Chris Thompson University of Cambridge Computing Service, Email: cet1@ucs.cam.ac.uk New Museums Site, Cambridge CB2 3QG, Phone: +44 1223 334715 United Kingdom.
On Wed, 24 May 2000, Chris Thompson wrote:
Jay,
The mount and export inode number is 0x05ae02 (read little-endian) = 372226. You should be able to identify it from that unless you have unreasonably many export/mount points.
Right, that was the first thing I did - the mount point is, of course, the whole filer :-<
The file inode number is 0xaa5c8 (read big-endian) = 697800. Of course it probably isn't there any longer... But the inode number might have been re-used in the same directory, or some adjacent ones may exist there (because of the way inodes are allocated based on directory), or you might find a snapshot still containing that inode number. A bit of "find ... -inum +697791 -inum -697824 -ls" (the right block of 32 inodes) may be what you need.
This led me to inode numbers in a directory belonging to Big Brother. I recently upgraded it, and perhaps there are problems with running it on multiple machines from a shared directory - I need to investigate further. The file looks like it's in a tmp/ directory where Big Brother keeps track of things like pids and other stuff.
Thanks all for the help!
----------- Jay Orr Systems Administrator Fujitsu Nexion Inc. St. Louis, MO