HI
I'm sure this is a frequently asked question, but I'm stumped...
Running an F330 (4.2a) with some Sun Solaris 2.5.1 systems.
/var/adm/messages gives :
May 12 23:51:52 cleats unix: NFS write error on host hardrock: Stale NFS file handle.
May 12 23:51:52 cleats unix: (file handle: 939e1400 577cd30c 0 856f1000 49ee7326 0 2ae0500 51356400)
May 13 00:07:24 cleats unix: NFS write error on host hardrock: Stale NFS file handle.
May 13 00:07:24 cleats unix: (file handle: 939e1400 577cd30c 0 816f1000 3fda490e 0 2ae0500 51356400)
May 13 03:13:35 birdland unix: NFS write error on host hardrock: Stale NFS file handle.
May 13 03:13:35 birdland unix: (file handle: 939e1400 577cd30c 0 856f1000 d1f07326 0 2ae0500 51356400)
May 13 03:24:07 therat unix: NFS write error on host hardrock: Stale NFS file handle.
May 13 03:24:07 therat unix: (file handle: 939e1400 577cd30c 0 866f1000 9b927326 0 2ae0500 51356400)
May 13 11:38:28 mcbutts unix: NFS write error on host hardrock: Stale NFS file handle.
May 13 11:38:28 mcbutts unix: (file handle: 939e1400 577cd30c 0 866f1000 df967326 0 2ae0500 51356400)
Now, I'm trying to figure out WHAT file(s) are doing this. I've read NetApp's "Troubleshooting NFS Stale File Handles on Sun Clients (Series of FAQs)" doc and SunWorld's "Errno Libretto", which discuss tracing these things, but nothing is what it should be.
**1) "use showfh". That's not in 2.5.1. I got a copy of fhfind, but that doesn't work either.
1a) First, it looks in /etc/mnttab for the device id to decide where to search. NO DEVICE ID MATCHES!!!!
1b) it uses the 4th number for the inode number. WRONG -- that translates into a 10 digit INODE number. All the inode numbers are 5-7 digits.
**2) NetApp's doc states that the second number is the device number of the file system, and the third number is the inode number. NO, the 3rd number is "0".
My guess is that the filer uses a 64bit INODE number, while solaris is using a 32bit inode. How can I figure out what is going on??
----------- Fujitsu - Nexion, St. Louis, MO Jay Orr (314) 579-6517
May 12 23:51:52 cleats unix: NFS write error on host hardrock: Stale NFS file handle.
May 12 23:51:52 cleats unix: (file handle: 939e1400 577cd30c 0 856f1000 49ee7326 0 2ae0500 51356400)
We have seen this message when two ksh processes running on two DIFFERENT NFS clients both use the SAME .history file. Note that two ksh processes on the same host can safely use the same .history file.
This situation can easily happen if you have several workstations that mount home directories from a central NFS server. If a user has login sessions on two workstations, he is probably using the same history file in each session.
I would be very interested to know why this happens. It is an NFS and ksh problem and is not limited to NetApp filers. We've seen it on Solaris and AIX NFS servers, too. We've only seen the problem with ksh.
We work around this problem by recommending that folks use different history files on different hosts. In my home directory I have a directory called .history and in my .profile I have this:
HISTFILE=$HOME/.history/`hostname`
Steve Losen scl@virginia.edu phone: 804-924-0640
University of Virginia ITC Unix Support
I appreciate the reply; however, we don't have home directories on the toasters. Basically, we use ClearCase (a version control management program) and use the toaster for all the files. Compilations and updates are run overnight.
The thing is, we can't think of anything that would be using the same files in a way that would cause this problem, and we have no way of finding out which files are causing the errors.
----------- Fujitsu - Nexion, St. Louis, MO Jay Orr (314) 579-6517
2) NetApp's doc states that the second number is the device number of the file system,
NetApp's doc is full of prunes. I'll have to go yell at whoever wrote it. We don't have device numbers in the "dev_t" sense....
and the third number is the inode number. NO, the 3rd number is "0".
File handles in pre-5.0 releases consist of:
    32-bit file ID for mount point
    32-bit generation count for mount point
    16-bit set of flags for file
    8-bit snapshot ID in which file resides
    8-bit unused byte
    32-bit file ID for file
    32-bit generation count for file
    16-bit set of flags for export point
    8-bit snapshot ID in which export point resides
    8-bit unused byte
    32-bit file ID for export point
    32-bit generation count for export point
where "mount point" is the file or directory the client mounted, and "export point" is the exported file or directory that file or directory is or resides under.
All numbers are *little-endian*, as the processors on filers are little-endian. The exception is the "nfs.big_endianize_fileid" option, which compensates for NFS clients that hash internal data structures based on a part of the file handle likely to be zero if the file's file ID is little-endian; with that option turned on, the file's file ID (but not the export point's file ID or the mount point's file ID) is big-endian.
So that file handle is:
    939e1400    mount point file ID: 0x00149E93, or 1351315
    577cd30c    mount point generation count: 0x0CD37C57, or 215186519
    0           file flags, snapshot ID, unused
    856f1000    file ID: 0x00106F85, or 1077125
    49ee7326    generation count: 0x2673EE49, or 645131849
    0           export point flags, snapshot ID, unused
    2ae0500     export point file ID: 0x0005AE02, or 372226
    51356400    export point generation count: 0x00643551, or 6567249
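For what it's worth, that unscrambling is mechanical enough to script. Here is a minimal Python sketch (an illustration, not a NetApp tool) that decodes the eight words a Solaris client logs, assuming the pre-5.0 layout and byte order described above; note that if "nfs.big_endianize_fileid" is turned on, the file's own file ID would not need swapping:

    #!/usr/bin/env python
    # Decode a pre-5.0 NetApp file handle as logged by a Solaris NFS client, e.g.
    #   decode_fh.py 939e1400 577cd30c 0 856f1000 49ee7326 0 2ae0500 51356400
    # Sketch only: field layout and byte order are taken from the description above.

    import sys

    FIELDS = [
        ("mount point file ID",                        True),
        ("mount point generation count",               True),
        ("file flags / snapshot ID / unused",          False),
        ("file ID",                                    True),
        ("file generation count",                      True),
        ("export point flags / snapshot ID / unused",  False),
        ("export point file ID",                       True),
        ("export point generation count",              True),
    ]

    def swap32(n):
        # The client prints each 32-bit word as a big-endian number; reverse
        # the bytes to recover the filer's little-endian value.
        return ((n & 0x000000ff) << 24) | ((n & 0x0000ff00) << 8) | \
               ((n & 0x00ff0000) >> 8) | ((n & 0xff000000) >> 24)

    def decode(words):
        for (name, needs_swap), word in zip(FIELDS, words):
            n = int(word, 16)
            value = swap32(n) if needs_swap else n
            print("%-45s %-9s -> 0x%08X (%d)" % (name, word, value, value))

    if __name__ == "__main__":
        decode(sys.argv[1:])

Fed the first handle from the /var/adm/messages excerpt, it reproduces the decode above (1351315, 215186519, 1077125, 645131849, 372226, 6567249).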
In 5.0, we had to add a file system ID; file handles in 5.0 and later consist of:
    32-bit file ID for mount point
    32-bit generation count for mount point
    8-bit snapshot ID in which file resides
    8-bit unused byte
    32-bit file ID for file
    32-bit generation count for file
    32-bit volume ID for file
    32-bit file ID for export point
    32-bit combined snapshot ID and generation number for export point;
        the upper 8 bits of that 32-bit quantity are the snapshot ID and
        the lower 24 bits are the generation number
The file's file ID (but not the mount point's or export point's file ID) is *big-endian*. One of the flags lets the filer tell old-format from new-format file handles, so we can send back old-format file handles in response to requests that carry old-format file handles. Clients can get quite confused if they get two different file handles for the same file (yes, that really happens with many, perhaps most, UNIX clients: they allocate two different internal data structures of the type mentioned above - "rnodes" - for the two different file handles, and thus don't realize that they're the same file).
My guess is that the filer uses a 64-bit inode number,
Nope, 32 bits.
Note, though, that "stale file handle" means there's no file on the server with that file handle; a search for the inode number might find a file, and that *might* be a new file with the same pathname (e.g., because some program renamed another file on top of it), or it might be some unrelated file that got assigned that file ID.
Unfortunately, there's no command to print out the volume IDs of volumes (unless I've missed it), so you can't use that to find the file system.
...although I guess you could do "ls -lid" on all the directories on which stuff was mounted from the server in question, looking for one with an inumber equal to the mount point file ID from the file handle in question.
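Scripted, that check amounts to stat()ing each NFS mount point on a client and comparing its inode number with the (byte-swapped) mount point file ID. A rough Python sketch, assuming the Solaris /etc/mnttab field order of resource, mount point, fstype, options, time (the script name and output format are just illustrative):

    #!/usr/bin/env python
    # find_mount.py: report which NFS mount point on this client has an inode
    # number equal to the (byte-swapped) mount point file ID from a stale
    # file handle.  Sketch only; assumes the Solaris /etc/mnttab field order.

    import os
    import sys

    def find_mount(mount_point_file_id, server=None):
        with open("/etc/mnttab") as f:
            for line in f:
                fields = line.split()
                if len(fields) < 3 or fields[2] != "nfs":
                    continue
                resource, mount_point = fields[0], fields[1]
                if server and not resource.startswith(server + ":"):
                    continue
                try:
                    st = os.stat(mount_point)
                except OSError:
                    continue
                if st.st_ino == mount_point_file_id:
                    print("%s (%s) has inumber %d" % (mount_point, resource, st.st_ino))

    if __name__ == "__main__":
        # e.g.: find_mount.py 1351315 hardrock
        find_mount(int(sys.argv[1]), sys.argv[2] if len(sys.argv) > 2 else None)

That at least narrows a stale handle down to a file system; it won't tell you which file, for the reasons given above.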
Ahh, that clears up quite a few things. So, two more questions:
1) Where/how would I turn on "nfs.big_endianize_fileid"?
2) Is there a way of converting the numbers without having to do it by hand?
----------- Fujitsu - Nexion, St. Louis, MO Jay Orr (314) 579-6517
Ahh, that clears up quite a few things. So, two more questions:
- Where/how would I turn on "nfs.big_endianize_fileid"?
Well, let me first answer that question with another question:
Why do you want to do so?
If you're running a pre-5.0 release, doing so invalidates all file handles that the machine has handed out in the past, so you'll have to remount all clients that have mounted the machine.
If you're running 5.0 or later, it invalidates only file handles handed out before the machine had 5.0 installed on it...
...but, given that it doesn't affect those other file handles, either
1) doing so won't require you to remount, but it also won't do anything interesting (if all machines have remounted since 5.0 or later was put on the machine, or if the machine has run 5.0 or later since Day One)
or
2) doing so will require you to remount at least some machines, and it'll probably be difficult to figure out which ones, except by setting the option and seeing what machines get stale file handles (if some machines mounted before 5.0 or later was put on the machine, and hadn't remounted since 5.0 or later was put on the machine).
All it buys you is
1) better performance with *some* client OSes (some versions of IRIX and HP-UX, I think, get better performance; I don't know which versions)
and
2) big-endian file IDs, so that if you decide you can actually get an interesting answer from the procedure I described, the effort involved is *slightly* reduced.
Now, if you still want to do that, "nfs.big_endianize_fileid" is a standard (documented) option, so you'd just:
unmount the clients that will need to remount;
put "options nfs.big_endianize_fileid" in "/etc/rc" (which you might have to do before the remount, if *all* clients need to remount);
type "options nfs.big_endianize_fileid" on the console (ah, the joys of the UNIX tradition of "change the rc file, and then re-type the command on the console", which we, alas, follow);
remount the clients in question.
- Is there a way of converting the numbers without having to do it by hand?
None that I know of, other than writing a program to do it.
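For what it's worth, the heart of such a program is just a byte swap per 32-bit word. A tiny Python 3 snippet, using the first word of the handle logged at the top of the thread, shows the idea (the fuller sketch earlier in the thread just does this for all eight words and labels them):

    word = 0x939e1400                                         # as printed by the Solaris client
    print(int.from_bytes(word.to_bytes(4, "big"), "little"))  # 1351315, the mount point file ID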