We are experiencing a repeatable problem when creating files with long filenames under Digital Unix 4.0 on a filesystem mounted from a NetApp filer. The files are not visible until the filesystem is remounted - that is, when you create the files, you cannot see them. If you mount the filesystem at another mount point, unmount it or remount it, or simply look at it on another nfs client, you can see the files. This apparently only happens when using nfs version 3.
We have observed this bug only between a Digital Unix 4.0 (4.0A and 4.0D, at least) client and a NetApp server. We have tested Solaris 2.5 and 2.6 clients, and Digital Unix, NetApp, Auspex, and Solaris servers.
Anyone have an idea what could cause this?
Naturally, the first response from Digital is that it is likely to be a NetApp problem, and the first response from NetApp is that it is likely to be a Digital Unix problem. Any suggestions on how to determine where the problem lies?
Here is a script from a DEC support person that demonstrates the problem:
--------------------
#!/bin/csh
# create a log of datafiles and links to test nfs problem
# initialize the counter (the script as posted did not set it)
set num = 0
while ( $num < 100 )
    @ num = $num + 1
    cat /etc/printcap > reallyreallyverylongnamedatafile.$num
    ln -s reallyreallyverylongnamedatafile.$num 1.$num
    echo "$num is done"
end
echo done
exit
--------------------
When you run it, it should generate 100 files and 100 links. You should be able to run the script, and then "ls | wc" and see about 200 words. When the bug appears, you see only ~160 words. However, you can run the "ls | wc" on another nfs client mounting the same filesystem, and see the expected 200 words.
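Beyond counting words, it may help to see exactly which names the client fails to list. The following is a minimal Python sketch of that check, not part of the DEC script; the helper names `expected_names` and `missing_entries` are made up for illustration, and the file names are taken from the script above.

```python
import os

def expected_names(count=100):
    """Names the test script should create: one data file and one symlink per pass."""
    names = []
    for n in range(1, count + 1):
        names.append(f"reallyreallyverylongnamedatafile.{n}")
        names.append(f"1.{n}")
    return names

def missing_entries(directory, count=100):
    """Return the expected names that a listing of `directory` does not show."""
    # os.listdir reports symlinks too, so both the files and the links are checked.
    visible = set(os.listdir(directory))
    return [name for name in expected_names(count) if name not in visible]
```

Calling `missing_entries(".")` in the test directory on the affected client, and again on a second NFS client, should show whether the same names are absent in both places or only on the Digital Unix side.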
Thanks,
David Ritch
> Naturally, the first response from Digital is that it is likely to be a NetApp problem, and the first response from NetApp is that it is likely to be a Digital Unix problem. Any suggestions on how to determine where the problem lies?
What software release is the NetApp box running?
Bug 4604, "NFS v3 layer ignores readdirplus dircount parameter", has caused problems with DU 4.0 clients; the fix is in 5.0, so if you were running a pre-5.0 release, that might be the cause of this problem.
I missed Guy's e-mail before my earlier e-mail. I've confirmed this problem against an internal future release that has the fix for bug 4604. We are continuing to investigate. I did notice in my attempts to reproduce the problem last night that it did not happen every time.
John Edwards
On Thu, 17 Sep 1998, Guy Harris wrote:
> Bug 4604, "NFS v3 layer ignores readdirplus dircount parameter", has caused problems with DU 4.0 clients; the fix is in 5.0, so if you were running a pre-5.0 release, that might be the cause of this problem.
> Naturally, the first response from Digital is that it is likely to be a NetApp problem, and the first response from NetApp is that it is likely to be a Digital Unix problem. Any suggestions on how to determine where the problem lies?
I would guess that if you exported a filesystem from a SUN, and could repeat the problem on the DEC box, you would have good ammo to chew out someone at DEC. If the Sun export works, then start chewing on NetApp.
----------
#!/bin/csh
# a few modifications (use lptest instead of printcap)
set num=0
while ( $num < 100 )
    @ num = $num + 1
    lptest > reallyreallyverylongnamedatafile.$num
    ln -s reallyreallyverylongnamedatafile.$num 1.$num
    echo "$num is done"
end
echo "And the magic answer is..."
ls | wc
exit
----------
On sparc 20, mounting from a F330 running 4.0.1c:
dunsel% ls | wc
     201     201    4119
- Christoph
- Hey greg, remember dunsel???
We are investigating this issue internally. We have reproduced it. Our initial investigations show that the Digital Unix client sends a readdir request and we respond to it. We mark in our reply that we are not at the end of the directory, but the client never comes back for more. Our suspicion is that the client is deciding it has all of the directory information based on some client-side cached information or the cookie we pass back. We have not yet determined with certainty whether it is a client or server bug.
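To make the exchange described above concrete, here is a toy Python model of it. This is only a sketch: `ToyServer`, `ReaddirReply`, and both listing functions are hypothetical names, the cookie is an offset purely for the sake of the toy (a real client must treat it as opaque), and the reply size of 160 entries is an assumption chosen to mirror the ~160 count reported earlier.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class ReaddirReply:
    names: List[str]  # entries carried in this reply
    cookie: int       # resume token: hand it back unchanged, never interpret it
    eof: bool         # True only when no entries remain

class ToyServer:
    """Hypothetical stand-in for an NFS server; not a real NFS API."""
    def __init__(self, names, per_reply):
        self.names = list(names)
        self.per_reply = per_reply

    def readdir(self, cookie):
        chunk = self.names[cookie:cookie + self.per_reply]
        next_cookie = cookie + len(chunk)
        return ReaddirReply(chunk, next_cookie, next_cookie >= len(self.names))

def list_directory(server):
    """Keep issuing readdir requests until the server reports end of directory."""
    names, cookie = [], 0
    while True:
        reply = server.readdir(cookie)
        names.extend(reply.names)
        if reply.eof:
            return names
        cookie = reply.cookie

def list_directory_one_request(server):
    """The behaviour seen on the wire: one request, eof flag ignored."""
    return server.readdir(0).names
```

With 200 names and 160 entries per reply, `list_directory` returns all 200 while `list_directory_one_request` returns only the first 160.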
John Edwards Member of Technical Staff Network Appliance
On Thu, 17 Sep 1998, Christoph Doerbeck wrote:
> > Naturally, the first response from Digital is that it is likely to be a NetApp problem, and the first response from NetApp is that it is likely to be a Digital Unix problem. Any suggestions on how to determine where the problem lies?
>
> I would guess that if you exported a filesystem from a SUN, and could repeat the problem on the DEC box, you would have good ammo to chew out someone at DEC. If the Sun export works, then start chewing on NetApp.
>
> #!/bin/csh
> # a few modifications (use lptest instead of printcap)
> set num=0
> while ( $num < 100 )
>     @ num = $num + 1
>     lptest > reallyreallyverylongnamedatafile.$num
>     ln -s reallyreallyverylongnamedatafile.$num 1.$num
>     echo "$num is done"
> end
> echo "And the magic answer is..."
> ls | wc
> exit
>
> On sparc 20, mounting from a F330 running 4.0.1c:
> dunsel% ls | wc
>      201     201    4119
>
> - Christoph
> - Hey greg, remember dunsel???
> We are investigating this issue internally. We have reproduced it. Our initial investigations show that the Digital Unix client sends a readdir request and we respond to it. We mark in our reply that we are not at the end of the directory, but the client never comes back for more. Our suspicions are that the client is deciding it has all of the directory information based on some client side cached information or the cookie we pass back. We have not yet determined with certainty whether it is a client or server bug.
For what it's worth, we had a similar bug with Sun clients and NFSv2 at one point.
The problem then was that we were sending back a packet with the "I have more directory data" flag set, but we didn't actually include any directory entries in the packet. Despite the flag being set, the client didn't ask for any more data.
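That older failure can be sketched in a few lines of Python. The function and the `stop_on_empty` switch are hypothetical, and treating an empty reply as the end of the directory is only a guess at why the client stopped asking; replies are modelled as `(entries, eof)` pairs in the order a server would send them.

```python
def read_all(replies, stop_on_empty):
    """Collect entries from (entries, eof) replies, stopping as the client would."""
    seen = []
    for entries, eof in replies:
        seen.extend(entries)
        if eof:
            break  # server says the directory is finished
        if stop_on_empty and not entries:
            break  # assumed client bug: an empty reply is taken as the end
    return seen

# A reply with no entries but with "more data" still pending, then the rest.
replies = [(["a", "b"], False), ([], False), (["c"], True)]
```

Here `read_all(replies, stop_on_empty=True)` yields `["a", "b"]`, while a client that relies only on the end-of-directory flag gets `["a", "b", "c"]`.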
The chances seem low that this same issue re-emerged in NFSv3, but I thought the info might be useful anyway.
Dave
Based on our experiments and the observed behavior of Digital Unix clients, we now believe that this problem is due to a client bug. More specifically, this appears to be caused by a client illegally using the (supposedly opaque) 'cookie' that the filer sends as part of a 'readdir' response to determine if the directory has grown.
In this case, because the 'cookie' in the response to its first 'readdir' request matches its cached copy, the client appears to incorrectly conclude that the directory hasn't grown, and so does not send any further 'readdir' requests. The client also ignores the fact that end-of-file is not set in the response.
We were also able to modify the test case so that it misbehaves even when run against a Solaris 2.5.1 NFS server.
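The suspected client logic can be illustrated with a small Python sketch. Everything here is hypothetical (the `readdir` helper, the `CachingClient` class, the `trust_cookie` switch), and the cookie is an offset only so the toy stays short; it is a model of the behavior we infer from traces, not the Digital Unix code.

```python
def readdir(names, cookie, per_reply):
    """Toy server call: returns (entries, next_cookie, eof)."""
    chunk = names[cookie:cookie + per_reply]
    next_cookie = cookie + len(chunk)
    return chunk, next_cookie, next_cookie >= len(names)

class CachingClient:
    def __init__(self, trust_cookie):
        self.trust_cookie = trust_cookie  # True models the suspected bug
        self.cached_cookie = None         # cookie from the first reply of the last listing

    def listing(self, names, per_reply):
        entries, cookie, eof = readdir(names, 0, per_reply)
        if self.trust_cookie and cookie == self.cached_cookie:
            # Suspected bug: a matching cookie is read as "directory unchanged",
            # so the client stops here even though eof is not set.
            return list(entries)
        self.cached_cookie = cookie
        result = list(entries)
        while not eof:
            entries, cookie, eof = readdir(names, cookie, per_reply)
            result.extend(entries)
        return result
```

If a client created with `trust_cookie=True` lists a four-entry directory at four entries per reply, and the directory then grows to eight entries, its second listing still returns only the first four. A client with `trust_cookie=False` returns all eight.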
The reason why this problem would occur much more commonly with a NetApp filer is related to the way directory compaction is done on the filer as compared to the Solaris server, and also the amount of directory data returned in a 'readdir' response.
The modified test script follows.
---------------------------
#!/usr/ucb/csh
# create a log of datafiles and links to test nfs problem
set num = 0
while ( $num < 16 )
    @ num = $num + 1
    cat /etc/zoneinfo > rereallyreallyreallyreallyreallyreallyreallyreallyreallyreallyreallyreallyreallyreallyreallyreallyreallyreallyreallyreallyreallyreallyreallyreallyreallyreallyreallyreallyreallyreallyreallyreallyreallyreallyreallyreallyreallyreallylongnamedatafile.$num
    echo "$num is done"
end
ls | wc
ls | wc
set num = 32
while ( $num < 64 )
    @ num = $num + 1
    cat /etc/zoneinfo > reallyreallyreallyreallyreallyreallyreallyreallyreallyreallyreallyreallyreallyreallyreallyreallyreallyreallyreallyreallyreallyreallyreallyreallyreallyreallyreallyreallyreallyreallyreallyreallyreallyreallyreallyreallyreallyreallylongnamedatafile.$num
    echo "$num is done"
end
ls | wc
echo done
exit
---------------------------
Rajesh Sundaram