On Wed 10 Mar, 1999, Philip Thomas thomas@act.sps.mot.com wrote:
Hi, An experienced sys admin, a colleague of mine, argues that NFS is "fundamentally broken", in the sense that because of its stateless architecture, an NFS client can "hang" during shutdown if its NFS server(s) do not respond. This is aggravated when an "ack" is lost between server and client. He claims to have gone through the NFS code while he was at a prestigious university. Granted, I am not an expert, but in all the years I have worked with NFS, I never had a problem shutting down a system because its NFS mount was not "responding". But I am hoping folks with a "theoretical" background can shed some more light on the above argument.
Both from experience, and from what I know of the workings of NFS, it *is* possible for a client to hang for valid reasons, even when shutting down, on mounts from unresponsive NFS servers.
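You can work around this by using dangerous options like soft mounts (dangerous because a soft mount hands an I/O error back to the application once its retries run out, and an application that doesn't check for it can end up with incomplete or corrupt data), or by killing the processes that are using the NFS-mounted areas and forcing the client to umount the problem mounts.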
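Before killing things, it helps to know which mounts are actually stuck. Here's a rough sketch of a probe in Python, on the assumption that a stat() of the mount point is a fair test of whether the server is answering. The function name mount_responds and the 5-second default are my own inventions, not part of any NFS tool. The stat runs in a child process so that a hung call blocks the child and not the script doing the checking.

```python
import subprocess
import sys

def mount_responds(path, timeout_s=5.0):
    """Return True if stat() on `path` completes within timeout_s seconds.

    Hypothetical helper: the stat runs in a child Python process so a
    hung NFS call blocks that child, not the caller.
    """
    cmd = [sys.executable, "-c", "import os, sys; os.stat(sys.argv[1])", path]
    proc = subprocess.Popen(cmd, stdout=subprocess.DEVNULL,
                            stderr=subprocess.DEVNULL)
    try:
        # Exit status 0 means the stat succeeded in time.
        return proc.wait(timeout=timeout_s) == 0
    except subprocess.TimeoutExpired:
        # Best effort only: a child blocked on a hard, non-intr mount
        # may ignore the kill, so we do not wait for it to exit.
        proc.kill()
        return False
```

Run it over the NFS mount points you care about and only go after the ones that return False. Note it also returns False for a path that simply doesn't exist, so feed it real mount points.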
Excitingly, NetApps use clever status monitoring so that they can maintain NFS locks (even though they're just advisory) when the server reboots, and they, as servers, can suffer from clients that have died with locks still held. NetApps spend some time every few minutes looking for such clients, and slow down a bit while they do so. They also syslog a lot about that sort of thing. So there's a bit of a two-way street here, even though NFS is putatively stateless.
We recently hit a bug related to that and a very helpful NetApp engineer (who I won't name to save him any embarrassment) is looking at how to smooth out that sort of wrinkle in the future - in fact at least part of that is going into a patch release in the near future.
Check out the differences between the intr and non-intr, and the soft and hard, mount options in the man pages and in Hal Stern's book (Managing NFS and NIS) - I think that might make things clearer.
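To make the hard/soft distinction concrete, here's a toy model in Python. It is emphatically not the kernel's code: make_server, nfs_request, the "EIO" string and max_steps are all made up for illustration, and retrans is treated simply as "number of retries before a soft mount gives up". It doesn't model intr at all; intr only decides whether a signal can break a process out of the hard-mount loop.

```python
class ServerDown(Exception):
    """Raised by the simulated transport when a request gets no reply."""

def make_server(fail_first_n):
    """Return a fake RPC call that times out fail_first_n times, then answers."""
    state = {"calls": 0}
    def call():
        state["calls"] += 1
        if state["calls"] <= fail_first_n:
            raise ServerDown()
        return "ok"
    return call

def nfs_request(call, soft, retrans=3, max_steps=1000):
    """Toy model of client retry behaviour; returns (result, attempts).

    soft=True : give up after `retrans` retries and report an I/O error.
    soft=False: keep retrying; max_steps exists only so the demo terminates,
                a real hard mount has no such limit.
    """
    attempts = 0
    while attempts < max_steps:
        attempts += 1
        try:
            return call(), attempts
        except ServerDown:
            if soft and attempts > retrans:
                return "EIO", attempts
    return "still hanging", attempts

# Server misses the first 10 requests, then comes back.
print(nfs_request(make_server(10), soft=True))    # ('EIO', 4)
print(nfs_request(make_server(10), soft=False))   # ('ok', 11)
```

The soft mount fails fast and the application has to cope with the error; the hard mount gets the right answer but only once the server returns, which is exactly the hang at shutdown if it never does.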
This brings me to the second question. The same person emphasizes the above issue as an added reason for not having NFS mount points on his HP High Availability (ServiceGuard) systems. Is there any site out there, in this wide world, using HP's ServiceGuard (or any other HA product) on a system that is an NFS client, that would like to comment on their experience in private or in this forum?
Ah. There he has a point in principle, but in practice I'd expect there to be no particular issue.
We run a couple of Sun HA1.3 clusters here, and there's a Sun Cluster 2.x going in that I've had some sight of. They all use NFS mounts to import filesystems from servers, both Unix and NetApp. The design of HA clusters basically hinges on timeouts (as NFS does itself, thinking about it). If a node in an HA cluster must go down, it *will*, even if it's hanging on NFS mounts from elsewhere - usually because one timeout or other will trigger its suicide, or cause a sibling to commit fratricide (sororicide?).
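The timeout idea boils down to something like the following Python sketch. This is not how Sun HA or ServiceGuard actually do it; nodes_to_fence, the node names and the 30-second figure are all invented for illustration. The point is only that a node stuck on an NFS mount stops heartbeating, and its sibling acts on the silence without caring why.

```python
def nodes_to_fence(last_heartbeat, now, timeout_s):
    """Return the names of nodes whose last heartbeat is older than timeout_s.

    last_heartbeat maps a (hypothetical) node name to the time, in seconds,
    at which that node was last heard from.
    """
    return sorted(node for node, seen in last_heartbeat.items()
                  if now - seen > timeout_s)

# nodeB hung on an NFS mount 35 seconds ago and has been silent since.
heartbeats = {"nodeA": 100.0, "nodeB": 70.0}
print(nodes_to_fence(heartbeats, now=105.0, timeout_s=30.0))   # ['nodeB']
```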
Obviously suicide or fratricide leaves a machine open to problems that a neat and tidy shutdown, one that doesn't hang on NFS mounts, would avoid. The fact that one node is having problems with an NFS mount probably means the other node(s) will too, and that will impact the overall HA cluster, which is bad(tm). So the secret is to make sure any mounts in an HA cluster are (a) absolutely required for the function the cluster provides and (b) at least as reliable as the cluster is intended to be.
I'm happy to use our NetApps' filesystems on our HA clusters. I happen to think the group implementing the Cluster 2.x environment are storing up problems for themselves by using Unix NFS servers. Your mileage will vary.
Hope that's helpful.
Philip Thomas Motorola - PEL, M/S M350 2200 W. Broadway M350 Mesa, AZ 85202 rxjs80@email.sps.mot.com (602) 655-3678 (602) 655-2285 (fax) -- End of excerpt from Philip Thomas