On Wed 10 Mar, 1999, Philip Thomas thomas@act.sps.mot.com wrote:
Hi. An experienced sysadmin and colleague of mine argues that NFS is "fundamentally broken", in the sense that because of its stateless architecture, an NFS client can "hang" during shutdown if its NFS server(s) do not respond. This is aggravated when an "ack" is lost between server and client. He claims to have gone through the NFS code while he was at a prestigious university. Granted, I am not an expert, but in all the years I have worked with NFS I have never had a problem shutting down a system because its NFS mount was not "responding". I am hoping that folks with a "theoretical" background can shed some more light on the above argument.
Both from experience, and from what I know of the workings of NFS, it *is* possible for a client to hang for valid reasons, even when shutting down, on mounts from unresponsive NFS servers.
You can work around this by using dangerous options like soft mounts, or by killing processes that are using the NFS mounted areas and forcing the client to umount problem mounts.
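As a sketch of that workaround (the mount point /net/homes is hypothetical - substitute your own; details of fuser/umount flags vary by platform, so check your man pages):

```shell
#!/bin/sh
# Sketch: forcibly release a hung NFS mount on a client so shutdown
# can proceed.  /net/homes is a made-up example mount point.
MNT=/net/homes

if grep -q "$MNT nfs" /etc/mtab 2>/dev/null; then
    # Kill every process holding the mount open, then force the umount
    # (-f works on most systems; fall back to a plain umount).
    fuser -k "$MNT"
    umount -f "$MNT" || umount "$MNT"
    MSG="released $MNT"
else
    MSG="no NFS mount at $MNT"
fi
echo "$MSG"
```

On a box where the umount still hangs, a soft mount would have avoided the hang in the first place - at the cost of applications seeing I/O errors, which is why I call that option dangerous.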
Excitingly, NetApps use clever status monitoring such that they can maintain NFS locks (even though those are just advisory) across a server reboot; and, as servers, they can suffer from clients that have died while holding locks. NetApps spend some time every few minutes looking for such clients, slowing down a bit while they do so. They also syslog a lot about that sort of thing. So there's a bit of a two-way street here, even though NFS is putatively stateless.
We recently hit a bug related to that and a very helpful NetApp engineer (who I won't name to save him any embarrassment) is looking at how to smooth out that sort of wrinkle in the future - in fact at least part of that is going into a patch release in the near future.
Check out the differences between intr and non-intr, and soft and hard mount options in the man-pages, and Hal Stern's book - I think that might make things clearer.
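To illustrate the distinction (server and export names here are made up; exact option names and defaults vary by platform, so do check your own mount_nfs man page), an fstab might contain:

```
# hard,intr: retry a dead server forever, but let signals interrupt
# a hung call -- so a stuck process can at least be killed.
server:/export/home   /home     nfs   rw,hard,intr              0 0

# soft: give up after "retrans" retries and return an error to the
# application -- no hang, but writes can be silently lost, which is
# why soft mounts are dangerous for anything that matters.
server:/export/scratch /scratch nfs   rw,soft,retrans=3,timeo=11 0 0
```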
This brings us to the second question. The same person emphasizes the above issue as an added reason for not having NFS mount points on his HP High Availability cluster (ServiceGuard). Is there any site out there, in this wide world, using HP's ServiceGuard (or any other HA product) on a system that is an NFS client, that would like to comment on their experience, in private or in this forum?
Ah. There he has a point in principle, but in practice I'd expect no particular issue.
We run a couple of Sun HA1.3 clusters here, and there's a Sun Cluster 2.x going in that I've had some sight of. They all use NFS mounts to import filesystems from servers, both Unix and NetApp. The design of HA clusters basically hinges on timeouts (as NFS does itself, thinking about it). If a node in an HA cluster must go down, it *will*, even if it's hanging on NFS mounts from elsewhere - usually because one timeout or another will trigger its suicide, or prompt a sibling to commit fratricide (sororicide?).
Obviously suicide or fratricide leaves a machine open to problems, where a neat and tidy shutdown without hanging on NFS mounts would be far superior. The fact that one node is having problems with an NFS mount probably means the other node(s) will too, and that will impact the overall HA cluster, which is bad(tm). So the secret is to make sure any mounts in an HA cluster are (a) absolutely required for the function the cluster provides and (b) at least as reliable as the cluster is intended to be.
I'm happy to use our NetApps' filesystems on our HA clusters. I happen to think the group implementing the Cluster 2.x environment is storing up problems for themselves by using Unix NFS servers. Your mileage will vary.
Hope that's helpful.
Philip Thomas
Motorola - PEL, M/S M350
2200 W. Broadway M350
Mesa, AZ 85202
rxjs80@email.sps.mot.com
(602) 655-3678
(602) 655-2285 (fax)
-- End of excerpt from Philip Thomas
Excitingly, NetApps use clever status monitoring such that they can maintain NFS locks (even though those are just advisory) across a server reboot,
Most NFS servers probably do the same thing - that's just Sun's "status monitor" protocol (NSM, the thing statd implements), which lets clients and servers tell each other when they reboot.
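You can see whether a host is playing that game by asking the portmapper for the status monitor ("status") and lock manager ("nlockmgr") registrations - a sketch, with localhost as an example target:

```shell
#!/bin/sh
# Sketch: check whether a host has NSM ("status") and NLM
# ("nlockmgr") registered with its portmapper.  "localhost" is
# just an example target host.
HOST=localhost
if command -v rpcinfo >/dev/null 2>&1; then
    RESULT=$(rpcinfo -p "$HOST" 2>/dev/null | grep -E 'status|nlockmgr')
    [ -n "$RESULT" ] || RESULT="no NSM/NLM registered on $HOST"
else
    RESULT="rpcinfo not installed here"
fi
echo "$RESULT"
```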