I'm evaluating NFS server appliances to be used by a couple of dozen clients (currently all Solaris, but possibly Linux and HP-UX in future), all running a distributed application. We're looking at how clusters handle failing over NFS serving from one NetApp to another.
From the research we've done on NetApp and other vendors, it looks like it takes about 20 seconds for the NFS server daemon to come back, but the thing that kills many vendors is that client lock recovery is slow and/or buggy. The lockd grace period is tunable down from 45s on most servers, but at the default that still means you are down for about 65 seconds (20s for the daemon plus 45s of grace period), and that's pretty painful. And that's assuming the server implementation is clever enough to correctly transfer knowledge about locks from the dead node to the active server node, something we have seen fail on at least one major vendor's implementation.
Can anyone share their experiences of failing over clustered NetApps for NFS? What kind of failover times do you see (including lock recovery)? Does anything not work properly?
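If it helps to compare numbers on the same basis, here is a rough sketch of how the lock outage could be timed from a client during a failover test. It assumes Python 3 is available on the client and that the mount is a hard mount, so calls stall rather than fail; the function names are made up for illustration. It only times new lock requests, so it says nothing about whether locks held before the failover are reclaimed correctly.

```python
import fcntl
import os
import time


def time_lock_cycle(path):
    """Open path, take and release an exclusive advisory lock; return seconds taken."""
    start = time.monotonic()
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        # On a hard NFS mount this call should stall while the server is
        # down or still refusing new locks during its grace period.
        fcntl.lockf(fd, fcntl.LOCK_EX)
        fcntl.lockf(fd, fcntl.LOCK_UN)
    finally:
        os.close(fd)
    return time.monotonic() - start


def longest_stall(path, probes, interval=1.0):
    """Run `probes` lock cycles, `interval` seconds apart; return the slowest one."""
    worst = 0.0
    for _ in range(probes):
        worst = max(worst, time_lock_cycle(path))
        time.sleep(interval)
    return worst
```

Calling longest_stall() on a file under the NFS mount (for example with probes=300) and triggering the failover while it runs should give a figure comparable to the 65 seconds above.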
tia
--herb