Hello Toasters,
Anybody have any issues with seemingly random ESXi 5.5 NFS datastore disconnects during heavy load?
Our Environment:
ESXi 5.5, F3240, ONTAP 8.1.2P4
It doesn't happen all the time. It only happens during heavy load, and even then there is no guarantee that it will happen. We have yet to find a consistent trigger.
Datastores are mounted via shortname. We are planning to mount via IP address to rule out any name resolution issues but that will take some time. DNS is generally solid so we are doubtful DNS has anything to do with it but we should align ourselves with best practices.
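Before remounting everything by IP, one rough way to put numbers on the DNS question is to time repeated lookups of the filer shortname. This is only a sketch under assumptions: the name "vfiler01" below is a placeholder, the resolver used on an admin workstation may not behave exactly like the one on the ESXi hosts, and a clean result here does not by itself clear DNS.

```python
import socket
import time


def time_lookups(hostname, attempts=20, resolver=socket.gethostbyname,
                 clock=time.perf_counter):
    """Resolve hostname repeatedly; return a list of (address_or_None, seconds)."""
    results = []
    for _ in range(attempts):
        start = clock()
        try:
            address = resolver(hostname)
        except OSError:  # socket.gaierror is a subclass of OSError
            address = None
        results.append((address, clock() - start))
    return results


def summarize(results):
    """Count failed lookups and report the slowest single lookup in seconds."""
    failures = sum(1 for address, _ in results if address is None)
    slowest = max((elapsed for _, elapsed in results), default=0.0)
    return {"attempts": len(results), "failures": failures, "slowest_s": slowest}


# Example call (hypothetical shortname, not taken from this environment):
#   print(summarize(time_lookups("vfiler01")))
```

Running this in a loop during a storage vmotion window would show whether lookups ever fail or stall at the same time as a disconnect.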
We serve all of our NFS through vfilers. Some of our vfilers host 5 NFS datastores from a single IP address. I mention this because I have come across documentation recommending a 1:1 ratio of datastores to IP addresses.
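To see how far the current layout is from a 1:1 ratio, the mounts can be grouped by server address. A minimal sketch, assuming the (datastore, address) pairs have been collected by hand, for example from `esxcli storage nfs list` on a host; the datastore names and addresses below are made up for illustration.

```python
from collections import defaultdict


def datastores_per_address(mounts):
    """Map each NFS server address to the datastores mounted from it."""
    grouped = defaultdict(list)
    for datastore, address in mounts:
        grouped[address].append(datastore)
    return dict(grouped)


def shared_addresses(mounts):
    """Return only the addresses that serve more than one datastore."""
    return {
        address: sorted(names)
        for address, names in datastores_per_address(mounts).items()
        if len(names) > 1
    }


# Hypothetical sample data, not from the environment described here.
sample_mounts = [
    ("ds_prod_01", "192.0.2.10"),
    ("ds_prod_02", "192.0.2.10"),
    ("ds_test_01", "192.0.2.20"),
]
print(shared_addresses(sample_mounts))
```

Any address listed in the result is a candidate for splitting across additional vfiler IPs if the 1:1 recommendation is followed.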
vmkernel.log just shows that the connection to the NFS server was lost. It recovers within 10 seconds. We have 11 nodes in this particular ESX cluster.
Not all 11 ESXi nodes lose connectivity to the datastore at the same time. I've seen it affect just one ESXi node's connectivity to a single datastore. I've also seen it affect more than one ESXi node and multiple datastores on the same filer.
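Since the pattern varies between one host and several, it may help to line up the disconnect times from every host and cluster them into incidents. The sketch below assumes the events have already been pulled out of each host's vmkernel.log as (timestamp, host, datastore) tuples; the extraction step, the host names and the datastore names are all hypothetical. The 10-second window mirrors the recovery time mentioned above and is only a starting guess.

```python
from datetime import datetime, timedelta


def group_incidents(events, window_seconds=10):
    """Cluster events that occur within window_seconds of the previous event."""
    window = timedelta(seconds=window_seconds)
    incidents = []
    for event in sorted(events, key=lambda e: e[0]):
        if incidents and event[0] - incidents[-1][-1][0] <= window:
            incidents[-1].append(event)
        else:
            incidents.append([event])
    return incidents


def describe(incident):
    """Summarize one incident: start time, affected hosts, affected datastores."""
    return {
        "start": incident[0][0],
        "hosts": sorted({host for _, host, _ in incident}),
        "datastores": sorted({datastore for _, _, datastore in incident}),
    }


# Hypothetical sample events for illustration only.
base = datetime(2014, 3, 17, 20, 0, 0)
sample_events = [
    (base, "esx01", "ds_prod_01"),
    (base + timedelta(seconds=4), "esx02", "ds_prod_01"),
    (base + timedelta(minutes=30), "esx03", "ds_test_01"),
]
for incident in group_incidents(sample_events):
    print(describe(incident))
```

Incidents that span several hosts and datastores on one filer point toward the storage or network side, while single-host incidents point more toward that host.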
Until recently, we had only observed it during Storage vMotions. We recently discovered it happening during vMotion activity managed by DRS after a node was brought out of maintenance mode. As I said before, it is generally a rare occurrence, so it is difficult to trigger on our own.
Thanks in advance for any insight/experiences.
Phil
I'll also mention that I received a response from a gentleman at NetApp who pointed out the following KB article, which recommends reducing the NFS queue depth.
http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd...
We noticed this KB article but have yet to try it. We are considering other options at the moment because the article says this issue is fixed in the version of ONTAP (8.1.2P4) we are running. However, if nothing else pans out, we will give it a shot.
Another note - this is also a highly shared environment in which we service FCP, CIFS and NFS clients from the same filers (and vfilers) we service the NFS datastores from. We have yet to show evidence of high utilization from the other clients on the same array contributing to the problem but it is on the radar.
Also worth noting, we are running VSC 4.2.1. It reports all of the ESX hosts to be in compliance with the recommended settings.
On Mon, Mar 17, 2014 at 8:30 PM, Philbert Rupkins <philbertrupkins@gmail.com> wrote: