Hi,
We have three Debian systems (two with 2.6.x kernels and one with 2.4) using a NetApp 720 (Data ONTAP 6.5.3) as an NFS server.
On two occasions in the last two weeks, when one of the Linux hosts tried to mount a file system from the netapp, we got the error "Host nono not responding" (nono being the filer name).
From that moment, no action could be performed from any of the Linux hosts on the netapp. We see no errors in the log files of either the Linux hosts or the netapp. Other functions (CIFS, NIS and DNS lookups) continue with no issues.
Attempting to stop and start the NFS service on the netapp did not have any effect, nor did rebooting the clients.
The only workaround was rebooting the filer, after which all NFS operations resumed without a problem.
I have started a trace on NFS, but have not caught the same hangup yet.
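(For reference, the trace is just a packet capture of everything between a client and the filer, along these lines; the interface name is only an example:

    # capture all traffic to/from the filer (NFS, mount, portmap) for later analysis
    tcpdump -i eth0 -s 0 -w nfs-hang.pcap host nono

The idea is to leave it running until the hangup recurs.)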
Has anyone encountered a similar situation, or can anyone advise on where and what to check?
Thanks
Gil
Are you sure you are running 6.5.3? We had a very similar issue when we upgraded to 6.5.2 last year. The bug was identified as 139351 and fixed in 6.5.2P6 (http://now.netapp.com/NOW/cgi-bin/bol?Type=Detail&Display=139351). Hope it hasn't reappeared in 6.5.3! -G
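P.S. A quick way to confirm the exact release (P level included) is the version command on the filer console:

    nono> version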
On Thu, 27 Jan 2005 18:25:04 +0200, Gil Freund gilf@sysnet.co.il wrote:
[quoted original message snipped]
Sto Rage© wrote:
Are you sure you are running 6.5.3? We had a very similar issue when we upgraded to 6.5.2 last year. The bug was identified as 139351 and fixed in 6.5.2P6 (http://now.netapp.com/NOW/cgi-bin/bol?Type=Detail&Display=139351). Hope it hasn't reappeared in 6.5.3!
This is not (as far as I can tell) a resolver issue. Names resolve correctly in both directions (clients to netapp and back).
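(Checked with plain forward and reverse lookups from the clients; the address below is a placeholder for the filer's real IP:

    host nono          # forward: filer name -> address
    host 192.0.2.10    # reverse: address -> filer name

Both directions return the expected answers.)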
On Thu, Jan 27, 2005 at 06:25:04PM +0200, Gil Freund wrote:
[quoted original message snipped]
Check your autosupport messages for nfsstat -d output from that period. If you aren't doing autosupport, check your messages file to see if perhaps someone ran that command when this was happening.
Look for the following lines:

    (cumulative) active=0/491 req mbufs=0
    nfs msgs counts: tot=491, unallocated=427, free=64, used=0,
    VM cb heard=0, VM cb done=0
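If the filer's root volume is mounted on an admin host, something along these lines will pull those counters out of the saved logs (the mount point is just an example; /etc/messages on the root volume is where the console log lands):

    grep 'nfs msgs counts' /mnt/nono-root/etc/messages*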
If your used == tot, and that never changes, then that means you have run out of internal messages to transfer data between NFS and WAFL.
In that case, we don't take new NFS requests and for all intents and purposes, the filer is dead with respect to NFS.
Next time it happens, please consider inducing a core dump and sharing it with NetApp. There are bugs in 6.5.2R1 which will cause you to run out of messages, but those are all fixed in the 6.5.3 code lines.
If you are seeing this type of resource exhaustion, I'd like the data to be able to track this down and fix the issue.
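(From memory, inducing the dump is a one-liner on the filer console; it panics the box and leaves a core in /etc/crash on the root volume, so schedule it accordingly, and double-check the exact procedure with support for your release:

    nono> halt -d
)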
Tom Haynes wrote:
[quoted original message snipped]
Check your autosupport messages for nfsstat -d output from that period. If you aren't doing autosupport, check your messages file to see if perhaps someone ran that command when this was happening.
Look for the following lines:

    (cumulative) active=0/491 req mbufs=0
    nfs msgs counts: tot=491, unallocated=427, free=64, used=0,
    VM cb heard=0, VM cb done=0
I've looked at the autosupport report:

    (cumulative) active=0/64 req mbufs=0
    tcp no msg dropped=0, no msg unallocated=0
    tcp input flowcontrol receive=0, xmit=0
    no delegation=0, read delegation=0
    nfs msgs counts: tot=64, unallocated=0, free=64, used=0
Used is 0, which is strange, as NFS clients were active until the hangup.
Full NFS section:
    nfs cache size=4096, hash size=8192
    num msg=0, too many mbufs=0, rpcErr=0, svrErr=0
    no msg queued=0, no msg paused=0, no msg dropped=0, no msg unallocated=0
    no msg unqueued=0, no msg discarded=0
    no msg dropped from vol offline=0, no deferred msg processed=0
    sbfull queued=0, sbfull unqueued=0, sbfull discarded=0
    no mbuf queued=0, no mbuf dropped=0
    no mbuf unqueued=0, no mbuf discarded=0
    (cumulative) active=0/64 req mbufs=0
    tcp no msg dropped=0, no msg unallocated=0
    tcp input flowcontrol receive=0, xmit=0
    no delegation=0, read delegation=0
    nfs msgs counts: tot=64, unallocated=0, free=64, used=0
    nfs reply cache counts: tot=4096, unallocated=4032, free=64, used=0
    v4 reply cache opinfo: tot=1184, unallocated=1120, free=64, normal=0, rcache=0
    v4 reply cache wafl msgs: tot=444, unallocated=380, free=64, normal=0, rcache=0
    v2 mount (requested, granted, denied) = (0, 0, 0)
    v2 unmount (requested, granted, denied) = (0, 0, 0)
    v2 unmount all (requested, granted, denied) = (0, 0, 0)
    v3 mount (requested, granted, denied) = (3, 3, 0)
    v3 unmount (requested, granted, denied) = (0, 0, 0)
    v3 unmount all (requested, granted, denied) = (0, 0, 0)
    access cache (hits, misses) = (7, 6)
    access cache lookup requests (curr, total, max) = (0, 2, 1)
    access cache (loaded, max) = (4, 4)
    access cache thread signals (scrub, fill) = (1, 2)
    access cache flushes during (scrub, fill, flush) = (0, 0, 0)
    access cache harvests during (scrub, fill, flush) = (0, 0, 0)
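(Next time it hangs I'll also check from a client whether the RPC side still answers at all, e.g.:

    rpcinfo -p nono      # is the portmapper still listing nfs and mountd?
    showmount -e nono    # does mountd still answer an export-list query?
)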
Hello,
I had some similar problems because of a bad configuration of a tinydns cache.
Can you resolve the reverse for your IP? I know this question may seem stupid, but if you use tinydns as a network cache and you don't declare your networks for reverse resolving, you will see very strange NFS problems, exactly like what you are seeing now.
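For example, with placeholder names and addresses throughout: in the tinydns data file, a '=' line publishes both the A record and the matching PTR record for a host:

    =nono.example.com:192.0.2.10:86400

and if dnscache is the local cache, each internal reverse zone also has to be delegated to the internal tinydns server, roughly:

    echo 192.0.2.53 > /service/dnscache/root/servers/2.0.192.in-addr.arpa
    svc -t /service/dnscache    # restart dnscache so it picks up the new zone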
Cb
On Thu, Jan 27, 2005 at 06:24:42PM +0200, Gil Freund wrote:
[quoted original message snipped]