On 11/20/98 17:57:42 you wrote:
> I know this isn't a support channel, but Netapp's not been useful at all, so what the heck:
> I have an F630 that's run well enough for five months. On Sunday we upgraded our filer to 5.1.2 from 5.0.1d4 at the suggestion of our Netapp SE.
> Since then, it's come grinding to a halt each day. NFS access starts getting really slow, and eventually becomes completely unresponsive. Network activity is not high. sysstat shows CPU, disk, and net activity all getting lower and lower as clients start dropping their mounts. Things do not improve until eventually we reboot it. The telnet interface is fine and responsive all the way through.
While many things could cause this, it sounds to me like a classic resource contention issue: the filer accumulates more and more operations that are waiting on a blocked process (like a RAID stripe being written out), until eventually it can't do anything more. This sort of thing can also happen at a higher level, in the networking code. There have certainly been bugs in the past that exhibited the symptoms you describe.
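If you want hard numbers on that when it happens again, it's worth logging the filer's own statistics to a file so you can see the curve after the fact instead of just watching it scroll by. A rough sketch, assuming rsh administrative access to the filer is enabled; "toaster" and the log path are just placeholders:

    #!/bin/sh
    # Keep a timestamped record of the filer's statistics as it degrades.
    LOG=/var/tmp/toaster-stats.log
    while true; do
        date                  >> $LOG
        rsh toaster uptime    >> $LOG      # cumulative ops and uptime
        rsh toaster nfsstat   >> $LOG      # per-operation NFS counters
        rsh toaster sysstat 5 >> $LOG &    # one summary line every 5 seconds...
        PID=$!
        sleep 30
        kill $PID                          # ...capture about 30 seconds' worth
        sleep 270                          # then sample again in ~5 minutes
    done

Comparing a healthy sample against one taken mid-slowdown should make it obvious whether requests are piling up behind something or simply not arriving.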
> Netapp's support has asked for crash dumps, etc., but nothing's helped, and we're heading into a weekend. They also recommended *not* downgrading back to 5.0.1d4. Apparently 5.1.2 has bug fixes that we require (news to me).
> Has *anyone* seen this? Suggestions? Thoughts?
I'm surprised the crash dumps haven't helped. My first recommendation would be to downgrade. The "apparently" doesn't cut it; either they can tell you specifically what was fixed and you can assess the risk of downgrading, or you should just do it anyway. But if doing so would be a big problem according to Netapp, then you need to look for other solutions.
If that is the case, I assume you've already looked at the obvious culprits like making sure you're not doing a dump, restore, etc. when this happens. After eliminating that, my best advice would be to change the client mounts to use different NFS protocols... UDP instead of TCP, and/or v2 instead of v3. Doing so may keep you from 'tickling' the Netapp bug, or at least make it less likely. Also, check out the different interfaces you have... perhaps switching from ATM to Gigabit, or from Gigabit to 100tx, may alleviate the problem. (If you can check mounts through each interface when the "slowdown" occurs, you may be able to confirm a particular interface as the culprit.)
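To be concrete about the mount side of that: on most clients the version and transport are just mount options, and you can pin a mount to a particular filer interface by mounting from that interface's address. A sketch using Solaris-style options; the exact spelling varies by client OS, and "toaster", the per-interface hostnames, and the paths are placeholders:

    # Force NFSv2 over UDP (Linux spells these nfsvers=2,udp):
    mount -F nfs -o vers=2,proto=udp toaster:/export/home /mnt/home

    # Give each filer interface its own hostname (or use the raw IP) and
    # mount through each one explicitly, so you can tell which interface's
    # mounts go bad first when the slowdown hits:
    mount -F nfs -o vers=3,proto=tcp toaster-fddi0:/export/home /mnt/home-fddi0
    mount -F nfs -o vers=3,proto=tcp toaster-100tx:/export/home /mnt/home-100tx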
Bruce
On Fri, 20 Nov 1998 sirbruce@ix.netcom.com wrote:
> I'm surprised the crash dumps haven't helped. My first recommendation would be to downgrade. The "apparently" doesn't cut it; either they can tell you specifically what was fixed and you can assess the risk of downgrading, or you should just do it anyway. But if doing so would be a big problem according to Netapp, then you need to look for other solutions.
Aside from the nasty bugs in ndmp, which I hope have been fixed in 5.1.2, I have no reason other than their word not to downgrade. If I hear nothing tonight, I will be downgrading anyway.
I'm currently waiting for someone at Netapp who is apparently looking over my crash dump right now. He said five minutes... it's been twenty... I can only be so patient...
> If that is the case, I assume you've already looked at the obvious culprits like making sure you're not doing a dump, restore, etc. when
Checked, and no.
> this happens. After eliminating that, my best advice would be to change the client mounts to use different NFS protocols... UDP instead of TCP, and/or v2 instead of v3. Doing so may keep you from 'tickling' the Netapp bug, or at least make it less likely. Also, check out the different
Well, we have mostly v2 clients, but the v3 TCP and the v3 UDP clients have all seen exactly the same behaviour.
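For what it's worth, the quick way to confirm what a client actually negotiated (rather than what you asked for) is client-side nfsstat; on Solaris and most other NFS clients it shows the version and transport per mount:

    # On each client, list the NFS mounts and the options they negotiated;
    # look for vers= and proto= in the flags of each entry.
    nfsstat -m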
> interfaces you have... perhaps switching from ATM to Gigabit, or from Gigabit to 100tx, may alleviate the problem. (If you can check mounts through each interface when the "slowdown" occurs, you may be able to confirm a particular interface as the culprit.)
Checked, and both fddi interfaces (we have two) and the 100fdx are all showing the same issue.
I've really beaten on this, and I'm rather positive it's a bug.