On 11/20/98 17:57:42 you wrote:
>
>
>I know this isn't a support channel, but Netapp's not been useful at all,
>so what the heck:
>
>I have an F630 that's run well enough for five months. On Sunday we
>upgraded our filer to 5.1.2 from 5.0.1d4 at the suggestion of our Netapp
>SE.
>
>Since then, it's come grinding to a halt each day. NFS access starts
>getting really slow, and eventually completely unresponsive. Network
>activity is not high. sysstat shows that cpu, disk, and net are all
>getting lower and lower as clients start dropping their mounts. Things do
>not improve until eventually we reboot it. The telnet interface is fine
>and responsive all the way through.
While many things could cause this, it sounds to me like a classic
resource contention issue: the filer accumulates more and more
operations waiting on a blocked process (such as a RAID stripe being
written out) until eventually it can't do anything more. The same sort
of thing can also happen at a higher level, in the networking code.
There have certainly been bugs in the past that exhibited the symptoms
you describe.
>Netapp's support has asked for crash dumps, etc., but nothing's helped,
>and we're heading into a weekend. They also recommended *not* downgrading
>back to 5.0.1d4. Apparently 5.1.2 has bug fixes that we require(news to
>me).
>
>Has *anyone* seen this? Suggestions? Thoughts?
I'm surprised the crash dumps haven't helped. My first recommendation
would be to downgrade. The "apparently" doesn't cut it; either they can
tell you specifically what was fixed and you can assess the risk of
downgrading, or you should just do it anyway. But if doing so would be
a big problem according to Netapp, then you need to look for other
solutions.

If that is the case, I assume you've already looked at the obvious
culprits, like making sure no dump, restore, etc. is running when this
happens. After eliminating those, my best advice would be to change the
client mounts to use a different NFS transport or version: UDP instead
of TCP, and/or v2 instead of v3. Doing so may keep you from 'tickling'
the Netapp bug, or at least make it less likely. Also, look at the
different interfaces you have; switching from ATM to Gigabit, or from
Gigabit to 100tx, may alleviate the problem. (If you can check mounts
through each interface when the "slowdown" occurs, you may be able to
confirm a particular interface as the culprit.)
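
For what it's worth, here is a rough Python sketch of both suggestions,
to be run on a client. It lists the version/transport combinations to
try and times a stat() through each mount point. Everything specific in
it is an assumption on my part: the "vers=" and "proto=" option
spellings are the common ones but vary by client OS (check your mount
man page), and the /mnt/filer-* paths are made-up names for the same
filer mounted through different interfaces. A stat() may also be
answered from the client's attribute cache, so treat the timings as a
hint rather than proof.

```python
import itertools
import os
import threading
import time


def mount_option_matrix(versions=(2, 3), protocols=("udp", "tcp")):
    """Return every vers/proto mount option string worth trying."""
    # Option spellings are an assumption; some clients name them differently.
    return [f"vers={v},proto={p}"
            for v, p in itertools.product(versions, protocols)]


def time_stat(path, timeout=5.0):
    """Time one os.stat() on path.

    Returns elapsed seconds, or None if the call fails or does not
    finish within `timeout` seconds (a hung hard mount blocks forever,
    so the stat runs in a daemon thread that we simply abandon).
    """
    result = {}

    def worker():
        start = time.monotonic()
        try:
            os.stat(path)
            result["elapsed"] = time.monotonic() - start
        except OSError:
            result["elapsed"] = None

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    thread.join(timeout)
    return result.get("elapsed")


if __name__ == "__main__":
    for opts in mount_option_matrix():
        print("try mounting with:", opts)
    # Hypothetical mount points, one per filer interface.
    for mount_point in ("/mnt/filer-atm", "/mnt/filer-100tx"):
        print(mount_point, time_stat(mount_point))
```

If one mount point keeps coming back None (or very slow) during a
"slowdown" while the others answer promptly, that interface is the one
to look at first.
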
Bruce