Scenario:
clustered F840's, ONTAP 5.3.7R2
one is a busy filer of mostly home directories
other is fine, no problems, much less loaded
middle of the day, response takes a nosedive on the busy one
CPU is pegged at 100%
very little NFS, CIFS, or network traffic
no backups or restores going on
no snapshots in progress (that we can tell)
As a user, response is *extremely* slow; sometimes a stat of a known populated directory comes back empty. Effectively the filer is not serving data. (Worse, in my opinion: it is returning the *wrong* data.)
We turn off NFS to see if that is the culprit. Still CPU is pegged. We terminate CIFS to see if that is. CPU drops down, but not to zero. We have only NFS, CIFS, and cluster licensed (i.e., no HTTP).
Here's what we saw:
home> sysstat 2
 CPU    NFS   CIFS   HTTP    Net kB/s     Disk kB/s     Tape kB/s  Cache
                             in    out    read   write  read write   age
 10%      0      0      0    10      6   12074       0     0     0    24
 47%      0      0      0     8      5    6369       7     0     0    24
 47%      0      0      0    13      7   16479   20258     0     0    24
 33%      0      0      0     4      3   11858   13180     0     0    24
 34%      0      0      0     8      4   12698   14703     0     0    24
  9%      0      0      0     9      5   11392       8     0     0    24
  9%      0      0      0     7      4   11097       0     0     0    24
 58%      0      0      0     9      3    6453    2415     0     0    24
 39%      0      0      0     7      2   15218   17034     0     0    24
 39%      0      0      0     5      2   13560   16924     0     0    24
 31%      0      0      0     9      5   11633   11593     0     0    24
  8%      0      0      0     8      4   10634       8     0     0    24
 10%      0      0      0     9      5   12992       0     0     0    23
 62%      0      0      0     8      3    9828    8030     0     0    23
 44%      0      0      0     9      4   17156   19024     0     0    23
 37%      0      0      0     6      3   15229   17994     0     0    23
  9%      0      0      0     6      2   12204       8     0     0    23
 10%      0      0      0    11      5   13574       8     0     0    23
 66%      0      0      0     5      3    9354   11421     0     0    23
Pardon my French, but WTF is this filer doing? It looks and smells like snapshot behaviour, but we weren't anywhere near the time a scheduled snapshot should run. No external scripts would have initiated one, either.
Our solution was, unfortunately, a reboot.
The second time this happened, we grabbed some output from `wafl_susp` to check on the consistency points, since we suspect this poor filer is write-bound (insufficient NVRAM--half of it is "lost" to the partner for clustering). The counts of all the cp_* parameters add up to *fewer* than the minimum number of consistency points expected (uptime in minutes times 6, since a CP should occur at least once every 10 seconds). And, of course, lots of cp_from_log_full and cp_from_cp.
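To make that arithmetic concrete, here is roughly the back-of-the-envelope check we did, as a quick Python sketch. The uptime and cp_* values below are placeholders for illustration only, not our real numbers; the real counters come out of `wafl_susp`:

    # WAFL should take a consistency point (CP) at least once every 10 seconds,
    # i.e. at least 6 per minute of uptime. All values below are placeholders.
    uptime_minutes = 21 * 24 * 60           # e.g. 21 days of uptime (made up)
    expected_min_cps = uptime_minutes * 6   # at least one CP per 10 seconds

    # hypothetical cp_* counters as read off the wafl_susp output
    cp_counts = {
        "cp_from_timer":    95000,
        "cp_from_log_full": 45000,
        "cp_from_cp":       30000,
    }

    total_cps = sum(cp_counts.values())
    print("expected at least", expected_min_cps, "CPs; observed", total_cps)
    if total_cps < expected_min_cps:
        # Fewer CPs than the 10-second rule allows means individual CPs are
        # taking longer than 10 seconds, i.e. writes aren't flushing fast enough.
        print("CPs are running long -> looks write-bound")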
Anybody seen anything like this? Any idea what is going on? Are we just beating the crap out of this thing, and it gives up the ghost by pretending it is busy to avoid doing anything else?
(We've already got another clustered pair of F840's in house, in testing, soon to be deployed. Not soon enough. Figures.)
Until next time...
The Mathworks, Inc.                          508-647-7000 x7792
3 Apple Hill Drive, Natick, MA 01760-2098    508-647-7001 FAX
tmerrill@mathworks.com                       http://www.mathworks.com
---
Could snapmirror have started a transfer to a local volume that starts midday? The CIFS terminate clearing things up a bit makes me lean away from snapmirror.
mikef
---
On Tue, 17 Jul 2001, Mike Federwisch wrote:
Could snapmirror have started a transfer to a local volume that starts midday? The CIFS terminate clearing things up a bit makes me lean away from snapmirror.
I wrote:
We have only NFS, CIFS, and cluster licensed (i.e., no HTTP).
Whoops. No snapmirror, either. Just NFS, CIFS, and cluster licensed.
Until next time...
---
Dear Todd,
I have seen something similar (but not on a cluster). We had a very high load on an F740 with 5.3.7R2. After it had been running for a long time at high CPU (95%), at some point the CPU hit 100% but the filer wasn't actually doing anything (fewer than 100 CIFS ops/s where there should have been more than 1500 CIFS ops/s). Very, very slow response was the result. After terminating CIFS it stayed busy for a while, but after a few minutes the CPU dropped to zero. In some cases it helps to restart CIFS. One time we had to reboot the system.
The first diagnosis was that we were only using one 100Mb NIC, so we activated our gigabit NIC, and that was much better. But the problem was solved completely when we upgraded to 6.1R1. The CPU load was lower than under 5.x (for the same throughput), and when there was a high CPU load the filer kept running.
I hope this helps you,
Best regards,
Reinoud
UZ Leuven
Belgium
----- Original Message -----
From: "Todd C. Merrill" tmerrill@mathworks.com
To: toasters@mathworks.com
Sent: Wednesday, July 18, 2001 12:19 AM
Subject: pegged filer....write bound?
---
On Tue, 17 Jul 2001, Todd C. Merrill wrote:
We turn off NFS to see if that is the culprit. Still CPU is pegged. We terminate CIFS to see if that is. CPU drops down, but not to zero. We have only NFS, CIFS, and cluster licensed (i.e., no HTTP).
Here's what we saw:
home> sysstat 2
 CPU    NFS   CIFS   HTTP    Net kB/s     Disk kB/s     Tape kB/s  Cache
                             in    out    read   write  read write   age
 10%      0      0      0    10      6   12074       0     0     0    24
 47%      0      0      0     8      5    6369       7     0     0    24
 47%      0      0      0    13      7   16479   20258     0     0    24
 33%      0      0      0     4      3   11858   13180     0     0    24
 34%      0      0      0     8      4   12698   14703     0     0    24
As a follow-up to this, we never found out why the filer stayed mental after all services were turned off, but we did find what made it *start* to go mental. A bad entry in our WINS database caused the filer to try to authenticate against either non-existent or topologically distant domain controllers. The CIFSAuthen process on the filer was taking up most of the CPU resources (50-60%), apparently in a wait state, waiting either for a timeout or for a slow response from a distant DC. Removing the bad/distant entries has apparently left the filer with a closer, faster set of DC's to choose from when authenticating.
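To put rough numbers on the "waiting for a timeout" theory, here is a purely illustrative sketch; the per-attempt timeout, number of stale entries, and logon rate are guesses rather than anything we measured, and we don't know how the filer actually schedules this work internally:

    # Illustration only: how stale WINS/DC entries inflate authentication latency.
    # Every figure here is a made-up placeholder, not a measurement from our filer.
    dc_timeout_s   = 30   # assumed per-attempt timeout against an unreachable DC
    stale_entries  = 3    # assumed dead/distant DCs tried before reaching a good one
    logons_per_min = 20   # assumed CIFS authentication rate during the day

    stall_per_logon_s = dc_timeout_s * stale_entries
    print("each authentication stalls ~", stall_per_logon_s, "s before a good DC answers")

    # If those stalls serialize behind a single authenticator (an assumption),
    # the backlog grows much faster than it drains:
    work_per_minute_s = logons_per_min * stall_per_logon_s
    print("~", work_per_minute_s / 60.0, "minutes of queued auth work per minute of logons")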
If you'll recall, this problem was with ONTAP 5.3.7. We have since found out that in ONTAP 6.x, there is the command:
cifs prefdc
to give the filer a hard-coded list of (fast and local) DC's against which to authenticate, similar to the `options nis.servers` for NIS servers. We are now accelerating our upgrade plan to take advantage of this feature. ;)
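For anyone else planning the same move, the 6.x syntax is along these lines (the domain name and addresses here are made up for illustration; check the 6.x man pages for the exact form):

    cifs prefdc add ENGDOMAIN 10.1.1.21 10.1.1.22
    cifs prefdc print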
Thanks to all who replied publicly and privately with suggestions. Thanks to the NetApp admin class 202 for all kinds of cool rc_toggle_basic goodies. And thanks to the NetApp tech support folks and our local SE for the interpretation of all of this data. It was, in the best sense of the word, a team effort: nailing this problem down despite the deceptive symptoms.
Until next time...
---