Dear Todd,
I have seen something similar (though not on a cluster). We had a very high load on an F740 running 5.3.7R2. After it had run for a long time at high CPU (95%), at some point the CPU went to 100% but the filer did almost nothing (fewer than 100 CIFS ops/s where there should have been more than 1500 CIFS ops/s). The result was very, very slow response. After we terminated CIFS, the filer stayed busy for a while, but after a few minutes the CPU dropped to zero. In some cases it helped to restart CIFS. One time, we had to reboot the system.
Our first diagnosis was that we were using only one 100 Mb NIC. So we activated our gigabit NIC, and that was much better. But the problem was solved completely when we upgraded to 6.1R1. The CPU load was lower than under 5.x (for the same output), and when the CPU load was high, the filer kept running.
I hope this helps,
Best regards,
Reinoud
UZ Leuven, Belgium

----- Original Message -----
From: "Todd C. Merrill" tmerrill@mathworks.com
To: toasters@mathworks.com
Sent: Wednesday, July 18, 2001 12:19 AM
Subject: pegged filer....write bound?
Scenario:
clustered F840's, ONTAP 5.3.7R2
one is a busy filer of mostly home directories
other is fine, no problems, much less loaded

middle of the day, response takes a nosedive on the busy one
CPU is pegged at 100%
very little NFS, CIFS, or network traffic
no backups or restores going on
no snapshots in progress (that we can tell)
As a user, response is *extremely* slow; sometimes a stat of a known populated directory returns empty. Effectively the filer is not serving data. (Worse, in my opinion, is that it returns the *wrong* data.)
We turn off NFS to see if that is the culprit. Still CPU is pegged. We terminate CIFS to see if that is. CPU drops down, but not to zero. We have only NFS, CIFS, and cluster licensed (i.e., no HTTP).
Here's what we saw:
home> sysstat 2
 CPU   NFS  CIFS  HTTP    Net kB/s     Disk kB/s   Tape kB/s Cache
                          in   out   read  write  read write   age
 10%     0     0     0    10     6  12074      0     0     0    24
 47%     0     0     0     8     5   6369      7     0     0    24
 47%     0     0     0    13     7  16479  20258     0     0    24
 33%     0     0     0     4     3  11858  13180     0     0    24
 34%     0     0     0     8     4  12698  14703     0     0    24
  9%     0     0     0     9     5  11392      8     0     0    24
  9%     0     0     0     7     4  11097      0     0     0    24
 58%     0     0     0     9     3   6453   2415     0     0    24
 39%     0     0     0     7     2  15218  17034     0     0    24
 39%     0     0     0     5     2  13560  16924     0     0    24
 31%     0     0     0     9     5  11633  11593     0     0    24
  8%     0     0     0     8     4  10634      8     0     0    24
 10%     0     0     0     9     5  12992      0     0     0    23
 62%     0     0     0     8     3   9828   8030     0     0    23
 44%     0     0     0     9     4  17156  19024     0     0    23
 37%     0     0     0     6     3  15229  17994     0     0    23
  9%     0     0     0     6     2  12204      8     0     0    23
 10%     0     0     0    11     5  13574      8     0     0    23
 66%     0     0     0     5     3   9354  11421     0     0    23
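For anyone who wants to pick the write bursts out of a longer capture, here is a minimal Python sketch. It assumes each sample has been joined onto one line with eleven whitespace-separated fields in the column order shown above (CPU, NFS, CIFS, HTTP, net in/out, disk read/write, tape read/write, cache age). The names `SysstatRow`, `parse_row` and `is_idle_but_writing`, and the 1000 kB/s threshold, are illustrative assumptions and not anything provided by ONTAP.

```python
from typing import NamedTuple


class SysstatRow(NamedTuple):
    cpu_pct: int
    nfs: int
    cifs: int
    http: int
    net_in: int
    net_out: int
    disk_read: int
    disk_write: int
    tape_read: int
    tape_write: int
    cache_age: int


def parse_row(line: str) -> SysstatRow:
    """Parse one sysstat sample line (assumed layout: 11 fields)."""
    fields = line.split()
    if len(fields) != 11:
        raise ValueError(f"expected 11 fields, got {len(fields)}: {line!r}")
    cpu = int(fields[0].rstrip("%"))  # "47%" -> 47
    return SysstatRow(cpu, *(int(f) for f in fields[1:]))


def is_idle_but_writing(row: SysstatRow, write_threshold_kb: int = 1000) -> bool:
    """True when no client ops are served yet disk writes are heavy.

    The threshold is an arbitrary illustrative value.
    """
    no_client_ops = row.nfs == 0 and row.cifs == 0 and row.http == 0
    return no_client_ops and row.disk_write >= write_threshold_kb


# First four samples from the capture above.
SAMPLE = """\
10% 0 0 0 10 6 12074 0 0 0 24
47% 0 0 0 8 5 6369 7 0 0 24
47% 0 0 0 13 7 16479 20258 0 0 24
33% 0 0 0 4 3 11858 13180 0 0 24
"""

rows = [parse_row(line) for line in SAMPLE.splitlines() if line.strip()]
bursts = [r for r in rows if is_idle_but_writing(r)]
print(f"{len(bursts)} of {len(rows)} samples show heavy writes with no client ops")
```

Run over the whole capture, this would make the repeating pattern (a few quiet samples, then several samples of heavy disk writes) easier to count than by eye.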
Pardon my French, but WTF is this filer doing? It looks and smells like snapshot behaviour, but we weren't even near to the time it should be doing a snapshot via the schedule. No external scripts would initiate one, either.
Our solution was, unfortunately, a reboot.
The second time this happened, we grabbed some output from `wafl_susp` to check on the consistency points, since we are suspecting this poor filer is write-bound (insufficient NVRAM cache--half is "lost" to the partner for clustering). The counts of all the cp_* parameters show *less* than the minimum number of consistency points expected (uptime times 6 per minute, i.e., a minimum of once every 10 seconds). And, of course, lots of cp_from_log_full and cp_from_cp.
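The "minimum expected" figure is plain arithmetic, sketched below in Python under the assumption stated above: at least one timer-driven consistency point every 10 seconds, i.e. 6 per minute. The function names and the sample uptime and observed count are hypothetical; they are not taken from the filer or from `wafl_susp` output.

```python
def min_expected_cps(uptime_seconds: float, cp_interval_seconds: float = 10.0) -> int:
    """Minimum number of consistency points expected over the uptime,
    assuming at least one CP per interval (10 s, i.e. 6 per minute)."""
    if uptime_seconds < 0 or cp_interval_seconds <= 0:
        raise ValueError("uptime must be >= 0 and interval must be > 0")
    return int(uptime_seconds // cp_interval_seconds)


def cp_shortfall(observed_cps: int, uptime_seconds: float) -> int:
    """How many CPs short of the expected minimum the observed total is."""
    return max(0, min_expected_cps(uptime_seconds) - observed_cps)


# Hypothetical numbers for illustration only.
uptime = 3 * 24 * 3600          # three days of uptime, in seconds
observed = 20000                # made-up sum of the cp_* counters

print(min_expected_cps(uptime))        # 25920
print(cp_shortfall(observed, uptime))  # 5920
```

A nonzero shortfall means the filer completed fewer consistency points than the 10-second timer alone should have produced, which is what we observed.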
Anybody seen anything like this? Any idea what is going on? Are we just beating the crap out of this thing, and it gives up the ghost by pretending it is busy to avoid doing anything else?
(We've already got another clustered pair of F840's in house, in testing, soon to be deployed. Not soon enough. Figures.)
Until next time...
The Mathworks, Inc.                         508-647-7000 x7792
3 Apple Hill Drive, Natick, MA 01760-2098   508-647-7001 FAX
tmerrill@mathworks.com                      http://www.mathworks.com