Re: pegged filer....write bound?

18 Jul 2001


      Dear Todd,
I have seen something similar (but not on a cluster). We had a very high load
on an F740 running 5.3.7R2. After it had run for a long time at high CPU (95%),
at some point the CPU hit 100% but the filer did almost no work (fewer than
100 CIFS/s where there should have been more than 1500 CIFS/s). The result was
very, very slow response. After we terminated CIFS, the filer stayed busy for a
while, but after a few minutes the CPU dropped to zero. In some cases,
restarting CIFS helped. One time, we had to reboot the system.
Our first diagnosis was that we were using only one 100 Mb NIC. So we activated
our Gb NIC, and that was much better. But the problem was solved completely
when we upgraded to 6.1R1. CPU load was lower than under 5.x (for the same
output), and when CPU load was high, the filer kept running.
I hope this helps you,
Best regards,
Reinoud
UZ Leuven
Belgium
----- Original Message -----
From: "Todd C. Merrill" tmerrill@mathworks.com
To: toasters@mathworks.com
Sent: Wednesday, July 18, 2001 12:19 AM
Subject: pegged filer....write bound?
...
Scenario:
clustered F840's, ONTAP 5.3.7R2
one is a busy filer of mostly home directories
other is fine, no problems, much less loaded
middle of the day, response takes a nosedive on the busy one
CPU is pegged at 100%
very little NFS, CIFS, or network traffic
no backups or restores going on
no snapshots in progress (that we can tell)
As a user, response is *extremely* slow; sometimes a stat of a known
populated directory returns empty.   Effectively the filer is not serving
data.  (Worse, in my opinion, is that it returns the *wrong* data.)
We turn off NFS to see if that is the culprit.  Still CPU is pegged.
We terminate CIFS to see if that is.  CPU drops down, but not to zero.
We have only NFS, CIFS, and cluster licensed (i.e., no HTTP).
Here's what we saw:
home>  sysstat 2
 CPU    NFS   CIFS   HTTP    Net kB/s       Disk kB/s     Tape kB/s  Cache
                             in   out     read  write   read  write    age
 10%      0      0      0    10     6    12074      0      0      0     24
 47%      0      0      0     8     5     6369      7      0      0     24
 47%      0      0      0    13     7    16479  20258      0      0     24
 33%      0      0      0     4     3    11858  13180      0      0     24
 34%      0      0      0     8     4    12698  14703      0      0     24
  9%      0      0      0     9     5    11392      8      0      0     24
  9%      0      0      0     7     4    11097      0      0      0     24
 58%      0      0      0     9     3     6453   2415      0      0     24
 39%      0      0      0     7     2    15218  17034      0      0     24
 39%      0      0      0     5     2    13560  16924      0      0     24
 31%      0      0      0     9     5    11633  11593      0      0     24
  8%      0      0      0     8     4    10634      8      0      0     24
 10%      0      0      0     9     5    12992      0      0      0     23
 62%      0      0      0     8     3     9828   8030      0      0     23
 44%      0      0      0     9     4    17156  19024      0      0     23
 37%      0      0      0     6     3    15229  17994      0      0     23
  9%      0      0      0     6     2    12204      8      0      0     23
 10%      0      0      0    11     5    13574      8      0      0     23
 66%      0      0      0     5     3     9354  11421      0      0     23
Pardon my French, but WTF is this filer doing?  It looks and smells
like snapshot behaviour, but we weren't even near to the time it
should be doing a snapshot via the schedule.  No external scripts would
initiate one, either.
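The pattern in the sysstat output above (no client operations, yet bursts of heavy disk writes) can be picked out mechanically. What follows is only a rough sketch in Python, not anything shipped with ONTAP: the helper names, the 1000 kB/s threshold, and the assumption that each sample sits on one line in the column order of the sysstat header are all mine.

```python
# Column order assumed from the sysstat header quoted in this message.
FIELDS = ["cpu", "nfs", "cifs", "http", "net_in", "net_out",
          "disk_read", "disk_write", "tape_read", "tape_write", "cache_age"]

def parse_row(line):
    """Parse one sysstat sample line into a dict, or return None for non-data lines."""
    parts = line.split()
    if len(parts) != len(FIELDS) or not parts[0].endswith("%"):
        return None
    values = [int(parts[0].rstrip("%"))] + [int(p) for p in parts[1:]]
    return dict(zip(FIELDS, values))

def idle_but_writing(rows, write_kbps=1000):
    """Return samples with zero client ops but disk writes at or above write_kbps.

    The threshold is an arbitrary illustrative value, not a NetApp recommendation.
    """
    return [r for r in rows
            if r["nfs"] + r["cifs"] + r["http"] == 0
            and r["disk_write"] >= write_kbps]

# First five samples from the output above, one sample per line.
SAMPLE = """\
10% 0 0 0 10 6 12074 0 0 0 24
47% 0 0 0 8 5 6369 7 0 0 24
47% 0 0 0 13 7 16479 20258 0 0 24
33% 0 0 0 4 3 11858 13180 0 0 24
9% 0 0 0 9 5 11392 8 0 0 24
"""

rows = [r for r in map(parse_row, SAMPLE.splitlines()) if r is not None]
bursts = idle_but_writing(rows)
for r in bursts:
    print(f"CPU {r['cpu']}%  disk write {r['disk_write']} kB/s with no client ops")
```

Run over a longer capture, this would show how often the write bursts recur, which might help tell a periodic job from continuous back-to-back consistency points.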
Our solution was, unfortunately, a reboot.
The second time this happened, we grabbed some output from `wafl_susp`
to check on the consistency points, since we are suspecting this poor
filer is write-bound (insufficient NVRAM cache--half is "lost" to the
partner for clustering).  The counts of all the cp_* parameters show
*less* than the minimum number of consistency points expected (uptime
times 6 per minute, i.e., a minimum of once every 10 seconds).  And, of
course, lots of cp_from_log_full and cp_from_cp.
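The arithmetic behind that minimum can be written out as a small Python sketch. It takes the 10-second timer interval stated above as given; the function names and the example numbers are purely illustrative and are not values read from this filer.

```python
def min_expected_cps(uptime_seconds, interval_seconds=10):
    """Minimum consistency points expected if one fires at least every interval."""
    return uptime_seconds // interval_seconds

def cp_shortfall(observed_cps, uptime_seconds, interval_seconds=10):
    """How far the observed CP total falls below the expected minimum (0 if none)."""
    return max(0, min_expected_cps(uptime_seconds, interval_seconds) - observed_cps)

# Hypothetical example: one day of uptime and a made-up observed total.
uptime = 24 * 60 * 60          # 86400 seconds
observed = 7000                # illustrative only, not from wafl_susp output
print(min_expected_cps(uptime))        # 6 per minute * 1440 minutes
print(cp_shortfall(observed, uptime))
```

A nonzero shortfall would mean the filer went at least some stretches longer than 10 seconds without completing a consistency point.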
Anybody seen anything like this?  Any idea what is going on?  Are we
just beating the crap out of this thing, and it gives up the ghost by
pretending it is busy to avoid doing anything else?
(We've already got another clustered pair of F840's in house, in
testing, soon to be deployed.  Not soon enough.  Figures.)
Until next time...
The Mathworks, Inc. 508-647-7000 x7792
3 Apple Hill Drive, Natick, MA 01760-2098 508-647-7001 FAX
tmerrill@mathworks.com http://www.mathworks.com
