Scenario:
clustered F840's, ONTAP 5.3.7R2
one is a busy filer of mostly home directories
other is fine, no problems, much less loaded
middle of the day, response takes a nosedive on the busy one
CPU is pegged at 100%
very little NFS, CIFS, or network traffic
no backups or restores going on
no snapshots in progress (that we can tell)
As a user, response is *extremely* slow; sometimes a stat of a known populated directory comes back empty. Effectively the filer is not serving data. (Worse, in my opinion: it is returning the *wrong* data.)
We turn off NFS to see if that is the culprit. Still CPU is pegged. We terminate CIFS to see if that is. CPU drops down, but not to zero. We have only NFS, CIFS, and cluster licensed (i.e., no HTTP).
Here's what we saw:
home> sysstat 2
 CPU    NFS   CIFS   HTTP    Net kB/s     Disk kB/s     Tape kB/s  Cache
                             in    out    read   write  read write   age
 10%      0      0      0    10      6   12074       0     0     0    24
 47%      0      0      0     8      5    6369       7     0     0    24
 47%      0      0      0    13      7   16479   20258     0     0    24
 33%      0      0      0     4      3   11858   13180     0     0    24
 34%      0      0      0     8      4   12698   14703     0     0    24
  9%      0      0      0     9      5   11392       8     0     0    24
  9%      0      0      0     7      4   11097       0     0     0    24
 58%      0      0      0     9      3    6453    2415     0     0    24
 39%      0      0      0     7      2   15218   17034     0     0    24
 39%      0      0      0     5      2   13560   16924     0     0    24
 31%      0      0      0     9      5   11633   11593     0     0    24
  8%      0      0      0     8      4   10634       8     0     0    24
 10%      0      0      0     9      5   12992       0     0     0    23
 62%      0      0      0     8      3    9828    8030     0     0    23
 44%      0      0      0     9      4   17156   19024     0     0    23
 37%      0      0      0     6      3   15229   17994     0     0    23
  9%      0      0      0     6      2   12204       8     0     0    23
 10%      0      0      0    11      5   13574       8     0     0    23
 66%      0      0      0     5      3    9354   11421     0     0    23
Pardon my French, but WTF is this filer doing? It looks and smells like snapshot behaviour, but we weren't anywhere near the time a scheduled snapshot should run. No external scripts would have initiated one, either.
Our solution was, unfortunately, a reboot.
The second time this happened, we grabbed some output from `wafl_susp` to check on the consistency points, since we suspect this poor filer is write-bound (insufficient NVRAM--half of it is "lost" to the partner for clustering). The counts of all the cp_* parameters add up to *fewer* than the minimum number of consistency points expected (uptime in minutes times 6, since a CP should occur at least once every 10 seconds). And, of course, lots of cp_from_log_full and cp_from_cp.
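To make that arithmetic concrete, here is roughly the back-of-the-envelope check we did, as a quick Python sketch. The uptime and cp_* values below are placeholders for illustration only, not our real numbers; the real counters come out of `wafl_susp`:

    # WAFL should take a consistency point (CP) at least once every 10 seconds,
    # i.e. at least 6 per minute of uptime. All values below are placeholders.
    uptime_minutes = 21 * 24 * 60           # e.g. 21 days of uptime (made up)
    expected_min_cps = uptime_minutes * 6   # at least one CP per 10 seconds

    # hypothetical cp_* counters as read off the wafl_susp output
    cp_counts = {
        "cp_from_timer":    95000,
        "cp_from_log_full": 45000,
        "cp_from_cp":       30000,
    }

    total_cps = sum(cp_counts.values())
    print("expected at least", expected_min_cps, "CPs; observed", total_cps)
    if total_cps < expected_min_cps:
        # Fewer CPs than the 10-second rule allows means individual CPs are
        # taking longer than 10 seconds, i.e. writes aren't flushing fast enough.
        print("CPs are running long -> looks write-bound")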
Anybody seen anything like this? Any idea what is going on? Are we just beating the crap out of this thing, and it gives up the ghost by pretending it is busy to avoid doing anything else?
(We've already got another clustered pair of F840's in house, in testing, soon to be deployed. Not soon enough. Figures.)
Until next time...
The Mathworks, Inc.                          508-647-7000 x7792
3 Apple Hill Drive, Natick, MA 01760-2098    508-647-7001 FAX
tmerrill@mathworks.com                       http://www.mathworks.com
---
Could snapmirror have started a transfer to a local volume that starts midday? The CIFS terminate clearing things up a bit makes me lean away from snapmirror.
mikef
---
On Tue, 17 Jul 2001, Mike Federwisch wrote:
Could snapmirror have started a transfer to a local volume that starts midday? The CIFS terminate clearing things up a bit makes me lean away from snapmirror.
I wrote:
We have only NFS, CIFS, and cluster licensed (i.e., no HTTP).
Whoops. No snapmirror, either. Just NFS, CIFS, and cluster licensed.
Until next time...
---
Dear Todd,
I have seen something similar (but not on a cluster). We had a very high load on an F740 with 5.3.7R2. After it had been running for a long time at high CPU (95%), at some point the CPU hit 100% but the filer wasn't actually doing anything (fewer than 100 CIFS ops/s where there should have been more than 1500 CIFS ops/s). Very, very slow response was the result. After terminating CIFS it stayed busy for a while, but after a few minutes the CPU dropped to zero. In some cases it helps to restart CIFS. One time we had to reboot the system.
The first diagnosis was that we were only using one 100Mb NIC, so we activated our gigabit NIC, and that was much better. But the problem was solved completely when we upgraded to 6.1R1. The CPU load was lower than under 5.x (for the same throughput), and when there was a high CPU load the filer kept running.
I hope this helps you,
Best regards,
Reinoud
UZ Leuven
Belgium
----- Original Message -----
From: "Todd C. Merrill" tmerrill@mathworks.com
To: toasters@mathworks.com
Sent: Wednesday, July 18, 2001 12:19 AM
Subject: pegged filer....write bound?
---
On Tue, 17 Jul 2001, Todd C. Merrill wrote:
We turn off NFS to see if that is the culprit. Still CPU is pegged. We terminate CIFS to see if that is. CPU drops down, but not to zero. We have only NFS, CIFS, and cluster licensed (i.e., no HTTP).
Here's what we saw:
home> sysstat 2
 CPU    NFS   CIFS   HTTP    Net kB/s     Disk kB/s     Tape kB/s  Cache
                             in    out    read   write  read write   age
 10%      0      0      0    10      6   12074       0     0     0    24
 47%      0      0      0     8      5    6369       7     0     0    24
 47%      0      0      0    13      7   16479   20258     0     0    24
 33%      0      0      0     4      3   11858   13180     0     0    24
 34%      0      0      0     8      4   12698   14703     0     0    24
As a follow-up to this, we never found out why the filer stayed mental after all services were turned off, but we did find what made it *start* to go mental. A bad entry in our WINS database caused the filer to try to authenticate against either non-existent or topologically distant domain controllers. The CIFSAuthen process on the filer was taking up most of the CPU resources (50-60%), apparently in a wait state, waiting either for a timeout or for a slow response from a distant DC. Removing the bad/distant entries has apparently left the filer with a closer, faster set of DC's to choose from when authenticating.
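To put rough numbers on the "waiting for a timeout" theory, here is a purely illustrative sketch; the per-attempt timeout, number of stale entries, and logon rate are guesses rather than anything we measured, and we don't know how the filer actually schedules this work internally:

    # Illustration only: how stale WINS/DC entries inflate authentication latency.
    # Every figure here is a made-up placeholder, not a measurement from our filer.
    dc_timeout_s   = 30   # assumed per-attempt timeout against an unreachable DC
    stale_entries  = 3    # assumed dead/distant DCs tried before reaching a good one
    logons_per_min = 20   # assumed CIFS authentication rate during the day

    stall_per_logon_s = dc_timeout_s * stale_entries
    print("each authentication stalls ~", stall_per_logon_s, "s before a good DC answers")

    # If those stalls serialize behind a single authenticator (an assumption),
    # the backlog grows much faster than it drains:
    work_per_minute_s = logons_per_min * stall_per_logon_s
    print("~", work_per_minute_s / 60.0, "minutes of queued auth work per minute of logons")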
If you'll recall, this problem was with ONTAP 5.3.7. We have since found out that in ONTAP 6.x, there is the command:
cifs prefdc
to give the filer a hard-coded list of (fast and local) DC's against which to authenticate, similar to the `options nis.servers` for NIS servers. We are now accelerating our upgrade plan to take advantage of this feature. ;)
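For anyone else planning the same move, the 6.x syntax is along these lines (the domain name and addresses here are made up for illustration; check the 6.x man pages for the exact form):

    cifs prefdc add ENGDOMAIN 10.1.1.21 10.1.1.22
    cifs prefdc print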
Thanks to all who replied publicly and privately with suggestions. Thanks to the NetApp admin class 202 for all kinds of cool rc_toggle_basic goodies. And thanks to the NetApp tech support folks and our local SE for the interpretation of all of this data. It was, in the best sense of the word, a team effort: nailing this problem down despite the deceptive symptoms.
Until next time...
---