Scenario:
  clustered F840's, ONTAP 5.3.7R2
  one is a busy filer of mostly home directories
  other is fine, no problems, much less loaded

  middle of the day, response takes a nosedive on the busy one
  CPU is pegged at 100%
  very little NFS, CIFS, or network traffic
  no backups or restores going on
  no snapshots in progress (that we can tell)
As a user, response is *extremely* slow; sometimes a stat of a known populated directory returns empty. Effectively the filer is not serving data. (Worse, in my opinion, is that it returns the *wrong* data.)
We turn off NFS to see if that is the culprit. The CPU is still pegged. We terminate CIFS to see if that is it. CPU drops down, but not to zero. We have only NFS, CIFS, and cluster licensed (i.e., no HTTP).
Here's what we saw:
home> sysstat 2
 CPU   NFS  CIFS  HTTP   Net kB/s      Disk kB/s    Tape kB/s  Cache
                         in   out    read  write   read write    age
 10%     0     0     0   10     6   12074      0      0     0     24
 47%     0     0     0    8     5    6369      7      0     0     24
 47%     0     0     0   13     7   16479  20258      0     0     24
 33%     0     0     0    4     3   11858  13180      0     0     24
 34%     0     0     0    8     4   12698  14703      0     0     24
  9%     0     0     0    9     5   11392      8      0     0     24
  9%     0     0     0    7     4   11097      0      0     0     24
 58%     0     0     0    9     3    6453   2415      0     0     24
 39%     0     0     0    7     2   15218  17034      0     0     24
 39%     0     0     0    5     2   13560  16924      0     0     24
 31%     0     0     0    9     5   11633  11593      0     0     24
  8%     0     0     0    8     4   10634      8      0     0     24
 10%     0     0     0    9     5   12992      0      0     0     23
 62%     0     0     0    8     3    9828   8030      0     0     23
 44%     0     0     0    9     4   17156  19024      0     0     23
 37%     0     0     0    6     3   15229  17994      0     0     23
  9%     0     0     0    6     2   12204      8      0     0     23
 10%     0     0     0   11     5   13574      8      0     0     23
 66%     0     0     0    5     3    9354  11421      0     0     23
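If anyone wants to crunch a capture like this rather than eyeball it, here is a rough Python sketch. It assumes the 11-column sysstat layout shown in this post (CPU, NFS, CIFS, HTTP, net in/out, disk read/write, tape read/write, cache age). The field names, the `summarize` helper, and the 1000 kB/s "write burst" threshold are my own illustrative choices, not anything ONTAP defines. The sample rows are the first five from the capture.

```python
# Field names are hypothetical labels for the 11 columns in the capture.
FIELDS = ("cpu_pct", "nfs", "cifs", "http", "net_in", "net_out",
          "disk_read", "disk_write", "tape_read", "tape_write", "cache_age")

def parse_sysstat_row(line):
    """Return a dict for one data row, or None for headers and blank lines."""
    parts = line.split()
    if len(parts) != len(FIELDS) or not parts[0].endswith("%"):
        return None
    values = [int(parts[0].rstrip("%"))] + [int(p) for p in parts[1:]]
    return dict(zip(FIELDS, values))

def summarize(rows, burst_kb=1000):
    """Total client ops and disk reads, and count intervals with heavy writes.

    burst_kb is an assumed threshold for what counts as a write burst.
    """
    return {
        "samples": len(rows),
        "client_ops": sum(r["nfs"] + r["cifs"] + r["http"] for r in rows),
        "disk_read_total": sum(r["disk_read"] for r in rows),
        "write_bursts": sum(1 for r in rows if r["disk_write"] > burst_kb),
    }

SAMPLE = """\
 CPU   NFS  CIFS  HTTP   Net kB/s      Disk kB/s    Tape kB/s  Cache
 10%     0     0     0   10     6   12074      0      0     0     24
 47%     0     0     0    8     5    6369      7      0     0     24
 47%     0     0     0   13     7   16479  20258      0     0     24
 33%     0     0     0    4     3   11858  13180      0     0     24
 34%     0     0     0    8     4   12698  14703      0     0     24
"""

rows = [r for r in map(parse_sysstat_row, SAMPLE.splitlines()) if r]
summary = summarize(rows)
print(summary)
```

The point it makes is the same one the raw numbers make: zero client operations, yet steady disk reads with periodic large write bursts.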
Pardon my French, but WTF is this filer doing? It looks and smells like snapshot behaviour, but we weren't anywhere near the time the schedule should have been taking a snapshot. No external scripts would initiate one, either.
Our solution was, unfortunately, a reboot.
The second time this happened, we grabbed some output from `wafl_susp` to check on the consistency points, since we suspect this poor filer is write-bound (insufficient NVRAM cache--half is "lost" to the partner for clustering). The counts of all the cp_* parameters show *fewer* than the minimum number of consistency points expected (uptime in minutes times 6 per minute, i.e., at least one every 10 seconds). And, of course, there are lots of cp_from_log_full and cp_from_cp.
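To spell out the arithmetic behind that "minimum expected" figure, here is a small Python sketch. It assumes one timer-driven consistency point every 10 seconds, as described above. The function names are made up for illustration, and the observed count is whatever total you take from the cp_* counters (how to total them is left to the reader, since no counter output is reproduced here).

```python
def min_expected_cps(uptime_seconds, cp_interval_seconds=10):
    """Minimum consistency points expected if one fires every interval."""
    return uptime_seconds // cp_interval_seconds

def cp_shortfall(observed_total, uptime_seconds, cp_interval_seconds=10):
    """How far the observed CP count falls below the expected minimum.

    Returns 0 when the observed count meets or exceeds the minimum.
    """
    expected = min_expected_cps(uptime_seconds, cp_interval_seconds)
    return max(0, expected - observed_total)

# Hypothetical numbers: one hour of uptime, 300 CPs counted.
print(min_expected_cps(3600))   # 6 per minute * 60 minutes
print(cp_shortfall(300, 3600))
```

A nonzero shortfall would mean the filer is not even managing its timer-driven consistency points, which fits the write-bound theory.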
Anybody seen anything like this? Any idea what is going on? Are we just beating the crap out of this thing, and it gives up the ghost by pretending it is busy to avoid doing anything else?
(We've already got another clustered pair of F840's in house, in testing, soon to be deployed. Not soon enough. Figures.)
Until next time...
The Mathworks, Inc.                               508-647-7000 x7792
3 Apple Hill Drive, Natick, MA 01760-2098           508-647-7001 FAX
tmerrill@mathworks.com                      http://www.mathworks.com