Hello all,
A week or so ago, our N6240E21 (FAS3240C) started reporting 100% CPU utilization. Graphs from our monitoring system show the typical random peaks and valleys averaging around 20-30% utilization, then suddenly a plateau at 100% lasting for an entire week (and still ongoing).
The weird thing is -- the filer really *isn't* that busy:
red-str-napc1-p2> sysstat -x -c 10 1
 CPU    NFS   CIFS   HTTP   Total     Net   kB/s    Disk   kB/s    Tape   kB/s  Cache  Cache    CP  CP  Disk   OTHER    FCP  iSCSI     FCP   kB/s   iSCSI   kB/s
                                       in    out    read  write    read  write    age    hit  time  ty  util                            in    out      in    out
100%   1489    530      0    2019   13174  10363   14636  11588       0      0     4s    84%   20%   :   21%       0      0      0       0      0       0      0
100%   2001    636      0    2637   29590  20895   25112  14140       0      0     4s    77%   15%  Hn   16%       0      0      0       0      0       0      0
100%   1140    608      0    1748   11829   6349   14396  58396       0      0     4s    95%  100%  :f   22%       0      0      0       0      0       0      0
 99%   3703    404      0    4107   38005 108512  164912  23204       0      0     4s    65%   46%   :   60%       0      0      0       0      0       0      0
100%   1429    195      0    1627   15986  10483   23296  53132       0      0     3s    93%   37%  Hn   25%       3      0      0       0      0       0      0
100%   1440     35      0    1475   32821  11302   16796  35488       0      0     3s    91%   68%   :   20%       0      0      0       0      0       0      0
100%   1461      0      0    1461   10467   8030    7912     32       0      0     3s    75%    0%   -   23%       0      0      0       0      0       0      0
100%   1845    280      0    2125   28710  17652   19624  12636       0      0     3s    83%   17%  Hn   24%       0      0      0       0      0       0      0
100%   2070    191      0    2261    6911  20148   23964  80048       0      0     3s    94%   66%   :   19%       0      0      0       0      0       0      0
100%   1477    153      0    1633   29005   8196    8536     24       0      0     3s    76%    0%   -   12%       3      0      0       0      0       0      0

red-str-napc1-p2*> sysstat -m -c 10 1
 ANY  AVG CPU0 CPU1 CPU2 CPU3
 47%  65%  85%  70%  74%  30%
 74%  76%  87%  78%  83%  56%
 47%  65%  85%  70%  75%  31%
 59%  68%  84%  71%  76%  41%
 49%  66%  85%  72%  76%  32%
 50%  68%  83%  77%  79%  33%
 56%  70%  87%  76%  79%  37%
 47%  66%  85%  72%  76%  29%
 29%  62%  86%  70%  75%  16%
 36%  64%  86%  72%  76%  23%

red-str-napc1-p2*> priv set advanced
red-str-napc1-p2*> ps -c 5
Process statistics over 1218393.619 seconds...
   ID State Domain %CPU StackUsed %StackUsed Name
    5 RR    i       76%      1016        24% idle_thread0
    6 RR    i       76%       904        22% idle_thread1
    7 RR    i       75%       904        22% idle_thread2
    8 RR    i       64%      1024        25% idle_thread3
   89 BG    1        5%      4440         6% NwkThd_01
  108 RR    2        7%      1944         5% 10GbE/e1b
  294 BR    r       12%      4104        25% raidio_thread
 1539 BR    w        6%      5736        17% wafl_exempt_0
 1540 BR    w        6%      5736        17% wafl_exempt_1
 1541 BR    w        6%      5736        17% wafl_exempt_2
 1544 BR    k        9%     13720        41% wafl_lopri
(A few other processes are >0%, but these are the most notable).
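As a quick sanity check on the numbers above, here is a minimal Python sketch that averages each column of the ten sysstat -m samples and converts the ps statistics window into days. The sample rows and the window length are copied by hand from the output above; the names SYSSTAT_M, column_means and PS_WINDOW_SECONDS are made up for illustration and are not part of any ONTAP tooling. It only summarizes these samples and does not explain the discrepancy.

```python
# Ten one-second samples copied from the `sysstat -m -c 10 1` output above.
SYSSTAT_M = """\
ANY AVG CPU0 CPU1 CPU2 CPU3
47% 65% 85% 70% 74% 30%
74% 76% 87% 78% 83% 56%
47% 65% 85% 70% 75% 31%
59% 68% 84% 71% 76% 41%
49% 66% 85% 72% 76% 32%
50% 68% 83% 77% 79% 33%
56% 70% 87% 76% 79% 37%
47% 66% 85% 72% 76% 29%
29% 62% 86% 70% 75% 16%
36% 64% 86% 72% 76% 23%
"""

# Window reported by `ps -c 5` ("Process statistics over ... seconds").
PS_WINDOW_SECONDS = 1218393.619


def column_means(text):
    """Return {column name: mean percentage} for whitespace-separated samples.

    The first line is treated as the header; every other line is one sample.
    """
    lines = [line.split() for line in text.strip().splitlines()]
    header, rows = lines[0], lines[1:]
    means = {}
    for index, name in enumerate(header):
        values = [float(row[index].rstrip("%")) for row in rows]
        means[name] = sum(values) / len(values)
    return means


means = column_means(SYSSTAT_M)
for name, value in means.items():
    print(f"{name}: {value:.1f}%")

# Express the ps window in days to see how far back its averages reach.
print(f"ps window: {PS_WINDOW_SECONDS / 86400:.1f} days")
```

On these ten samples the AVG column works out to roughly 67%, well short of the 100% in the CPU column of sysstat -x. The ps window is about 14 days, so its per-thread percentages are averaged over a period that also includes time before the plateau started.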
We are running on ONTAP 8.0.2P3 (7-mode) and the filer is primarily doing NFS for VMware datastores with a bit of CIFS sharing mixed in.
We have opened a support case with IBM, but so far they are telling us that this may be "normal". They're still helping us investigate, so we may yet get something from that route, but I wanted to throw this out here because it certainly doesn't seem normal.
I'm guessing a controller reboot would solve the problem, but I would like to see if there is an alternative or an explanation first.
This thread[1] seems similar, but there wasn't really a resolution.
Thanks,
Ray