On the off chance someone has seen this - I'm having trouble with a filer. I can't ssh to it reliably (mostly not at all).
I'm pretty sure that's correlated with high CPU load - my system console has had it 'spiked' at >95% for the last 24h, which is much higher than 'normal'.
What I'm not sure of is quite what's causing it - the filer is busy, but not abnormally so.
The only thing I can think of that _might_ have changed is API calls (qtree-list, get-file-info) - I've recently started doing quota SNMP trap enrichment (but that runs 'every few minutes' at most).
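For reference, the enrichment does something along these lines - a minimal sketch using the NetApp Manageability SDK's Python bindings (NaServer/NaElement); the hostname, credentials and volume name below are placeholders, not our real ones:

from NaServer import NaServer, NaElement

# Connect to the filer over ONTAPI/ZAPI (7-Mode, API version 1.15).
filer = NaServer('filer01.example.com', 1, 15)   # hypothetical hostname
filer.set_transport_type('HTTPS')
filer.set_style('LOGIN')
filer.set_admin_user('monitor', 'secret')        # placeholder credentials

# qtree-list walks every qtree on the volume - cheap as a one-off, but
# it adds up if a trap storm makes the enrichment fire repeatedly.
req = NaElement('qtree-list')
req.child_add_string('volume', 'vol0')           # placeholder volume
res = filer.invoke_elem(req)
if res.results_status() == 'failed':
    raise RuntimeError(res.results_reason())
for qtree in res.child_get('qtrees').children_get():
    print(qtree.child_get_string('qtree'))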
But otherwise - I'm not sure what might be causing sshd to stall, or whether there's a way to 'kick' it.
This is a 7-Mode filer, on 8.2.1.
I've got a case open, but would appreciate any further insight on how to track down a 'high CPU is making ssh not respond' type of issue.
I'm pretty sure a failover/failback will do the trick, but that'll have to wait until the weekend - I'd rather avoid it if I can manage it.
My current ps list looks like:
Process statistics over 67.328 seconds...
  ID  State  Domain  %CPU  StackUsed  %StackUsed  Name
 195  RR     N        47%       6928         10%  NwkThd_00
 196  RR     N        47%       7880         12%  NwkThd_01
 197  RR     0        47%       6928         10%  NwkThd_02
 223  BR     s         7%       7648         46%  pmcsas_intrd_1
 259  BR     e         5%       2440         19%  fal_io_thread2
 502  BR     R         7%       7448         45%  raidio_thread
 503  BR     R         7%       7448         45%  raidio_thread
 635  BG     k         6%      15184         11%  snmpd
1614  BR     0         5%       3464         10%  ntm_main
1711  RR     w        35%      14256         21%  wafl_exempt00
1712  BR     w        35%      14136         21%  wafl_exempt01
1713  BR     w        35%      14136         21%  wafl_exempt02
2599  BR     k         5%       2752          8%  gr_scheduler
That seems pretty busy for a 4-CPU system...
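Next thing I'm planning to try is the per-domain breakdown from an advanced-privilege console session - roughly this, if I've remembered the commands right:

priv set advanced
sysstat -M 1     # per-CPU / per-domain utilisation, 1-second intervals
statit -b        # begin collecting detailed per-thread and disk stats
statit -e        # ...end and print them a minute or so later
priv set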
Thanks and regards, Ed.
The first thing that caught my eye was snmpd - any chance you've set up new SNMP polling from monitoring stations that's querying the disks over and over? If you can, turn SNMP off for a short while to see if the load goes away.
Thanks for the response. Yes, we're polling with Zabbix (and generating SNMP traps).
So I'll shut those down for a while and see if that helps.
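(For anyone following along, shutting it off on the filer side should just be a matter of toggling the option, and back on again afterwards:)

options snmp.enable off
options snmp.enable on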
Yep, it was Zabbix for me as well - it killed the CPU on all my filers. You'll have to go through and remove a bunch of checks.
With Zabbix off all night, we've got as far as picking up a possible bug with 'sshd' - the login is actually 'going', in that it's connecting and doing key exchange; it's just not getting as far as the shell login.
(And on the filer, I get 'connection timed out' messages).
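(That's based on verbose output from the client side - something like 'ssh -vvv admin@<filer>' - which shows the key exchange completing and then just sitting there, with no shell ever appearing.)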
I'm still not sure quite why - rshstat/rshkill cleared out some stale processes, but I think those were more symptom than cause.
Our next step is 'reboot it', which will have to wait until an outage window.
Don't suppose anyone has any handy tricks for force-killing sshd on a filer? (I've gone as far as firing up systemshell, but sshd doesn't seem to respond to kill signals.)
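For reference, this is roughly what I ran - systemshell needs the diag user unlocked first, and the PID is whatever ps reports for sshd:

priv set diag
useradmin diaguser unlock
useradmin diaguser password
systemshell
% ps aux | grep sshd
% sudo kill -9 <pid>    # no visible effect in my case
% exit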
Again, similar issue - pretty sure I did a kill -9, but I'm not positive. I believe my issue was similar to this:
http://community.netapp.com/t5/Data-ONTAP-Discussions/Systemshell-cant-ssh-t...