On the off chance someone has seen this - I'm having trouble with a filer. I can't ssh to it reliably (mostly not at all).
I'm pretty sure that's correlated with high CPU load - my system console has had it 'spiked' at >95% for the last 24h, which is much higher than 'normal'.
What I'm not sure of is quite what's causing it - the filer is busy, but not abnormally so.
The only thing I can think of that _might_ have changed is API calls (qtree-list, get-file-info) - I've recently started doing quota SNMP trap enrichment (but that runs 'every few minutes' at most).
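For reference, the enrichment does something along these lines - a minimal sketch using the NetApp Manageability SDK's Python bindings (NaServer/NaElement); the hostname, credentials and volume name below are placeholders, not our real ones:

from NaServer import NaServer, NaElement

# Connect to the filer over ONTAPI/ZAPI (7-Mode, API version 1.15).
filer = NaServer('filer01.example.com', 1, 15)   # hypothetical hostname
filer.set_transport_type('HTTPS')
filer.set_style('LOGIN')
filer.set_admin_user('monitor', 'secret')        # placeholder credentials

# qtree-list walks every qtree on the volume - cheap as a one-off, but
# it adds up if a trap storm makes the enrichment fire repeatedly.
req = NaElement('qtree-list')
req.child_add_string('volume', 'vol0')           # placeholder volume
res = filer.invoke_elem(req)
if res.results_status() == 'failed':
    raise RuntimeError(res.results_reason())
for qtree in res.child_get('qtrees').children_get():
    print(qtree.child_get_string('qtree'))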
But otherwise - I'm not sure what might be causing sshd to stall, or whether there's a way to 'kick' it.
This is a 7-Mode filer, on 8.2.1.
I've got a case open, but would appreciate any further insight on how to track down a 'high CPU is making ssh not respond' type of issue.
I'm pretty sure a failover/failback will do the trick, but that'll have to wait until the weekend - I'd rather avoid it if I can manage it.
My current ps list looks like:
Process statistics over 67.328 seconds...
  ID  State  Domain  %CPU  StackUsed  %StackUsed  Name
 195  RR     N        47%       6928         10%  NwkThd_00
 196  RR     N        47%       7880         12%  NwkThd_01
 197  RR     0        47%       6928         10%  NwkThd_02
 223  BR     s         7%       7648         46%  pmcsas_intrd_1
 259  BR     e         5%       2440         19%  fal_io_thread2
 502  BR     R         7%       7448         45%  raidio_thread
 503  BR     R         7%       7448         45%  raidio_thread
 635  BG     k         6%      15184         11%  snmpd
1614  BR     0         5%       3464         10%  ntm_main
1711  RR     w        35%      14256         21%  wafl_exempt00
1712  BR     w        35%      14136         21%  wafl_exempt01
1713  BR     w        35%      14136         21%  wafl_exempt02
2599  BR     k         5%       2752          8%  gr_scheduler
That seems pretty busy for a 4-CPU system...
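Next thing I'm planning to try is the per-domain breakdown from an advanced-privilege console session - roughly this, if I've remembered the commands right:

priv set advanced
sysstat -M 1     # per-CPU / per-domain utilisation, 1-second intervals
statit -b        # begin collecting detailed per-thread and disk stats
statit -e        # ...end and print them a minute or so later
priv set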
Thanks and regards, Ed.
The first thing that caught my eye was snmpd - any chance you've set up new SNMP polling from monitoring stations that's querying the disks over and over? If you can, turn SNMP off for a short while to see if the load goes away.
Thanks for the response. Yes, we're polling with Zabbix (and generating SNMP traps).
So I'll shut those down for a while and see if that helps.
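(For anyone following along, shutting it off on the filer side should just be a matter of toggling the option, and back on again afterwards:)

options snmp.enable off
options snmp.enable on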
Yep, it was Zabbix for me as well - it killed the CPU on all my filers. You'll have to go through and remove a bunch of checks.
With Zabbix off all night, we've got as far as picking up a possible bug with 'sshd' - the login is actually 'going', in that it's connecting and doing key exchange; it's just not getting as far as the shell login.
(And on the filer, I get 'connection timed out' messages).
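(That's based on verbose output from the client side - something like 'ssh -vvv admin@<filer>' - which shows the key exchange completing and then just sitting there, with no shell ever appearing.)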
I'm still not sure quite why - rshstat/rshkill cleared out some stale processes, but I think those were more symptom than cause.
Our next step is 'reboot it', which will have to wait until an outage window.
Don't suppose anyone has any handy tricks for force-killing sshd on a filer? (I've gone as far as firing up systemshell, but sshd doesn't seem to respond to kill signals.)
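For reference, this is roughly what I ran - systemshell needs the diag user unlocked first, and the PID is whatever ps reports for sshd:

priv set diag
useradmin diaguser unlock
useradmin diaguser password
systemshell
% ps aux | grep sshd
% sudo kill -9 <pid>    # no visible effect in my case
% exit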
Again, similar issue - pretty sure I did a kill -9, but I'm not positive. I believe my issue was similar to this:
http://community.netapp.com/t5/Data-ONTAP-Discussions/Systemshell-cant-ssh-t...