On Thu, Aug 11, 2016 at 7:43 AM, Edward Rolison <ed.rolison@gmail.com> wrote:

With Zabbix off all night, we've got as far as picking up a possible bug with 'sshd' - the login is actually 'going', in that it's connecting and doing key exchange; it's just not getting as far as the 'shell' login.

(And on the filer, I get 'connection timed out' messages).

I am still unsure quite why - rshstat/rshkill cleared out some stale processes, but I think they were more a symptom than a cause.

Our next line is 'reboot it', which'll have to wait until an outage window.

Don't suppose anyone has any handy tricks for 'force kill' on sshd on a filer? (I've gone as far as firing up systemshell, but 'sshd' doesn't seem to respond to kill signals).

On 10 August 2016 at 18:04, Douglas Siggins <siggins@gmail.com> wrote:
Yep, it was Zabbix for me as well. Killed the CPU on all my filers. You will have to go through and remove a bunch of checks.

On Wed, Aug 10, 2016 at 12:26 PM, Edward Rolison <ed.rolison@gmail.com> wrote:
Thanks for the response. Yes, we're polling with Zabbix (and generating SNMP traps).

So I'll shut those down for a while, and see if that helps.

On 10 August 2016 at 16:33, Douglas Siggins <siggins@gmail.com> wrote:
The first thing that caught my eye was snmpd. Any chance you set up new SNMP polling from monitoring stations that is querying the disks over and over? If you can, turn off SNMP for a short bit to see if it goes away.

On Wed, Aug 10, 2016 at 7:51 AM, Edward Rolison <ed.rolison@gmail.com> wrote:
On the off chance - I'm having trouble with a filer. I can't ssh to it reliably (at all, mostly).

I'm pretty sure that's correlated with some high CPU load - my system console has it 'spiked' at >95% for the last 24h, and that's much higher than 'normal'.

What I'm not sure of is quite what's causing it - the filer is busy, but not abnormally so.

The only thing I can think of that _might_ have changed it is API calls (qtree-list, get-file-info) - I've recently started doing quota SNMP trap enrichment (but that's 'every few minutes' at most).

But otherwise - I'm not sure what might be causing sshd to stall, and if there's a way to 'kick' it?

This is a 7-Mode filer, on 8.2.1.

I've got a case open, but would appreciate any further insight on how to track down a high CPU load that stops ssh from responding.

I'm pretty sure a failover/failback will do the trick, but that'll have to wait until the weekend - I'd like not to if I can manage it.

My current ps list looks like:
Process statistics over 67.328 seconds...

  ID State Domain %CPU StackUsed %StackUsed Name
 195 RR    N       47%      6928        10% NwkThd_00
 196 RR    N       47%      7880        12% NwkThd_01
 197 RR    0       47%      6928        10% NwkThd_02
 223 BR    s        7%      7648        46% pmcsas_intrd_1
 259 BR    e        5%      2440        19% fal_io_thread2
 502 BR    R        7%      7448        45% raidio_thread
 503 BR    R        7%      7448        45% raidio_thread
 635 BG    k        6%     15184        11% snmpd
1614 BR    0        5%      3464        10% ntm_main
1711 RR    w       35%     14256        21% wafl_exempt00
1712 BR    w       35%     14136        21% wafl_exempt01
1713 BR    w       35%     14136        21% wafl_exempt02
2599 BR    k        5%      2752         8% gr_scheduler

That seems pretty busy for a 4-CPU system...
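
As a rough check on that impression, the %CPU figures in the listing can be summed and compared with the capacity of four CPUs. This is only a minimal Python sketch: it assumes each %CPU value is a percentage of a single CPU (the listing does not say so), and it only counts the threads shown, so anything not in the listing is missed. The names `PS_ROWS`, `total_cpu_percent` and `NUM_CPUS` are made up for illustration.

```python
# Rows copied from the ps listing above:
# ID, State, Domain, %CPU, StackUsed, %StackUsed, Name
PS_ROWS = """\
195 RR N 47% 6928 10% NwkThd_00
196 RR N 47% 7880 12% NwkThd_01
197 RR 0 47% 6928 10% NwkThd_02
223 BR s 7% 7648 46% pmcsas_intrd_1
259 BR e 5% 2440 19% fal_io_thread2
502 BR R 7% 7448 45% raidio_thread
503 BR R 7% 7448 45% raidio_thread
635 BG k 6% 15184 11% snmpd
1614 BR 0 5% 3464 10% ntm_main
1711 RR w 35% 14256 21% wafl_exempt00
1712 BR w 35% 14136 21% wafl_exempt01
1713 BR w 35% 14136 21% wafl_exempt02
2599 BR k 5% 2752 8% gr_scheduler
"""

NUM_CPUS = 4  # from the message: "a 4cpu system"


def total_cpu_percent(ps_text):
    """Sum the %CPU column over every data row; skip headers and blank lines."""
    total = 0
    for line in ps_text.splitlines():
        fields = line.split()
        # A data row has seven fields and starts with a numeric thread ID.
        if len(fields) != 7 or not fields[0].isdigit():
            continue
        total += int(fields[3].rstrip("%"))
    return total


if __name__ == "__main__":
    total = total_cpu_percent(PS_ROWS)
    capacity = NUM_CPUS * 100
    print(f"listed threads: {total}% of {capacity}% ({total / capacity:.0%})")
```

Under that assumption the listed threads add up to 288% of a possible 400%, with the three NwkThd and three wafl_exempt threads accounting for most of it.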

Thanks and regards,
Ed.
