We had some slowdown a while back due to dedupe processes all running at the default midnight start time. Not 93 seconds worth, but noticeable slowness. We stagger them now, obviously, and don't run dedupe on our VM vols every night; a sketch of the scheduling commands is below.
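For what it's worth, staggering in 7-mode looks roughly like this. This is a sketch from memory, so check sis(1) on your release; the volume names here are made up:

    sis config /vol/mailvol                 # show the current schedule (default is sun-sat@0, i.e. midnight)
    sis config -s sun-sat@1 /vol/mailvol    # move mail dedupe to 1 AM
    sis config -s sun-sat@3 /vol/vmvol1     # stagger the VM vols to 3 AM...
    sis config -s sat@3 /vol/vmvol2         # ...and run this one only weekly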
Some questions would be:
How many disks, and of what type and speed, are in the aggr?
If the VMs' partitions are not aligned, that's a big one.
Maybe you need to add more disks or run a reallocate against the vols.
In my case, I have to run reallocate on my vols because I have hot disks. Some command sketches are below.
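To answer the layout question and see whether reallocate would help, something along these lines (a sketch; the aggr/vol names are placeholders, and the reallocate flags are worth double-checking against the man page):

    aggr status -r aggr1                  # disk count, type, and RPM per raid group
    reallocate measure -o /vol/vmvol1     # one-shot measure of how fragmented the vol is
    reallocate start -f -p /vol/vmvol1    # one-time full (physical) reallocation if the measure is bad

For hot disks, statit in advanced mode shows per-disk utilization:

    priv set advanced
    statit -b        # begin collecting
                     # wait a minute or two under typical load
    statit -e        # end and print per-disk statistics
    priv set

For the alignment question on NFS-hosted VMs, NetApp's mbrscan/mbralign tools from the host utilities are the usual way to find and fix misaligned guest partitions.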
A while back, I had an issue with the filer running out of memory and dropping packets (very nerve-racking), but that was back in the 7.x days.
Date: Thu, 24 May 2012 14:50:09 +0200
From: amon@aelita.org
To: scl@virginia.edu
Subject: Re: Diagnosing outage on FAS6040 running 8.0.1 7-mode
CC: toasters@teaparty.net
On 24/05/2012 08:28, Steve Losen wrote:
>
> Hello Toasters,
>
> We had an incident where a FAS6040 running 8.0.1 gave very
> slow response to a variety of requests. This resulted in
> a bunch of Linux VMs experiencing failed disk I/O requests,
> which resulted in corrupted local Linux filesystems. One
> Linux VM logged that it waited 180 sec. for a disk I/O to
> complete.
>
> The filer did not reboot, and recovered on its own, but we
> are very interested in figuring out what happened and if we
> can avoid it (known bug?, should we upgrade ONTAP?)
>
> This 6040 does a variety of jobs -- NFS server for a CommuniGate Pro
> email system, NFS volumes for VMware, and it even has a few FC
> SAN LUNs used by SharePoint. In general it does not appear to be
> overloaded. It's been up for over 300 days with no problems.
> The CF partner filer experienced no problems at that time and it
> also holds VMWARE volumes and mail volumes.
>
> At the time of the outage the /etc/messages file indicated a slow
> NFS response (93 sec) to one of the CommuniGate servers. It also
> indicated that a FC LUN was reset by the sharepoint server. I'm
> guessing due to delayed response. The mail and SharePoint
> volumes are in the same aggregate. I see three resets for the
> LUN in /etc/messages.
>
> Looking in /etc/log/ems at about the time of the outage
> (Tue May 22 17:30 EDT), I see that a raid scrub of a raid
> group in the mail/SharePoint aggregate completed with no errors.
> I also see wafl_cp_toolong_warning_1 and wafl_cp_slovol_warning_1
> for a different aggregate (which contains VMs). I see several
> of these, which I presume are caused by slow completion of CPs.
>
> I don't know if these caused the problem or were caused by the
> problem.
>
> Anyone have any suggestions for further investigation or diagnosis?
> Any other logs to look at? Everything is fine now and has been
> running fine since the outage.
I would start by looking at graphs of the NetApp's interface traffic,
CPU, and IOPS to see whether a storage client suddenly caused an
activity spike that slowed everything else down.
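If you don't have graphing set up, even a few minutes of sysstat during
a bad period tells you a lot. From memory, on 7-mode:

    sysstat -x 1    # per-second CPU, ops/s, net and disk throughput, and CP type
    nfsstat -h      # per-client NFS counts (needs nfs.per_client_stats.enable on)

Given the wafl_cp_toolong messages, the CP type column of sysstat -x is
worth watching: repeated back-to-back CPs (B) would suggest the disks in
that aggregate cannot keep up with the write load.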
--
Herve Boulouis
_______________________________________________
Toasters mailing list
Toasters@teaparty.net
http://www.teaparty.net/mailman/listinfo/toasters