We had some slowdown a while back due to dedupe processes all running at the default midnight start time. Not 93 seconds worth, but noticeable slowness. We stagger them now, obviously, and don't run dedupe on our VM vols every night; a sketch of the scheduling commands is below.
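For what it's worth, staggering in 7-mode looks roughly like this. This is a sketch from memory, so check sis(1) on your release; the volume names here are made up:

    sis config /vol/mailvol                 # show the current schedule (default is sun-sat@0, i.e. midnight)
    sis config -s sun-sat@1 /vol/mailvol    # move mail dedupe to 1 AM
    sis config -s sun-sat@3 /vol/vmvol1     # stagger the VM vols to 3 AM...
    sis config -s sat@3 /vol/vmvol2         # ...and run this one only weekly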
Some questions would be:
How many disks, and of what type and speed, are in the aggr?
If the VMs' partitions are not aligned, that's a big one.
Maybe you need to add more disks or run a reallocate against the vols.
In my case, I have to run reallocate on my vols because I have hot disks. Some command sketches are below.
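To answer the layout question and see whether reallocate would help, something along these lines (a sketch; the aggr/vol names are placeholders, and the reallocate flags are worth double-checking against the man page):

    aggr status -r aggr1                  # disk count, type, and RPM per raid group
    reallocate measure -o /vol/vmvol1     # one-shot measure of how fragmented the vol is
    reallocate start -f -p /vol/vmvol1    # one-time full (physical) reallocation if the measure is bad

For hot disks, statit in advanced mode shows per-disk utilization:

    priv set advanced
    statit -b        # begin collecting
                     # wait a minute or two under typical load
    statit -e        # end and print per-disk statistics
    priv set

For the alignment question on NFS-hosted VMs, NetApp's mbrscan/mbralign tools from the host utilities are the usual way to find and fix misaligned guest partitions.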
A while back, I had an issue with the filer running out of memory and dropping packets (very nerve-racking), but that was back in the 7.x days.
Date: Thu, 24 May 2012 14:50:09 +0200
From: amon@aelita.org
To: scl@virginia.edu
Subject: Re: Diagnosing outage on FAS6040 running 8.0.1 7-mode
CC: toasters@teaparty.net
On 24/05/2012 08:28, Steve Losen wrote:
>
> Hello Toasters,
>
> We had an incident where a FAS6040 running 8.0.1 gave very
> slow response to a variety of requests. This resulted in
> a bunch of Linux VMs experiencing failed disk I/O requests,
> which resulted in corrupted local Linux filesystems. One
> Linux VM logged that it waited 180 sec. for a disk I/O to
> complete.
>
> The filer did not reboot, and recovered on its own, but we
> are very interested in figuring out what happened and if we
> can avoid it (known bug?, should we upgrade ONTAP?)
>
> This 6040 does a variety of jobs -- NFS server for a CommuniGate Pro
> email system, NFS volumes for VMware, and it even has a few FC
> SAN LUNs used by SharePoint. In general it does not appear to be
> overloaded. It's been up for over 300 days with no problems.
> The CF partner filer experienced no problems at that time and it
> also holds VMWARE volumes and mail volumes.
>
> At the time of the outage the /etc/messages file indicated a slow
> NFS response (93 sec) to one of the CommuniGate servers. It also
> indicated that a FC LUN was reset by the sharepoint server. I'm
> guessing due to delayed response. The mail and SharePoint
> volumes are in the same aggregate. I see three resets for the
> LUN in /etc/messages.
>
> Looking in /etc/log/ems at about the time of the outage
> (Tue May 22 17:30 EDT), I see that a raid scrub of a raid
> group in the mail/SharePoint aggregate completed with no errors.
> I also see wafl_cp_toolong_warning_1 and wafl_cp_slovol_warning_1
> for a different aggregate (which contains VMs). I see several
> of these, which I presume are caused by slow completion of CPs.
>
> I don't know if these caused the problem or were caused by the
> problem.
>
> Anyone have any suggestions for further investigation or diagnosis?
> Any other logs to look at? Everything is fine now and has been
> running fine since the outage.
I would start by looking at graphs of the NetApp's interface traffic,
CPU, and IOPS to see whether a storage client suddenly caused an
activity spike that slowed everything else down.
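If you don't have graphing set up, even a few minutes of sysstat during
a bad period tells you a lot. From memory, on 7-mode:

    sysstat -x 1    # per-second CPU, ops/s, net and disk throughput, and CP type
    nfsstat -h      # per-client NFS counts (needs nfs.per_client_stats.enable on)

Given the wafl_cp_toolong messages, the CP type column of sysstat -x is
worth watching: repeated back-to-back CPs (B) would suggest the disks in
that aggregate cannot keep up with the write load.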
--
Herve Boulouis
_______________________________________________
Toasters mailing list
Toasters@teaparty.net
http://www.teaparty.net/mailman/listinfo/toasters