resending this without the 80kb chart
Yesterday morning one of the heads on our 3270 experienced large NFS latency spikes causing our VMware hosts and their VMs to log storage timeouts. This latency does not correlate to any external metrics like CPU, network, OPS etc.
But the logs do show CP events on the aggregate hosting the VMs:
Jan 14 05:27:56 [n04:wafl.cp.slovol:warning]: aggregate aggr2 is holding up the CP.
And the EMS log has CP events logged for the duration of the episode - what can we do to prevent these issues?
<wafl_cp_toolong_warning_1 total_ms="117825" total_dbufs="32276" clean="4312" v_ino="3" v_bm="29" a_ino="0" a_bm="3428" flush="1209"/>
</LR>
<LR d="14Jan2013 05:19:38" n="irt-na04" t="1358169578" id="1335304168/148007" p="4" s="Ok" o="wafl_CP_proc" vf="" type="0" seq="633232" >
<wafl_cp_slovol_warning_1 voltype="aggregate" volowner="" volname="aggr2" volident="" nt="35" nb="22045" clean="1346852" v_ino="0" v_bm="113" a_ino="0" a_bm="4" flush="0" rgid="2"/>
Netapp support wants me to run perfstats, but the issue is not ongoing - things are idle
thanks
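For anyone who wants to pull numbers out of EMS fragments like the one above, here is a minimal Python sketch. It assumes that total_ms is the CP duration in milliseconds and that t is a Unix epoch timestamp; the attribute names suggest this, but nothing in the thread confirms it. The fragment is not well-formed XML on its own (the LR wrapper is cut mid-record), so the sketch uses regular expressions rather than an XML parser. The name parse_events is made up for illustration.

```python
import re
from datetime import datetime, timezone

# Fragment copied from the EMS log quoted in the message above.
EMS_FRAGMENT = '''<wafl_cp_toolong_warning_1 total_ms="117825" total_dbufs="32276" clean="4312" v_ino="3" v_bm="29" a_ino="0" a_bm="3428" flush="1209"/>
</LR>
<LR d="14Jan2013 05:19:38" n="irt-na04" t="1358169578" id="1335304168/148007" p="4" s="Ok" o="wafl_CP_proc" vf="" type="0" seq="633232" >
<wafl_cp_slovol_warning_1 voltype="aggregate" volowner="" volname="aggr2" volident="" nt="35" nb="22045" clean="1346852" v_ino="0" v_bm="113" a_ino="0" a_bm="4" flush="0" rgid="2"/>'''

# Matches opening or self-closing tags and captures the raw attribute text.
TAG_RE = re.compile(r'<(\w+)((?:\s+\w+="[^"]*")*)\s*/?>')
ATTR_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_events(text):
    """Return a list of (tag_name, attribute_dict), one per opening tag."""
    return [(name, dict(ATTR_RE.findall(attrs)))
            for name, attrs in TAG_RE.findall(text)]


events = dict(parse_events(EMS_FRAGMENT))

# Assumption: total_ms is the CP duration in milliseconds.
cp_seconds = int(events["wafl_cp_toolong_warning_1"]["total_ms"]) / 1000

# Assumption: t is seconds since the Unix epoch.
logged_at = datetime.fromtimestamp(int(events["LR"]["t"]), tz=timezone.utc)

print(events["wafl_cp_slovol_warning_1"]["volname"], cp_seconds, logged_at.isoformat())
```

Under those assumptions the logged CP took roughly 118 seconds, and the LR timestamp converts to 13:19:38 UTC on 14 Jan 2013, eight hours ahead of the local time shown in the d attribute.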
What are you using for the backend?
--sk
On 1/15/2013 1:12 PM, Fletcher Cocquyt wrote:
Toasters mailing list Toasters@teaparty.net http://www.teaparty.net/mailman/listinfo/toasters
aggr2 consists of 95 x 15K RPM disks.
NetApp support asked whether any of the 5 symptoms from this KB doc applied, and they don't:
https://kb.netapp.com/support/index?page=content&id=2011942
thanks
On Jan 15, 2013, at 2:46 PM, Stuart Kendrick skendric@fhcrc.org wrote:
What are you using for the backend?
--sk
Check alignment of vmdk's
-----Original Message-----
From: Fletcher Cocquyt fcocquyt@stanford.edu
Sender: toasters-bounces@teaparty.net
Date: Tue, 15 Jan 2013 13:12:55
To: toasters@teaparty.net Lists toasters@teaparty.net
Subject: wafl_cp_slovol_warning_1 with big latency spikes
Our VMs are aligned; we went through that exercise 2.5 years ago. We also run a daily alignment report (using mbrscan) to catch any appliance-type VMs that could be unaligned.
thanks
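As a rough illustration of what an alignment report checks (this is not how mbrscan itself is implemented, and the helper names are hypothetical): WAFL uses 4 KB blocks, so a guest partition is considered aligned when its starting byte offset is a multiple of 4096. The sketch below reads the four primary entries of a classic MBR and assumes 512-byte sectors. The demo sector is synthetic, built only to exercise the check.

```python
import struct

SECTOR_SIZE = 512   # assumption: MBR LBA values count 512-byte sectors
WAFL_BLOCK = 4096   # WAFL block size in bytes


def partition_offsets(mbr):
    """Yield (index, start_offset_in_bytes) for each non-empty primary partition."""
    if len(mbr) < 512 or mbr[510:512] != b"\x55\xaa":
        raise ValueError("missing MBR boot signature")
    for i in range(4):
        base = 446 + 16 * i                  # partition table starts at byte 446
        ptype = mbr[base + 4]                # partition type, 0 means unused
        (start_lba,) = struct.unpack_from("<I", mbr, base + 8)
        if ptype != 0:
            yield i, start_lba * SECTOR_SIZE


def is_aligned(offset_bytes):
    """True when the partition starts on a 4 KB boundary."""
    return offset_bytes % WAFL_BLOCK == 0


def make_mbr(start_lbas):
    """Build a synthetic MBR sector for the demo (hypothetical helper)."""
    mbr = bytearray(512)
    for i, lba in enumerate(start_lbas):
        # status, CHS start (3 bytes), type 0x83, CHS end (3 bytes), LBA start, sector count
        struct.pack_into("<BBBBBBBBII", mbr, 446 + 16 * i,
                         0, 0, 0, 0, 0x83, 0, 0, 0, lba, 1)
    mbr[510:512] = b"\x55\xaa"
    return bytes(mbr)


# Start sector 63 is the classic misaligned layout; start sector 2048 is aligned.
report = [(i, off, is_aligned(off))
          for i, off in partition_offsets(make_mbr([63, 2048]))]
for row in report:
    print(row)
```

A partition starting at sector 63 begins at byte 32256, which is not a multiple of 4096, so it is flagged; one starting at sector 2048 begins at byte 1048576 and passes.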
On Jan 15, 2013, at 3:16 PM, "Jack Lyons" jack1729@gmail.com wrote:
Check alignment of vmdk's
Fletcher,
What ONTAP version are you running?
We've had a case open since we swapped our heads out to 3270s, with cp_slo_vols that seemed to be happening at random, but we thought we'd narrowed it down to times when large metadata writes are occurring: deletions into snapshots, for example. Latterly I could reliably trigger it with storage vmotions; it would normally occur at the end of the process (i.e. when VMware deletes the files and they end up in snaps).
NetApp had us upgrade to 8.1.2RC2 with reference to some bug IDs I don't have to hand at the moment. We thought this had fixed it (certainly storage vmotions were no longer triggering it); however, it reappeared when a number of LUNs were deleted at once the other day.
Regards,
Tim
I have seen the same issue when deleting large files off CIFS volumes, resulting in astronomical latency on the whole filer. It is especially noticeable on SATA aggrs.
ONTAP/WAFL seems to have an issue reclaiming zombie blocks.
/ Marcus
-----Original Message-----
From: toasters-bounces@teaparty.net [mailto:toasters-bounces@teaparty.net] On Behalf Of Tim Parkinson
Sent: 16 January 2013 07:45
To: Fletcher Cocquyt
Cc: toasters@teaparty.net Lists
Subject: Re: wafl_cp_slovol_warning_1 with big latency spikes
--
Tim Parkinson
Storage & Server Administrator
Corporate Information & Computing Services
The University of Sheffield
10-12 Brunswick Street
Sheffield S10 2FN

E-Mail: t.r.parkinson@sheffield.ac.uk
Tel: +44 (0) 114 222 3039
http://www.sheffield.ac.uk/cics/
I don't have any brilliant suggestions here ... but I am intensely interested to hear how one can narrow down the fault domain ... as we intermittently see a similar set of symptoms in our own gear.
Heads <===> Fabric <===> Backend
So the Backend isn't servicing requests as quickly as the clients would like ... but why?
(a) Backend: "Not enough spindles" for the load ... the queues on the backend grew sufficiently large (or the backend was dropping requests, requiring retransmits) that IOs took a long time to complete, so long that timers on the VMware hosts (and their guests) noticed and complained. Similarly, CPs took a long time to complete.
(b) Frontend: The frontend fumbled locks on the backend, e.g., laying down a write lock on a block, 'forgetting' to release it, trying to lock that same block again ... and having to go through a preempt routine before straightening things out ... IOs stalled for a while, timers on the VMware hosts and guests noticed and complained ...
(c) Fabric: The switch fabric in between front & back dropped frames (e.g. physical layer error) ... [in Fletcher's case, perhaps there is no Fabric, merely point-to-point links ... but conceptually this is a possible component]
(d) I suspect many others ...
So, for example, wafl_cp_toolong merely tells us that the Backend isn't servicing write requests as quickly as the Frontend wants (high latency) ... but it doesn't tell us why. How does one drill down into what is happening between Front & Back?
==> Is there a tool which records lock activity?
==> How does one insert a 'sniffer' into the path between Front & Back to capture FC (or SAS) traffic?
--sk
Hello Toasters,
I have a new FAS2240-4, which I installed using a laptop serial port connected to the Console Port with no problem. Now, when I connect via the same cable etc., I see nothing in HyperTerminal; there is no response. If I plug the cable into a FAS2020 or a FAS3140, I get a login prompt and can log in OK. I have powered off and restarted with the console cable attached and still get nothing. I tried PuTTY as well, with the same result. I have also tried a new cable, and that doesn't work either. Any ideas?
Cheers
David
-- IMPORTANT NOTICE: The contents of this email and any attachments are confidential and may also be privileged. If you are not the intended recipient, please notify the sender immediately and do not disclose the contents to any other person, use it for any purpose, or store or copy the information in any medium. Thank you.
Hi David,
How about connecting via the SP and then typing "system console"?
Maybe you could investigate further that way. (It's best practice, anyway.)
Sebastian
On 16.01.2013 17:41, David Laithwaite wrote:
Hello Toasters,
I have a new FAS2240-4, I installed using a laptop serial port connected to the Console Port with no problem.
Now, when I connect via the same cable etc I cannot see anything on the HyperTerminal, there is no response. If I plug the cable into a FAS2020 or a FAS3140 I get a login prompt and can login ok.
I have powered-off and re-started with the Console cable and nothing, I tried Putty as well the result was the same. I have tried a new cable and that doesn't work.
Any ideas ?
Cheers
David
And the username is naroot, with your root password.
And if the root password is longer than 16 characters, just truncate it to 16 characters and use the truncated version as the naroot password.
bye,
Alexander Griesser
System-Administrator
ANEXIA Internetdienstleistungs GmbH
Phone: +43-463-208501-320
Fax: +43-463-208501-500
E-Mail: ag@anexia.at
Web: http://www.anexia.at
Address, Klagenfurt headquarters: Feldkirchnerstraße 140, 9020 Klagenfurt
Managing Director: Alexander Windbichler
Commercial register: FN 289918a | Place of jurisdiction: Klagenfurt | VAT ID: AT U63216601
From: toasters-bounces@teaparty.net [mailto:toasters-bounces@teaparty.net] On Behalf Of Klise, Steve
Sent: Wednesday, 16 January 2013 18:11
To: 'Sebastian Goetze'; David Laithwaite
Cc: toasters@teaparty.net Lists
Subject: RE: console issue
And username is naroot, and your root password..
I should have added that what I really want to do is configure ControllerB, but when I tried to access it, I got the same behaviour as I described in my original message. I was using ControllerA as a test.
When I start a console on the SP, I can see the result of the command I entered in HyperTerminal. But I need to configure the controller.
Thanks
David
From: Sebastian Goetze [mailto:spgoetze@gmail.com]
Sent: 16 January 2013 17:08
To: David Laithwaite
Cc: toasters@teaparty.net Lists
Subject: Re: console issue
Hi David,
how about connecting via SP and then typing "system console"...
Maybe you could investigate further that way... (It's best practice, anyway)
Sebastian