That's SMVI - the problem is less pronounced there because there's typically much less I/O to replay when the vmsnap is deleted (the amount is a function of activity over the snapshot's lifetime, and with SMVI that lifetime is very short). With regular vmsnaps outside of SMVI, however, the snapshots are created and then left in place much longer (I've seen some that were MONTHS old in some environments) - deleting those would have a significant impact on CPU and storage I/O.
________________________________
From: Darren Sykes [mailto:Darren.Sykes@csr.com]
Sent: Tuesday, November 04, 2008 8:53 AM
To: Glenn Walker; Karlsson Ulf Ibrahim :ULK; toasters@mathworks.com
Subject: RE: Brief outages on the filer?
That's true - snapshot deletions are very heavy on CPU.
Admittedly we've got quite a bit of headroom, but on a 6070 I can't even see the spikes in I/O when SMVI commits the changes. If you think about it, it's only the changes that have accumulated while the machines are being backed up - about 10 seconds in our environment - so the impact isn't too great, and the commit takes less than half a second.
________________________________
From: Glenn Walker [mailto:ggwalker@mindspring.com]
Sent: 04 November 2008 13:45
To: Darren Sykes; Karlsson Ulf Ibrahim :ULK; toasters@mathworks.com
Subject: RE: Brief outages on the filer?
Well... maybe:
When the vmsnaps are deleted, it could definitely be a factor (and could be a factor even without the bug):
Part of the problem with vmsnaps is that deleting them replays all of the data from the VMware snapshot back into the active VMDK - that means a pretty heavy read/write pattern on the filer at the time, which _could_ impact the whole system (disk I/O contention, CP contention).
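If you want to see the mechanics, the redo logs sit right next to the VMDK on the datastore - with NFS you can watch them from any host that has the export mounted. A rough illustration (the mount point and VM name here are made up):

    # The delta file accumulates every write issued after the vmsnap was
    # taken; on deletion it is read back and merged into the flat file.
    ls -lh /mnt/datastore1/myvm/myvm-000001-delta.vmdk
    ls -lh /mnt/datastore1/myvm/myvm-flat.vmdk

The bigger that delta file has grown, the longer and heavier the commit.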
Still - it's not what we're facing.
________________________________
From: Darren Sykes [mailto:Darren.Sykes@csr.com]
Sent: Tuesday, November 04, 2008 8:39 AM
To: Glenn Walker; Karlsson Ulf Ibrahim :ULK; toasters@mathworks.com
Subject: RE: Brief outages on the filer?
It also wouldn't explain the iSCSI SQL LUNs performing badly at the same time (with the VMware bug you still get access to the disks; it's just that ESX doesn't unstun the VMs for a period of time).
________________________________
From: owner-toasters@mathworks.com [mailto:owner-toasters@mathworks.com] On Behalf Of Glenn Walker
Sent: 04 November 2008 13:19
To: Karlsson Ulf Ibrahim :ULK; toasters@mathworks.com
Subject: RE: Brief outages on the filer?
Thought about that, but we aren't taking vmsnaps (yet). We DID run into that bug, however - caused split brain on the ESX clusters during an HA event. Bad stuff.
________________________________
From: owner-toasters@mathworks.com [mailto:owner-toasters@mathworks.com] On Behalf Of Karlsson Ulf Ibrahim :ULK
Sent: Tuesday, November 04, 2008 6:23 AM
To: toasters@mathworks.com
Subject: RE: Brief outages on the filer?
Maybe this, from http://media.netapp.com/documents/tr-3428.pdf (NetApp + VMware storage best practices):
When using VMware snapshots (VMsnaps) with NFS datastores, a condition exists where I/O to the VM is suspended while VMsnaps are being deleted (or, more technically speaking, while the redo logs are being committed to the VMDK files). This issue is experienced with any VMware technology that leverages VMsnaps, such as VMware Consolidated Backup, Storage VMotion, Scalable Virtual Images, etc., and SnapManager for Virtual Infrastructure from NetApp. VMware has identified this behavior as a bug (SR195302591) and has released patch ESX350-200808401-BG, which addresses it. At present, this patch applies to ESX version 3.5, updates 1 and 2 only. If you plan on leveraging any of the applications that require the VMsnap process, please apply this patch, and complete its installation requirements, on the ESX servers in your environment. If you are using scripts to take on-disk snapshot backups and are unable to upgrade your systems, then VMware and NetApp recommend that you discontinue the use of the VMsnap process prior to executing the NetApp snapshot.
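A rough sketch of checking for and applying that patch from an ESX 3.5 service console - the staging steps depend on how the bundle was downloaded, so treat this as an outline and follow the patch README for the exact procedure:

    # List the updates already installed, to confirm the patch is missing
    esxupdate query

    # Then, from the directory where the ESX350-200808401-BG bundle
    # was unpacked:
    esxupdate update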
/Uffe
-----Original Message-----
From: owner-toasters@mathworks.com [mailto:owner-toasters@mathworks.com] On Behalf Of Glenn Walker
Sent: Monday, November 03, 2008 8:03 PM
To: Page, Jeremy; toasters@mathworks.com
Subject: RE: Brief outages on the filer?
Any way you can predict when it will happen? Sysstat (or better yet, perfstat) would be of help here.
Something I've noticed on my infrastructure: VMware over NFS (unsure about other protocols) will have huge spikes where the guests write lots of data in a quick burst. It happens only a few times a day on relatively quiet systems, but I can definitely see the spike on the filer. Perhaps you have the same thing going on - just a SWAG...
The impact on our side is not really felt - but the filer does go into back-to-back CPs from the massive spike (200MB/s-350MB/s in a short window), and that could manifest itself as 'poor disk response time'.
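If you want to check for the same pattern, the filer console will show it directly - a minimal capture, with the intervals picked arbitrarily:

    # One-second samples; a 'B' in the 'CP ty' column means back-to-back
    # CPs, and the read/write KB/s columns will show the burst itself
    sysstat -x 1

    # For something you can analyze after the fact (or send to NetApp),
    # run perfstat from an admin host - 'filername' is a placeholder,
    # and check the perfstat README for the flags your version takes
    perfstat -f filername -t 2 -i 5 > perfstat.out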
In our case, we're running VMware over NFS and Exchange over iSCSI on the same filers, but no one really complains when the 'events' happen. Just something I've noticed for a while.
This is on a FAS6070, and the busy time is recorded at around 6,000 NFS IOPS. That said, we did a stress test with about 25 guests running IOMeter and were able to push 15,000 NFS ops on node 1 and 10,000 NFS ops on node 2 (a combined 400MB/s write, 300MB/s read) without any reported performance problems.
________________________________
From: owner-toasters@mathworks.com [mailto:owner-toasters@mathworks.com] On Behalf Of Page, Jeremy
Sent: Monday, November 03, 2008 11:02 AM
To: toasters@mathworks.com
Subject: Brief outages on the filer?
I am seeing brief outages where my VMs (NFS as the back-end protocol) and SQL LUNs (FC) both complain of poor disk response time at the same time. I don't think it can be the infrastructure, since one is IP and the other FC. The LUNs are on a different set of spindles/a different aggr than the NFS volumes as well, so I don't think it's a disk bottleneck. I'm on a 3070, and we rarely hit 3,500 IOPS (with 90+% of that out of cache) or go above 40% on the busiest CPU (normally we're in the 15-25% range), so I'm not sure what's going on here. Any suggestions on how to troubleshoot it?
We're running 7.2.4; I want to wait for 7.3.1 to upgrade, since we're using NFS for VMware and there are several fixes in it that will be beneficial to us.