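From: owner-toasters@mathworks.com [mailto:owner-toasters@mathworks.com] On Behalf Of Ken Williams Sent: Monday, September 14, 2009 6:51 PM To: Nick Silkey Cc: toasters@mathworks.com Subject: RE: SMVI / VMWare Experiences...
Sounds like whatever user-defined script you have is failing intermittently. Or perhaps it's a VMware Tools issue.
We've been able to track our issue down to the guest OS level (Win2k3 specifically). Looks like it's an issue with VSS or LUN alignment.
I would recommend ensuring your LUNs are aligned (use mbrscan / mbralign from the NetApp ESX Host Utilities kit). There is detailed documentation on the NOW.netapp.com site.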
-----Original Message----- From: Nick Silkey [mailto:nick@silkey.org] Sent: Friday, September 11, 2009 7:15 PM To: Ken Williams Cc: toasters@mathworks.com Subject: Re: SMVI / VMWare Experiences...
Ken --
We too are experiencing issues with SMVI 1.2 bombing out when attempting to perform a VMware quiesce snap on _some_ RHEL5.3 32-bit VMs. A couple of notables:
- These problematic VMs have a 100% success rate at taking VMware quiesce snaps within vCenter, independent of SMVI.
- The problem is 100% reproducible during night, day, etc.
- We will deploy several VMs at a crack, all the same build. When the next SMVI schedule hits, some fail while others succeed. Bizarre.
- Over time (we've been experiencing this issue for several weeks now), the 'problem' VMs change. Example: VMs abc and xyz will fail for days; without intervention, VM abc will stop failing while VM xyz continues to fail ... even if they're part of the same deploy base template/kickstart.
- We are nowhere near our snap limit on the volumes.
- These problematic VMs only bomb when attempting a quiesce. Non-quiesce SMVI snaps work like a champ.
Been working with NetApp and VMware for some time now. We're at ESX 3.5u4+ to a 3160-R5 @ 7.2.6.1P3 via NFS + vCenter 4.0 + synch SnapMirror to another 3160-R5 @ 7.2.6.1P3. The only revealing thing is the SMVI + vCenter log message "cannot create a quiesced snapshot because the (user-supplied) custom prefreeze script in the virtual machine exited with a nonzero return code".
-- Nick
On Wed, Aug 26, 2009 at 5:32 PM, Ken Williams kwillia@smud.org wrote:
I'm looking for experiences people out there may have had with SMVI on NetApp. We're currently experiencing major issues with SMVI snapshots failing. I've had open tickets with NetApp/VMware/Microsoft for 3 months and still don't have a solution.
My environment looks like this:
- 6 x HP DL380 G5 (32 GB RAM) in an ESX cluster
- Dual Emulex 10000 cards in each host
- Cisco MDS SAN
- NetApp FAS3070 cluster
- ~9 TB aggregate for VMware
- VMFS datastores, ~10-15 VMs per datastore, ~50 GB per VM
- ASIS turned on
- Volume and LUN space reservation turned off
- ONTAP 7.2.5.1
- Windows 2003 guest OS
I can't see us reaching any limitation on the filers or the SAN. Yet we have random VMs failing snapshots every night. Are other people seeing these issues? (I've gone through the gamut of troubleshooting, version management of ESX/VMware Tools/etc.) Snapshots time out and fail at the VMware/guest level, not at the NetApp snapshot level.
We want to have SMVI function with VSS enabled.
Has anyone who had failing snapshots been able to resolve a similar issue?
Or does anyone have SMVI working properly that we could use as a reference to compare configuration?
Ken Williams Storage Administrator, Business Technology Operations Sacramento Municipal Utility District E-Mail: kwillia@smud.org Phone: (916) 732-6744 Cell: (916) 240-4213
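From: Steffen Kammerer [mailto:steffen.kammerer@brainlab.com] Sent: Tuesday, November 03, 2009 9:41 AM To: Ken Williams; Nick Silkey Cc: toasters@mathworks.com Subject: RE: SMVI / VMWare Experiences...
Hi there,
We have the same issue with SMVI 2.0 on NFS datastores. On some VMs we get the following error:
Cannot create a quiesced snapshot because the create snapshot operation exceeded the time limit for holding off I/O in the frozen virtual machine.
Has anybody else run into the same error?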
Thanks and best regards,
Steffen
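From: owner-toasters@mathworks.com [mailto:owner-toasters@mathworks.com] On Behalf Of Webster, Stetson Sent: Tuesday, November 03, 2009 10:26 AM To: Steffen Kammerer; Ken Williams; Nick Silkey Cc: toasters@mathworks.com Subject: RE: SMVI / VMWare Experiences...
This is very commonly an alignment issue. Snapshots exacerbate the already intense I/O caused by misalignment. Essentially, the existence of VMware snapshots (although they are 100% successful) causes the I/O on the system to intensify while ONTAP is actually trying to quiesce for a snapshot. So we have a bad situation that intensifies at the wrong time.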
Have you tried to take the NetApp snapshot manually (sans SMVI) after the VMware snapshots have all completed? My guess is that you will find that the snapshot takes a very long time to complete (if it does).
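To make the extra I/O from misalignment concrete, here is a minimal sketch (not from the thread) that counts how many storage blocks a single guest I/O overlaps. It assumes a 4096-byte storage block, which is the figure discussed later in this thread and should be confirmed against TR-3747; the function name and the sample offsets are illustrative only.

```python
BLOCK = 4096  # assumed storage block size in bytes; confirm against TR-3747


def blocks_touched(offset: int, length: int, block: int = BLOCK) -> int:
    """Return how many storage blocks the byte range [offset, offset + length) overlaps."""
    if length <= 0:
        return 0
    first = offset // block
    last = (offset + length - 1) // block
    return last - first + 1


# A 4 KB guest write on a partition that starts on a block boundary.
print(blocks_touched(32768, 4096))  # 1

# The same write on a partition that starts at sector 63 (63 * 512 = 32256 bytes).
print(blocks_touched(32256, 4096))  # 2
```

Under these assumptions every misaligned 4 KB guest I/O lands on two storage blocks instead of one, which is roughly the doubling of back-end work that a snapshot then piles onto.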
Here is a document that discusses block alignment and the breadth of its impact:
Best Practices for File System Alignment in Virtual Environments: http://www.netapp.com/us/library/technical-reports/tr-3747.html
We have a tool called 'mbrscan' for identifying misalignment which is a part of our ESX Host Utilities available here:
FC Host Utilities for ESX(r): http://now.netapp.com/NOW/download/software/sanhost_esx/ESX
We also have this KB article for identifying misalignment:
How to diagnose misaligned I/O on Windows hosts: https://now.netapp.com/Knowledgebase/solutionarea.asp?id=kb36108
Finally, while mbralign is now part of the ESX Host Utilities, there is good, detailed documentation here as you navigate to the download page:
mbralign: http://now.netapp.com/NOW/download/tools/mbralign
Stetson M. Webster Professional Services Consultant NCIE-SAN, NCIE-B&R, SNIA-SCSN-E NetApp Professional Services - East 919.250.0052 Mobile Stetson.Webster@netapp.com Learn how: netapp.com/guarantee
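From: owner-toasters@mathworks.com [mailto:owner-toasters@mathworks.com] On Behalf Of Page, Jeremy Sent: Tuesday, November 03, 2009 10:38 AM To: Webster, Stetson; Steffen Kammerer; Ken Williams; Nick Silkey Cc: toasters@mathworks.com Subject: RE: SMVI / VMWare Experiences...
If your VMs are Windows it's relatively simple to do a WMI scan against them for the partition offset and then divide by 4096 (?) and make sure it's an even #. If not, you're not set up ideally. Something like below (please excuse my Scriptomatic-generated code).
' Assumed definitions (not in the original post): target host and the usual WMI flag constants.
strComputer = "."   ' "." is the local machine; use a host name for a remote VM
Const wbemFlagReturnImmediately = &h10
Const wbemFlagForwardOnly = &h20

' Connect to WMI on the target and list every disk partition.
Set objWMIService = GetObject("winmgmts:\\" & strComputer & "\root\CIMV2")
Set colItems = objWMIService.ExecQuery("SELECT * FROM Win32_DiskPartition", "WQL", _
    wbemFlagReturnImmediately + wbemFlagForwardOnly)

' Print the starting byte offset of each partition.
For Each objItem In colItems
    WScript.Echo "StartingOffset: " & objItem.StartingOffset
Next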
Crud. Make sure the partition offset / 4096 is an integer, not that it's even. Sorry.
And I'm not too certain about the 4096 #, so read the TR :)
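As a worked example of the corrected check, here is a small sketch that tests whether a StartingOffset value divides evenly by the block size. The 4096 default is the uncertain figure from the post above, so treat it as an assumption and confirm it against the TR. The function name and the sample offsets are illustrative, not output from any real host.

```python
BLOCK_SIZE = 4096  # assumed block size in bytes; the post above is unsure, so confirm against the TR


def is_aligned(starting_offset: int, block_size: int = BLOCK_SIZE) -> bool:
    """True when the partition's starting byte offset is a whole multiple of block_size."""
    return starting_offset % block_size == 0


# Illustrative offsets: sector 63 (63 * 512 bytes), 32 KB, and 1 MB.
for offset in (32256, 32768, 1048576):
    state = "aligned" if is_aligned(offset) else "MISALIGNED"
    print(f"StartingOffset {offset}: offset/{BLOCK_SIZE} = {offset / BLOCK_SIZE} -> {state}")
```

With these inputs, 32256 gives 7.875 and is flagged as misaligned, while 32768 and 1048576 give whole numbers (8 and 256) and pass.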
-----Original Message----- From: owner-toasters@mathworks.com [mailto:owner-toasters@mathworks.com] On Behalf Of Page, Jeremy Sent: Tuesday, November 03, 2009 10:38 AM To: Webster, Stetson; Steffen Kammerer; Ken Williams; Nick Silkey Cc: toasters@mathworks.com Subject: RE: SMVI / VMWare Experiences...
If your VMs are Windows it's relatively simple to do a WMI scan against them for the partition offset and then divide by 4096 (?) and make sure it's an even #. If not your not set up ideally. Something like below (please excuse my scriptomatic generated code).
Set objWMIService = GetObject("winmgmts:\" & strComputer & "\root\CIMV2") Set colItems = objWMIService.ExecQuery("SELECT * FROM Win32_DiskPartition", "WQL", _ wbemFlagReturnImmediately + wbemFlagForwardOnly)
For Each objItem In colItems WScript.Echo "StartingOffset: " & objItem.StartingOffset Next
-----Original Message----- From: owner-toasters@mathworks.com [mailto:owner-toasters@mathworks.com] On Behalf Of Webster, Stetson Sent: Tuesday, November 03, 2009 10:26 AM To: Steffen Kammerer; Ken Williams; Nick Silkey Cc: toasters@mathworks.com Subject: RE: SMVI / VMWare Experiences...
This is very commonly an alignment issue. Snapshots exacerbate the already intense I/O caused by misalignment. Essentially, the existence of VMware snapshots (although they are 100% successful), cause the I/O on the system to intensify while ONTAP is actually trying to quiesce for a snapshot. So we are having a bad situation that intensifies at the wrong time.
Have you tried to take the NetApp snapshot manually (sans SMVI) after the VMware snapshots have all completed? My guess is that you will find that the snapshot takes a very long time to complete (if it does).
Here is a document that discusses block alignment and the breadth of it's impact:
Best Practices for File System Alignment in Virtual Environments: http://www.netapp.com/us/library/technical-reports/tr-3747.html
We have a tool called 'mbrscan' for identifying misalignment which is a part of our ESX Host Utilities available here:
FC Host Utilities for ESX(r): http://now.netapp.com/NOW/download/software/sanhost_esx/ESX
We also have this KB article for identifying misalignment:
How to diagnose misaligned I/O on Windows hosts: https://now.netapp.com/Knowledgebase/solutionarea.asp?id=kb36108
Finally, while mbralign is now part of the ESX Host Utilities, there is good, detailed documentation here as you navigate to the download page:
mbralign: http://now.netapp.com/NOW/download/tools/mbralign
Stetson M. Webster Professional Services Consultant NCIE-SAN, NCIE-B&R, SNIA-SCSN-E NetApp Professional Services - East 919.250.0052 Mobile Stetson.Webster@netapp.com Learn how: netapp.com/guarantee
-----Original Message----- From: Steffen Kammerer [mailto:steffen.kammerer@brainlab.com] Sent: Tuesday, November 03, 2009 9:41 AM To: Ken Williams; Nick Silkey Cc: toasters@mathworks.com Subject: RE: SMVI / VMWare Experiences...
Hi there,
We have the same issue with SMVI 2.0 on nfs datastores... on some VMs we get the following error:
Cannot create a quiesced snapshot because the create snapshot operation exceeded the time limit for holding off I/O in the frozen virtual machine.
Does anybody approach the same error?!
Thanks and best regards,
Steffen
-----Original Message----- From: owner-toasters@mathworks.com [mailto:owner-toasters@mathworks.com] On Behalf Of Ken Williams Sent: Monday, September 14, 2009 6:51 PM To: Nick Silkey Cc: toasters@mathworks.com Subject: RE: SMVI / VMWare Experiences...
Sounds like whatever user-defined script you have is failing sometimes? Or perhaps it's a VMWare tools issue.
We've been able to track our issue down to the Guest OS level (win2k3 specifically). Looks like its an issue with VSS or LUN alignment.
I would recommend ensuring your LUNs are aligned (use the VMWare host util kit, mbrscan / mbralign). There is detailed documentation on the NOW.netapp.com site.
-----Original Message----- From: Nick Silkey [mailto:nick@silkey.org] Sent: Friday, September 11, 2009 7:15 PM To: Ken Williams Cc: toasters@mathworks.com Subject: Re: SMVI / VMWare Experiences...
Ken --
We too are experiencing issues with SMVI 1.2 bombing out when attempting to perform a VMware quiesce snap on _some_ RHEL5.3 32-bit VMs. A couple of notables:
- These problematic VMs have a 100% success rate at taking VMware quiesce snaps within vCenter, independent of SMVI. - The problem is 100% reproducible during night, day, etc. - We will deploy several VMs at a crack, all the same build. When the next SMVI schedule hits, some fail while others succeed. Bizarre. - Over time (weve been experiencing this issue for several weeks now), the 'problem' VMs change. Example: VMs abc and xyz will fail for days; without intervention, VM abc will stop failing while VM xyz continues to fail ... even if theyre part of the same deploy base template/kickstart. - We are nowhere near our snap limit on the volumes. - These problematic VMs only bomb when attempting a quiesce. Non-quiesce SMVI snaps work like a champ.
Been working with NetApp and VMware for some time now. Were at ESX 3.5u4+ to an 3160-R5 @ 7.2.6.1P3 via NFS + vCenter 4.0 + synch SnapMirror to another 3160-R5 @ 7.2.6.1P3. The only thing revealing is SMVI + vCenter logs of "cannot create a quiesced snapshot because the (user-supplied) custom prefreeze script in the virtual machine exited with a nonzero return code".
-- Nick
On Wed, Aug 26, 2009 at 5:32 PM, Ken Williams kwillia@smud.org wrote:
I'm looking for some experiences people out there may have with SMVI with NetApp. We're currently experiencing major issues with SMVI snapshots failing. I've had open tickets with NetApp/VMWare/Microsoft for 3 months and still have yet to have a solution.
My environment looks like such:
6 x HP DL380 G5 (32gb Ram) in a ESX Cluster Dual Emulex 10000 Cards in
each host. Cisco MDS SAN Netapp FAS3070 Cluster ~9tb aggregate for VMWare. VMFS Datastores ~10-15 VMs per datastore. ~50gb per VM. ASIS Turned on Volume and LUNspace reservation turned off OnTap 7.2.5.1 Windows 2003 Guest OS.
I cant see us reaching any limitation on the Filers or the SAN. Yet we
have random VMs failing snapshots every night. Are other people seeing
these issues? (I've gone through the gamut of troubleshooting, version
management of ESX/VMWareTools/etc). Snapshots timeout and fail at the VMWare/Guest level, not at the Netapp snapshot level.
We want to have SMVI function with VSS enabled.
Has anyone had failing snapshots been able to resolve a similar issue?
Or does anyone have SMVI working properly that we could use as a reference to compare configuration?
Ken Williams Storage Administrator, Business Technology Operations Sacramento Municipal Utility District E-Mail: kwillia@smud.org Phone: (916) 732-6744 Cell: (916) 240-4213
Please be advised that this email may contain confidential information. If you are not the intended recipient, please do not read, copy or re-transmit this email. If you have received this email in error, please notify us by email by replying to the sender and by telephone (call us collect at +1 202-828-0850) and delete this message and any attachments. Thank you in advance for your cooperation and assistance.
In addition, Danaher and its subsidiaries disclaim that the content of this email constitutes an offer to enter into, or the acceptance of, any contract or agreement or any amendment thereto; provided that the foregoing disclaimer does not invalidate the binding effect of any digital or other electronic reproduction of a manual signature that is included in any attachment to this email.
Please be advised that this email may contain confidential information. If you are not the intended recipient, please do not read, copy or re-transmit this email. If you have received this email in error, please notify us by email by replying to the sender and by telephone (call us collect at +1 202-828-0850) and delete this message and any attachments. Thank you in advance for your cooperation and assistance.
In addition, Danaher and its subsidiaries disclaim that the content of this email constitutes an offer to enter into, or the acceptance of, any contract or agreement or any amendment thereto; provided that the foregoing disclaimer does not invalidate the binding effect of any digital or other electronic reproduction of a manual signature that is included in any attachment to this email.
Crud. Make sure the partition offset/4096 is an integer, not even. Sorry.
And I'm not too certain about the 4096 #, read the TR :)
-----Original Message----- From: owner-toasters@mathworks.com [mailto:owner-toasters@mathworks.com] On Behalf Of Page, Jeremy Sent: Tuesday, November 03, 2009 10:38 AM To: Webster, Stetson; Steffen Kammerer; Ken Williams; Nick Silkey Cc: toasters@mathworks.com Subject: RE: SMVI / VMWare Experiences...
If your VMs are Windows it's relatively simple to do a WMI scan against them for the partition offset and then divide by 4096 (?) and make sure it's an even #. If not your not set up ideally. Something like below (please excuse my scriptomatic generated code).
Set objWMIService = GetObject("winmgmts:\" & strComputer & "\root\CIMV2") Set colItems = objWMIService.ExecQuery("SELECT * FROM Win32_DiskPartition", "WQL", _ wbemFlagReturnImmediately + wbemFlagForwardOnly)
For Each objItem In colItems WScript.Echo "StartingOffset: " & objItem.StartingOffset Next
-----Original Message----- From: owner-toasters@mathworks.com [mailto:owner-toasters@mathworks.com] On Behalf Of Webster, Stetson Sent: Tuesday, November 03, 2009 10:26 AM To: Steffen Kammerer; Ken Williams; Nick Silkey Cc: toasters@mathworks.com Subject: RE: SMVI / VMWare Experiences...
This is very commonly an alignment issue. Snapshots exacerbate the already intense I/O caused by misalignment. Essentially, the existence of VMware snapshots (although they are 100% successful), cause the I/O on the system to intensify while ONTAP is actually trying to quiesce for a snapshot. So we are having a bad situation that intensifies at the wrong time.
Have you tried to take the NetApp snapshot manually (sans SMVI) after the VMware snapshots have all completed? My guess is that you will find that the snapshot takes a very long time to complete (if it does).
Here is a document that discusses block alignment and the breadth of it's impact:
Best Practices for File System Alignment in Virtual Environments: http://www.netapp.com/us/library/technical-reports/tr-3747.html
We have a tool called 'mbrscan' for identifying misalignment which is a part of our ESX Host Utilities available here:
FC Host Utilities for ESX(r): http://now.netapp.com/NOW/download/software/sanhost_esx/ESX
We also have this KB article for identifying misalignment:
How to diagnose misaligned I/O on Windows hosts: https://now.netapp.com/Knowledgebase/solutionarea.asp?id=kb36108
Finally, while mbralign is now part of the ESX Host Utilities, there is good, detailed documentation here as you navigate to the download page:
mbralign: http://now.netapp.com/NOW/download/tools/mbralign
Stetson M. Webster Professional Services Consultant NCIE-SAN, NCIE-B&R, SNIA-SCSN-E NetApp Professional Services - East 919.250.0052 Mobile Stetson.Webster@netapp.com Learn how: netapp.com/guarantee
-----Original Message----- From: Steffen Kammerer [mailto:steffen.kammerer@brainlab.com] Sent: Tuesday, November 03, 2009 9:41 AM To: Ken Williams; Nick Silkey Cc: toasters@mathworks.com Subject: RE: SMVI / VMWare Experiences...
Hi there,
We have the same issue with SMVI 2.0 on nfs datastores... on some VMs we get the following error:
Cannot create a quiesced snapshot because the create snapshot operation exceeded the time limit for holding off I/O in the frozen virtual machine.
Has anybody else run into the same error?
Thanks and best regards,
Steffen
-----Original Message----- From: owner-toasters@mathworks.com [mailto:owner-toasters@mathworks.com] On Behalf Of Ken Williams Sent: Monday, September 14, 2009 6:51 PM To: Nick Silkey Cc: toasters@mathworks.com Subject: RE: SMVI / VMWare Experiences...
Sounds like whatever user-defined script you have is failing sometimes? Or perhaps it's a VMWare tools issue.
We've been able to track our issue down to the Guest OS level (win2k3 specifically). Looks like it's an issue with VSS or LUN alignment.
I would recommend ensuring your LUNs are aligned (use the VMWare host util kit, mbrscan / mbralign). There is detailed documentation on the NOW.netapp.com site.
-----Original Message----- From: Nick Silkey [mailto:nick@silkey.org] Sent: Friday, September 11, 2009 7:15 PM To: Ken Williams Cc: toasters@mathworks.com Subject: Re: SMVI / VMWare Experiences...
Ken --
We too are experiencing issues with SMVI 1.2 bombing out when attempting to perform a VMware quiesce snap on _some_ RHEL5.3 32-bit VMs. A couple of notables:
- These problematic VMs have a 100% success rate at taking VMware quiesce snaps within vCenter, independent of SMVI.
- The problem is 100% reproducible during night, day, etc.
- We will deploy several VMs at a crack, all the same build. When the next SMVI schedule hits, some fail while others succeed. Bizarre.
- Over time (we've been experiencing this issue for several weeks now), the 'problem' VMs change. Example: VMs abc and xyz will fail for days; without intervention, VM abc will stop failing while VM xyz continues to fail ... even if they're part of the same deploy base template/kickstart.
- We are nowhere near our snap limit on the volumes.
- These problematic VMs only bomb when attempting a quiesce. Non-quiesce SMVI snaps work like a champ.
Been working with NetApp and VMware for some time now. We're at ESX 3.5u4+ to a 3160-R5 @ 7.2.6.1P3 via NFS + vCenter 4.0 + synch SnapMirror to another 3160-R5 @ 7.2.6.1P3. The only revealing thing is SMVI + vCenter logs of "cannot create a quiesced snapshot because the (user-supplied) custom prefreeze script in the virtual machine exited with a nonzero return code".
-- Nick
On Wed, Aug 26, 2009 at 5:32 PM, Ken Williams kwillia@smud.org wrote:
I'm looking for some experiences people out there may have with SMVI with NetApp. We're currently experiencing major issues with SMVI snapshots failing. I've had open tickets with NetApp/VMWare/Microsoft for 3 months and still have yet to have a solution.
My environment looks like this:
6 x HP DL380 G5 (32gb Ram) in an ESX Cluster
Dual Emulex 10000 Cards in each host.
Cisco MDS SAN
Netapp FAS3070 Cluster
~9tb aggregate for VMWare.
VMFS Datastores
~10-15 VMs per datastore.
~50gb per VM.
ASIS Turned on
Volume and LUN space reservation turned off
OnTap 7.2.5.1
Windows 2003 Guest OS.
I can't see us reaching any limitation on the Filers or the SAN. Yet we have random VMs failing snapshots every night. Are other people seeing these issues? (I've gone through the gamut of troubleshooting, version management of ESX/VMWareTools/etc). Snapshots time out and fail at the VMWare/Guest level, not at the Netapp snapshot level.
We want to have SMVI function with VSS enabled.
Has anyone who has had failing snapshots been able to resolve a similar issue?
Or does anyone have SMVI working properly that we could use as a reference to compare configuration?
Ken Williams Storage Administrator, Business Technology Operations Sacramento Municipal Utility District E-Mail: kwillia@smud.org Phone: (916) 732-6744 Cell: (916) 240-4213
Please be advised that this email may contain confidential information. If you are not the intended recipient, please do not read, copy or re-transmit this email. If you have received this email in error, please notify us by email by replying to the sender and by telephone (call us collect at +1 202-828-0850) and delete this message and any attachments. Thank you in advance for your cooperation and assistance.
In addition, Danaher and its subsidiaries disclaim that the content of this email constitutes an offer to enter into, or the acceptance of, any contract or agreement or any amendment thereto; provided that the foregoing disclaimer does not invalidate the binding effect of any digital or other electronic reproduction of a manual signature that is included in any attachment to this email.
Yes, we have the same issue.
Came down to a few things:
1. Update to 7.3.2p7. There are algorithm changes to WAFL that help with VMFS/VMDK reading/writing. I saw a HUGE performance change with this.
2. Check your CPU (sysstat); ours was pegged due to NDMP backups. Changing the backup times helped out a bunch.
3. Disable File Sync in VMWare tools on each guest. The File Sync driver is problematic and not recommended. You can remove it on each guest through Add/Remove Programs, under VMWare tools.
4. VMWare admitted this is a problem; most users accept the workaround of not doing quiesced backups. There is a checkbox in SMVI that will allow you to skip the VMWare-level snaps.
Otherwise try snapshotting smaller groups (10 max) of VMs. We're at about 80% success right now; not great but moving in the right direction.
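The "smaller groups" suggestion can be sketched as follows. This is only an illustration of splitting a VM list into batches of at most 10 before handing each batch to a backup job; the VM names are hypothetical and nothing here calls SMVI itself.

```python
def batches(items, size=10):
    """Yield consecutive slices of `items`, each holding at most `size` entries."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


if __name__ == "__main__":
    # Hypothetical inventory of 25 VMs: vm01 .. vm25.
    vms = [f"vm{n:02d}" for n in range(1, 26)]
    for number, group in enumerate(batches(vms), start=1):
        # Each group would become its own backup job, scheduled at a different time.
        print(f"job {number}: {len(group)} VMs, {group[0]} .. {group[-1]}")
```

Spreading the groups across separate schedules keeps fewer VMs frozen at once, which is the point of the advice above.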
Correction: the Data ONTAP version in item 1 of my previous message should read 7.3.1.1p7.
Thanks for all your answers...
We ran some tests yesterday with ESX 4, cloning and snapshotting...
It seems that the problem is not caused by SMVI. If we try to clone the machines that failed to create a snapshot, the clone fails with the same "exceeded the time limit for holding off I/O in the frozen virtual machine" error.
The error appears after 5 seconds...
But if we do not quiesce, we may get inconsistent snapshots. Do you have any experience restoring non-quiesced snapshots?
Thanks and best regards,
Steffen
If you don't get VMWare snapshots in your SMVI process then your backups would be inconsistent. So conceptually the restore would be akin to a traditional physical server backup. The state of the VM will be unknown and thus there could be a fsck/chkdsk process that would need to occur. I find it perfectly acceptable for systems to be backed up "inconsistent"; this is the old methodology for backups. The only problem you run into is application awareness (i.e. VMWare pre/post snapshot scripts to quiesce applications or databases).
We sent one of our VMs that was consistently erroring with snapshot backups to NetApp; they were able to recreate the problem in their lab with our VM.
On a side note: VM disk alignment is HUGE, make sure you're aligned (I bet you're hearing a lot of this; it can really make a performance difference). I recommend the tools from NetApp: mbrscan/mbralign.