Hi all,
So I've been able to reallocate about 20% of our luns so far, and it's already made a huge improvement. I still see some big latency spikes (up to 500ms) on some luns, mostly ones I haven't reallocated yet, but the spikes I originally described, where every volume in the aggregate (sometimes the entire filer) spiked at the same time, have disappeared. Our VMware guy and the Citrix team are also reporting performance improvements. I must admit I'm surprised by how quickly reallocate has improved this environment, to the point where users are already noticing it, and I'm nowhere near finished. If you've never checked your reallocation measurement, I highly recommend it! Thanks especially to Jeff Mohler and Fletcher Cocquyt: I found your blog posts through Google before you'd emailed, and thank you for writing them, they are really good.
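If you want to check yours, the rough sequence I've been using is below (7-Mode syntax; the path is just an example, point it at a lun or volume path):

  reallocate measure -o /vol/vm_vol01/lun01   # one-off layout measurement of a lun (or volume) path
  reallocate status -v                        # the optimization value (and hotspot, if any) shows up here (and, I believe, in the system log)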
I was thinking over the weekend, as you do, about why I'd sometimes see spikes on multiple aggregates at once, and one common denominator there is the backend loops. CPU has never been an issue. Does it make sense that the loops become a bottleneck on systems with severe fragmentation, with everything on the filer queueing up there behind the IO going to one disk?
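In case anyone wants to sanity-check the same theory on their own kit, something like this should show whether a handful of disks are pinning the loop (7-Mode; statit needs advanced privilege):

  sysstat -u 1      # watch the "Disk util" column during a spike
  priv set advanced
  statit -b         # start collecting per-disk statistics
  # ...wait through a spike, then:
  statit -e         # dump the report and look for a few disks far busier than the rest
  priv set admin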
Anyway, the question I really would like answered this time is: how best to balance reallocation and de-dupe? On Friday I reallocated a lun that's been de-duplicated, and the measurement changed from 6 (hotspot 0) to 4 (hotspot 28). Obviously the hotspot is due to the de-duplicated blocks. So is it better to leave a higher reallocate measure (in this case 6) with no hotspot, or to reallocate these luns to lower the overall measure, even though it creates a horrible hotspot?
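For anyone who wants to reproduce the comparison, the sequence is roughly this (7-Mode syntax, placeholder path; using -p is my reading of the man page, so shout if that's the wrong flag for de-duplicated volumes):

  sis status /vol/vm_vol02                     # confirm de-dupe is enabled on the volume
  reallocate measure -o /vol/vm_vol02/lun01    # before: 6, hotspot 0
  reallocate start -f -p /vol/vm_vol02/lun01   # -p = physical reallocation (my understanding: avoids inflating snapshot space)
  reallocate measure -o /vol/vm_vol02/lun01    # after: 4, hotspot 28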
Thanks, Peta
On 30 April 2012 03:29, Milazzo Giacomo G.Milazzo@sinergy.it wrote:
At last, an answer that makes sense :)
Let me explain better. I'm referring to what Fletcher wrote: “It was maddening to me back in 2010 how netapp support could blockade support cases with a blanket "must align VMs first" without a real quantification of the impact of misalignment – see”
We had a couple of critical cases of these “performance” issues, and in both cases the first thing support asked us to do was: realign!
Well, after a long and tedious realignment process things were better... on average 5% better, not more! And don't let the customer see the graphs coming from the CMPG portal! Something like the one attached... that could be pure terrorism :-D and a nightmare for you trying to explain (and understand) the meaning of the buckets :)
Alignment is important... but not THAT important. I would concentrate the investigation on other levels first.
Regards,
From: toasters-bounces@teaparty.net [mailto:toasters-bounces@teaparty.net] On behalf of Fletcher Cocquyt Sent: Friday, 27 April 2012 08:46 To: Peta Thames Cc: Toasters@teaparty.net Subject: Re: Reallocation and VMware
Peta - we were dealing with this very issue (unexplained latency spikes NetApp blamed on VM misalignment)
back in 2010. I wrote up how we deconstructed the IOPS, after many wasted perfstat iterations,
to solve it pretty much on our own:
http://www.vmadmin.info/2010/07/vmware-and-netapp-deconstructing.html
It was maddening to me back in 2010 how netapp support could blockade support cases with a
blanket "must align VMs first" without a real quantification of the impact of misalignment - see
http://www.vmadmin.info/2010/07/quantifying-vmdk-misalignment.html
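One quick way to get that kind of quantification yourself, if your release exposes the per-lun alignment histogram counters (I believe recent 7-Mode releases do, but check yours), is something like:

  stats show lun:*:read_align_histo    # buckets 1-7 (non-zero sector offsets) are the misaligned reads
  stats show lun:*:write_align_histo   # same idea for writes; bucket 0 is the aligned case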
We ended up taking the downtime back then to align all VMs.
But now I would encourage you to make the leap to 8.x - we are on 8.1GA and we are not looking back.
The data motion of vFilers is allowing us to upgrade clusters with no downtime
http://www.vmadmin.info/2012/04/meta-storage-vmotion-netapp-datamotion.html
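If memory serves, the CLI side of that boils down to something like the commands below (we actually drive it through Provisioning Manager, vf_vmware and old-filer are made-up names, and I may be misremembering the exact subcommands, so treat this as a sketch):

  vfiler migrate start vf_vmware@old-filer      # begin the background copy of the vFiler to the new head
  vfiler migrate status vf_vmware@old-filer     # watch progress
  vfiler migrate complete vf_vmware@old-filer   # final cutover - a brief pause rather than an outage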
They have me almost believing in cluster mode for scale out...
On Apr 25, 2012, at 11:40 PM, Peta Thames wrote:
Hi Jack,
You're right, and I should have mentioned it before. Large numbers of the VMDKs are misaligned. I'd estimate about 33%, but I don't know exactly how many, because the shiny new VSC scanner got stuck halfway through the scan I ran, leaving several VMs in a "being scanned" state. I have a case open with NetApp to find out how to get those VMs out of that state so I can a) continue the scan and b) schedule fixing the misaligned luns.
Not all the luns that have large latency spikes are misaligned, however. Mind you, by the same token, not all of them are fragmented either, although so far (I'm still working through measuring them all) there's definitely a strong correlation.
I also have to admit that I read the scale wrong in Performance Advisor, and the numbers I'm seeing are in microseconds, not milliseconds. Still way more than the 10ms I would like, but three orders of magnitude better than I first thought!
Peta
On 26 April 2012 15:52, Jack Lyons jack1729@gmail.com wrote:
Have you checked the alignment of the VMDKs?
Jack
Sent from my Verizon Wireless BlackBerry
-----Original Message-----
From: Peta Thames petathames@gmail.com
Sender: toasters-bounces@teaparty.net
Date: Thu, 26 Apr 2012 14:49:43
Subject: Reallocation and VMware
Hi all,
I'd like to pick your collective brains about your experiences with
reallocate, specifically when reallocating luns under VMware.
For background, we're running ONTAP 8.0.1 on a 3170 that's over three
years old. I've been going through measuring reallocation, and most
of the volumes are over 3. We have no snapshots, and only a
relatively small number of volumes are de-duplicated. All our volumes
and luns are thin-provisioned, and no aggregate is more than 76% full
(most are ~65%). We regularly have huge latency spikes (worst I've
seen so far is 5000000ms, and there are far too many to even track
over 50000ms daily), and on one filer head, but not its partner, I
regularly see disk utilisation go to 100% or more. I'm hoping
reallocate will help here.
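So far I've mostly been reading these numbers out of Performance
Advisor; for anyone who wants to cross-check from the CLI, something
like the following should show per-volume latency and ops (counter
names and flags from memory, and vm_vol01 is just a placeholder):

  stats show -i 5 -n 12 volume:vm_vol01:avg_latency   # average latency per op (in microseconds, if I remember right)
  stats show -i 5 -n 12 volume:vm_vol01:total_ops     # ops/sec for the same volume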
I have a brief note from a NetApp support person who says "It’s very
important that you complete the reallocation in the following order:
1:OS 2:LUN 3: Volume".
I have two questions about this:
- is it absolutely necessary to defrag the OS before you reallocate
the lun? I'm sure I've run reallocate without defragging the OS and
still seen performance improvements. I'm also assuming that this is
only relevant to Windows VMs, not Linux (in our case, Red Hat/CentOS)
ones.
- if you only have one lun per volume, do you still need to run
reallocate on both the lun and the volume? If only one, which is
preferable? (To be concrete, the two forms I mean are sketched below.)
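For clarity, the two forms I'm weighing up are roughly these (7-Mode
syntax, placeholder names, and ignore the exact flags for now):

  reallocate start -f -p /vol/vm_vol01/lun01   # lun-level: optimise the layout of that one lun
  reallocate start -f -p /vol/vm_vol01         # volume-level: as I understand it, this works on everything in the volume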
All advice appreciated.
Thanks,
Peta
:)
Sent from my iPhone
Not to beat a dead horse here, but here is a trailing question.
I have mostly Windows VMs, and most were not initially aligned (I still have ~60 VMs to go out of a few hundred). I read in the DOT 8 7-Mode system administration guide (p. 322) that you should set up a reallocate job immediately after creating the lun. We run VSC and take snapshots daily. I am going to run the following command on each vol (I have one vol with one lun inside it): reallocate -f -p /vol/vmware_vmware_01_sata
Do most folks run reallocate, and if so, how often? These are standard Windows (W2K3/W2K8) servers. I know the mileage will vary here.
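In full-syntax form I assume that's equivalent to the first line below, and I'm wondering whether a schedule makes more sense than a one-off run (the schedule string is my reading of the man page, so please correct me):

  reallocate start -f -p /vol/vmware_vmware_01_sata              # one-off: -f = full reallocation, -p = physical (snapshot-friendly)
  reallocate schedule -s "0 23 * 6" /vol/vmware_vmware_01_sata   # weekly, Sat 23:00 (minute hour day-of-month day-of-week)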
Thanks for posting this, Peta.
-----Original Message----- From: toasters-bounces@teaparty.net [mailto:toasters-bounces@teaparty.net] On Behalf Of Jeff Mohler Sent: Sunday, May 13, 2012 5:22 PM To: Peta Thames Cc: Toasters@teaparty.net Subject: Re: Reallocate redux: Reallocation and de-dupe
:)
Sent from my iPhone
_______________________________________________
Toasters mailing list
Toasters@teaparty.net
http://www.teaparty.net/mailman/listinfo/toasters