Hello All,
After a few months of planning and preparation we were just about to pull the trigger on an 8.1.1 upgrade to our v3170 currently running 7.3.5.1P5 with a 3Par T800 spinning 528 SATA spindles. We've crawled through release notes and upgrade notes, used the online "upgrade advisor" tools, checked this list and engaged our NetApp SE. We tested against a simulator, killed a chicken at midnight of the full moon, etc.
Then last week one of our admins found this link and asked in passing if this issue was relevant to our situation:
Data Ontap 8.1 upgrade - RLW_Upgrading process and other issues
https://communities.netapp.com/thread/22676
Looks like upgrading to 8.1.x implements a new layer of protection called RLW (RAID protection from Lost Writes). This requires the addition of some metadata to the disk system, and after the upgrade a background process "rlw_update" runs for some period of time. The trouble is, this process does not "nice" itself as it's meant to when other processes are running. Worse, if it runs at the same time as other "not so nice" processes like de-dupe there can be disastrous performance issues. This problem is exacerbated if disk utilization is high, or if slower disks are used, or if a lot of misaligned traffic is running. Users in the wild have reported the rlw_update process taking several weeks and horrible performance issues during its tenure.
Yikes, I'm just glad we found out about this now and not Monday morning. Our filer consistently runs ~90% disk utilization, is usually running two or three de-dupe processes, is running on SATA disks, and one node does nothing but serve up NFS datastores to our VMware farm, which we suspect is running mostly misaligned VMDK files.
So now we've postponed the upgrade. We'll need to retest against 8.1.2 (or later). We're wary of rushing to 8.1.2 until it's been out a while, especially as NetApp seems to be on a roll lately, releasing new versions to address performance issues that then seem to have performance issues of their own. We're also concerned about upgrading to any new version while the system is heavily loaded (we're also in the process of deploying a new block-level storage system that will offload more than half of the current performance load on the v3170).
So. This post is intended first as a heads-up to anyone in a similar situation who's about to upgrade to 8.1.1. Another takeaway might be to acknowledge the value of the official NetApp Communities pages: I've usually relied on this toasters list and/or our NetApp/VAR technical staff, but will also be following the "official" user community resources from now on.
And I also welcome any feedback from anyone with experience or information to offer. Anyone been in this situation? Anyone running 8.1.2 yet? Anyone have advice on upgrading a significantly busy system?
Hope to hear from you,
Randy Rue Seattle
Did you actually see an “rlw_update” process in “ps” output? Could you show an example of “ps” output where this process is seen? This is the first time I have heard of an “rlw_update” *process*. There is an “rlw_upgrading” aggregate flag …
I have one filer here where the RLW update is still running and I do not see this process. RLW upgrading is performed as part of normal aggregate scrubbing. Maybe you are confusing the extra load caused by aggregate scrubbing with load caused by RLW upgrading?
The thread you refer to mixes together at least half a dozen different performance-related problems, and in the end none of them is related to RLW. The thread is actually a pretty bad source of information because it is no longer possible to tell which problem is being discussed.
Also, on forums.netapp.com a NetApp employee gave a pretty good explanation of what the RLW upgrade is.
-andrey
From: toasters-bounces@teaparty.net [mailto:toasters-bounces@teaparty.net] On Behalf Of Randy Rue Sent: Tuesday, December 04, 2012 2:37 AM To: toasters@teaparty.net Subject: 8.1.1 RLW Performance Trouble, 8.1.2 issues?
Based on what I had read on the communities.netapp.com post, I did the following when I upgraded two lightly-used 3240s to 8.1.1P1:
* Upgraded both filers to Data ONTAP 8.1.1P1
* Temporarily disabled dedupe
* Performed an "aggr scrub" on each of the aggregates (took around a day or two to complete)
* Re-enabled dedupe
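The post-upgrade steps above could be written down as an ordered command plan before touching the filer. This is only a rough sketch: the volume and aggregate names are made up, the function name is hypothetical, and the `sis off`, `sis on`, `aggr scrub start` and `aggr scrub status` strings are the 7-Mode console commands as I understand them, so please check them against the man pages for your release. The script only builds the list; it does not connect to or run anything on a filer.

```python
def rlw_scrub_plan(dedupe_volumes, aggregates):
    """Return the console commands, in order, for the post-upgrade steps.

    Nothing is executed here; the caller decides how (and whether) to run them.
    Volume paths and aggregate names are supplied by the caller.
    """
    plan = []
    # Temporarily disable dedupe on each volume.
    plan += [f"sis off {vol}" for vol in dedupe_volumes]
    # Start a scrub on each aggregate, then check progress.
    for aggr in aggregates:
        plan.append(f"aggr scrub start {aggr}")
    plan.append("aggr scrub status")
    # Re-enable dedupe (to be run only after the scrubs have completed).
    plan += [f"sis on {vol}" for vol in dedupe_volumes]
    return plan


if __name__ == "__main__":
    # Hypothetical names, for illustration only.
    for cmd in rlw_scrub_plan(["/vol/vol1"], ["aggr0", "aggr1"]):
        print(cmd)
```

Keeping the plan as plain data makes it easy to review with a colleague, or to paste in one line at a time during the maintenance window.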
After performing the aforementioned steps I noticed no increase in CPU usage, and the system is as responsive as it was before. I am still concerned, however, that the upgrade will have a more noticeable impact on more heavily used systems such as our PROD 3240s & 6080s.
The bottom line is that this release, and possibly future releases, will continue to require more resources as features are added, making the effects of minor issues such as LUN misalignment more severe.
Dan
On 04/12/12 06:37, Randy Rue wrote:
Looks like upgrading to 8.1x implements a new layer of protection called RLW (RAID protection from Lost Writes). This requires the addition of some metadata to the disk system and after upgrade a background process "rlw_update" runs for some period of time. The trouble is, this process does not "nice" itself as it's meant to when other processes are running. Worse, if it runs at the same time as other "not so nice" processes like de-dupe there can be disastrous performance issues. This problem is exacerbated if disk utilization is high, or if slower disks are used, or if a lot of misaligned traffic is running. Users in the wild have reported the rlw_update process taking several weeks and horrible performance issues during its tenure.
I came across this KB article linked from the support community FAQs https://kb.netapp.com/support/index?page=content&id=3013583 which says:
"Note: 'rlw_upgrading' is just a flag/state, it does not indicate an active process running in the background. This means that there is 'no' background process impacting the storage system's performance. The only performance impact expected is that of scrub, which can be scheduled and stopped by the usual means (for more information, see the 'aggr scrub' man pages). The active process of performing the upgrade is included as part of a RAID scrub. A full manual scrub will not be initiated automatically following an upgrade of Data ONTAP. The aggr scrub status command will indicate if RAID scrubs are currently suspended (not actively running at that moment)."
And I also welcome any feedback from anyone with experience or information to offer. Anyone been in this situation? Anyone running 8.1.2 yet? Anyone have advice on upgrading a significantly busy system?
I upgraded to 8.1.1p1 a few weeks ago, and my root aggregates are now rlw_on, but my other ones are still rlw_upgrading. I haven't noticed any performance problems, but they're not heavily loaded.
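For keeping track of which aggregates are still waiting on a scrub, something like the following might do. It is a small sketch under assumptions: the aggregate names are invented, the function name is hypothetical, and the state mapping is filled in by hand from whatever the filer reports, since I am not assuming any particular output format to parse. Only the two state names mentioned in this thread (rlw_on and rlw_upgrading) are used.

```python
def still_upgrading(aggr_states):
    """Return the sorted names of aggregates whose RLW state is not 'rlw_on'."""
    return sorted(
        name for name, state in aggr_states.items() if state != "rlw_on"
    )


if __name__ == "__main__":
    # Hypothetical aggregates: root aggregate done, data aggregates pending.
    states = {
        "aggr0": "rlw_on",
        "aggr_data2": "rlw_upgrading",
        "aggr_data1": "rlw_upgrading",
    }
    print(still_upgrading(states))
```

Since the KB text quoted above says the upgrade work is done as part of a RAID scrub, the names this returns are the aggregates that still need a scrub to run through.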