8.1.1 RLW Performance Trouble, 8.1.2 issues? - toasters

3 Dec 2012


Hello All,
After a few months of planning and preparation we were just about to pull the trigger on an 8.1.1 upgrade to our v3170, currently running 7.3.5.1P5 with a 3PAR T800 spinning 528 SATA spindles. We've crawled through release notes and upgrade notes, used the online "upgrade advisor" tools, checked this list and engaged our NetApp SE. Tested against a simulator, killed a chicken at midnight of the full moon, etc.
Then last week one of our admins found this link and asked in passing if this issue was relevant to our situation:
Data Ontap 8.1 upgrade - RLW_Upgrading process and other issues
https://communities.netapp.com/thread/22676
Looks like upgrading to 8.1.x implements a new layer of protection called RLW (RAID protection from Lost Writes). This requires the addition of some metadata to the disk system, and after the upgrade a background process, "rlw_update", runs for some period of time. The trouble is, this process does not "nice" itself as it's meant to when other processes are running. Worse, if it runs at the same time as other "not so nice" processes like de-dupe, there can be disastrous performance issues. The problem is exacerbated if disk utilization is high, if slower disks are used, or if a lot of misaligned traffic is running. Users in the wild have reported the rlw_update process taking several weeks, with horrible performance issues for as long as it runs.
Yikes, I'm just glad we found out about this now and not Monday morning. Our filer consistently runs at ~90% disk utilization, is usually running two or three de-dupe processes, is running on SATA disks, and one node does nothing but serve up NFS datastores to our VMware farm, which we suspect is running mostly misaligned VMDK files.
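For anyone who wants to check their own filer against the risk factors listed above, here is a rough sketch in Python. Everything in it is an illustrative placeholder rather than anything from NetApp: the FilerState fields, the rlw_risk_factors name, and especially the 70% "busy" threshold are assumptions, and the values are meant to be filled in by hand from whatever monitoring you already use.

```python
from dataclasses import dataclass


@dataclass
class FilerState:
    """Hand-entered snapshot of a filer (hypothetical fields)."""
    disk_utilization_pct: float    # typical disk busy percentage
    dedupe_jobs_running: int       # de-dupe processes usually active
    slow_disks: bool               # e.g. SATA rather than FC/SAS
    misaligned_io_suspected: bool  # e.g. misaligned VMDK files


def rlw_risk_factors(state, busy_threshold_pct=70.0):
    """Return which of the reported risk factors apply.

    The threshold is an assumed cutoff for "high" utilization,
    not a vendor-documented number.
    """
    factors = []
    if state.disk_utilization_pct >= busy_threshold_pct:
        factors.append("high disk utilization")
    if state.dedupe_jobs_running > 0:
        factors.append("de-dupe running")
    if state.slow_disks:
        factors.append("slower disks")
    if state.misaligned_io_suspected:
        factors.append("misaligned traffic")
    return factors


# Our situation as described above: ~90% busy, three de-dupe jobs,
# SATA disks, suspected misaligned VMDKs.
ours = FilerState(90.0, 3, True, True)
print(rlw_risk_factors(ours))
```

A filer that trips all four, as ours does, looks like the worst case described in the Communities thread; one that trips none may be a very different story.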
So now we've postponed the upgrade. We'll need to retest against 8.1.2 (or later). We're reluctant to jump to 8.1.2 until it's been out a while, especially as NetApp seems to be on a roll lately, releasing new versions to address performance issues that then seem to have performance issues of their own. We're also concerned about upgrading to any new version while the system is heavily loaded (we're in the process of deploying a new block-level storage system that will offload more than half of the current performance load on the v3170).
So. This post is intended first as a heads-up to anyone in a similar situation who's about to upgrade to 8.1.1. Another takeaway might be to acknowledge the value of the official NetApp Communities pages: I've usually relied on this toasters list and/or our NetApp/VAR technical staff, but will also be following the "official" user community resources from now on.
I also welcome feedback from anyone with experience or information to offer. Has anyone been in this situation? Is anyone running 8.1.2 yet? Does anyone have advice on upgrading a significantly busy system?
Hope to hear from you,
Randy Rue 
Seattle