Hi,
We're probably going to be adding another FCAL controller to our 840 soon. Can volumes include disks that are on both controllers?
----- Original Message ----- From: "Matt Phelps" mphelps@cfa.harvard.edu To: toasters@mathworks.com Sent: Tuesday, January 09, 2001 11:52 AM Subject: Can volumes span FCAL controllers?
Hi,
We're probably going to be adding another FCAL controller to our 840 soon. Can volumes include disks that are on both controllers?
Yes, but there's a slight performance penalty in some cases under heavy writes, so it's not recommended.
Bruce
That's not entirely true -- you can span a volume across controllers, just don't span RAID groups across controllers.
Bruce Sterling Woodcock wrote:
----- Original Message ----- From: "Matt Phelps" mphelps@cfa.harvard.edu To: toasters@mathworks.com Sent: Tuesday, January 09, 2001 11:52 AM Subject: Can volumes span FCAL controllers?
Hi,
We're probably going to be adding another FCAL controller to our 840 soon. Can volumes include disks that are on both controllers?
Yes, but there's a slight performance penalty in some cases under heavy writes, so it's not recommended.
Bruce
--
Jason Santos
UNIX System Administrator
ON Semiconductor
jason.santos@onsemi.com
(602) 244-3769
----- Original Message ----- From: "Jason Santos" jason.santos@onsemi.com To: "Bruce Sterling Woodcock" sirbruce@ix.netcom.com Cc: "Matt Phelps" mphelps@cfa.harvard.edu; toasters@mathworks.com Sent: Thursday, January 11, 2001 7:07 PM Subject: Re: Can volumes span FCAL controllers?
That's not entirely true -- you can span a volume across controllers, just don't span RAID groups across controllers.
When a volume has 2 RAID groups, is the NVRAM split among RAID groups? How are CPs done?
Bruce
That's not entirely true -- you can span a volume across controllers, just don't span RAID groups across controllers.
When a volume has 2 RAID groups, is the NVRAM split among RAID groups? How are CPs done?
Bruce
As I understand it, NVRAM is used for logging write requests from clients -- not as a disk buffer cache. The filer periodically generates consistency points where the disk volumes are perfectly consistent. These occur no less often than every 10 seconds.
In order to update the disks efficiently, the filer allows write requests to accumulate for awhile and commits them in a coordinated fashion. This greatly reduces the load on the parity drives. (If you write a bunch of blocks in the same stripe at the same time, you only have to update the parity block once.)
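To make the parity arithmetic concrete, here is a small illustrative sketch (my own toy code, not anything from the filer): with RAID 4-style XOR parity, committing a whole stripe at once means computing the parity block a single time, instead of doing a read-modify-write of the parity disk for every individual block.

    from functools import reduce

    BLOCK_SIZE = 4096  # illustrative block size

    def xor_blocks(a, b):
        """XOR two equal-sized blocks byte by byte."""
        return bytes(x ^ y for x, y in zip(a, b))

    def full_stripe_parity(data_blocks):
        """One parity computation for the whole stripe -- one parity write."""
        return reduce(xor_blocks, data_blocks, bytes(BLOCK_SIZE))

    def per_block_parity_update(parity, old_block, new_block):
        """Read-modify-write style: new parity = parity XOR old data XOR new data.
        Doing this for every block touches the parity disk once per block."""
        return xor_blocks(xor_blocks(parity, old_block), new_block)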
Once a consistency point has been generated, the NVRAM log is cleaned up to make room for logging more incoming write requests.
Due to the design of WAFL, writing a new consistency point does not undo or damage the previous consistency point.
Here's how the filer recovers from a crash or loss of power when the volumes are inconsistent. When the filer comes up, it reverts back to the most recent consistency point (no more than 10 sec old) and replays all the write requests logged in NVRAM that arrived after the consistency point was generated.
So in answer to your question, the NVRAM is shared by all volumes and raid groups because it is a log of incoming write requests, not a disk buffer cache.
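Here's a toy model of the behaviour described above (purely illustrative, not ONTAP internals): incoming writes are logged in NVRAM and buffered in main memory; a consistency point flushes the buffered data for all volumes to disk and then frees the log; after a crash, the filer falls back to the last consistency point and replays the logged requests.

    class ToyFiler:
        """Illustrative model of NVRAM-as-request-log plus consistency points."""

        def __init__(self):
            self.nvram_log = []      # logged write requests (survive a crash)
            self.memory_buffer = {}  # pending writes, keyed by (volume, block)
            self.disk = {}           # on-disk state as of the last consistency point

        def write(self, volume, block, data):
            self.nvram_log.append((volume, block, data))   # log the request
            self.memory_buffer[(volume, block)] = data     # buffer the data in RAM
            return "ack"  # acknowledged to the client once the request is logged

        def consistency_point(self):
            # Flush every buffered write, for all volumes, in one coordinated burst.
            self.disk.update(self.memory_buffer)
            self.memory_buffer.clear()
            self.nvram_log.clear()   # entries are only needed until the CP completes

        def recover_from_crash(self):
            # RAM contents are lost; disk still holds the last consistency point.
            self.memory_buffer.clear()
            for volume, block, data in self.nvram_log:     # replay logged requests
                self.memory_buffer[(volume, block)] = data
            self.consistency_point()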
Steve Losen scl@virginia.edu phone: 804-924-0640
University of Virginia ITC Unix Support
Steve Losen scl@sasha.acc.virginia.edu wrote:
[in response to Bruce Sterling Woodcock sirbruce@ix.netcom.com]
That's not entirely true -- you can span a volume across controllers, just don't span RAID groups across controllers.
When a volume has 2 RAID groups, is the NVRAM split among RAID groups? How are CPs done?
[ Good introduction to consistency points for newbies ]
So in answer to your question, the NVRAM is shared by all volumes and raid groups because it is a log of incoming write requests, not a disk buffer cache.
But to resume the original thread, the question is whether or not the writes done as part of CPs are clustered in a way that helps to reduce the overheads of switching between FCAL controllers.
CPs for different volumes are logically distinct operations, but are in practice synchronised, either by the 10-second clock or by NVRAM filling up. In saying that one should avoid spreading a volume over multiple controllers, but can have different volumes on different controllers, the assumption is that writes (mostly) occur first to one volume, then to another.
If the same is to apply to RAID groups, then the assumption is that the writes associated with taking a CP on a single volume are (mostly) clustered by RAID group. This sounds entirely reasonable, given that the filer tries to write whole stripes, or at least stripes in which as many planes as possible are being updated.
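One way to picture the clustering described above (a sketch only, with made-up names): take the blocks dirtied since the last CP, sort them into their raid groups, and issue each group's writes as one batch, so a given adapter sees long runs of work rather than interleaved single-block requests.

    from collections import defaultdict

    def batch_cp_writes(dirty_blocks, raid_group_of):
        """Group dirty blocks by raid group so each adapter gets clustered batches.

        dirty_blocks  -- iterable of block ids dirtied since the last CP
        raid_group_of -- function mapping a block id to its raid group
        """
        batches = defaultdict(list)
        for block in dirty_blocks:
            batches[raid_group_of(block)].append(block)
        # Issue one batch per raid group; full stripes mean one parity update each.
        return dict(batches)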
Chris Thompson
University of Cambridge Computing Service,
New Museums Site, Cambridge CB2 3QG, United Kingdom
Email: cet1@ucs.cam.ac.uk
Phone: +44 1223 334715
Hello Toasters!
I hope to clarify the multiple-controller issue with a fairly long-winded note. If I leave out anyone's point about RAID groups or volumes spanning FCAL controllers, let me know, or let the whole list know. I'm game.
I suspect this precautionary rule of thumb arose from a hardware bug in the 700 series, burt 19290. There's a short description on NOW here:
http://now.netapp.com/NOW/cgi-bin/bol?Type=Detail&Display=19290
It's cryptic and brief, and should probably be reworded. I'll look into that. The long story is, arbitrating multiple PCI bus requests on 700-series hardware didn't go as smoothly as we'd hoped, resulting in performance degradation in this case. How often does the case come up, you ask? The worst case is when writes over 100TX Ethernet go to a quad Ethernet card (which causes a lot of interrupts) in slot 1, 2 or 3; an FCAL controller is also in slot 1, 2 or 3; and another FCAL controller with destination disks for the write op is in slot 4, 5, 6 or 7.

It's worth pointing out that gigabit controllers buffer data much more efficiently, and aren't throttled by waiting for PCI interrupts. A single Ethernet card doesn't have so much data to unload, so it can get the bus much more readily. A quad card has lots of data to unload, with very little buffer space on the controller. Scheduling interrupts for the NIC to unload the data, and two other interrupts to load data onto the two FCAL controllers, will not go as fast as the quad card would like. This is why the NIC starts reporting h/w overflows and bus underruns in ifstat.
For the raid group / volume distinction, writes are allocated per volume, and the free space in the given raid groups will determine where the writes go. NVRAM isn't divided on a per-volume or per-raid-group basis. The write data, by and large, isn't in NVRAM. The write transaction description is in NVRAM until the write is committed to disk, normally with the data being served from system memory. Teeny tiny writes are the exception to this.

If you blow out an FCAL controller (which really doesn't happen that often), WAFL won't see the disks on the end of the controller. If either a volume or a raid group in a given volume is split across this hypothetically blown controller, the affected volume will lose disks. If a raid group loses more than one disk, the volume goes away. Performance and redundancy are at odds for this configuration, but the exposure is low.
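As a rough sketch of the allocation idea (the real WAFL write allocator is far more involved, and these names are invented for illustration): within a volume, new blocks go to whichever raid group has room, so free space decides where a CP's writes land.

    def allocate_writes(blocks, raid_groups):
        """Assign each new block to the raid group with the most free space.

        raid_groups -- dict mapping group name to free block count
        Returns a dict mapping group name to the list of blocks placed there.
        """
        placement = {group: [] for group in raid_groups}
        for block in blocks:
            # Pick the group with the most free space remaining (a crude heuristic;
            # a real allocator also cares about stripe geometry and locality).
            target = max(raid_groups, key=raid_groups.get)
            if raid_groups[target] == 0:
                raise RuntimeError("volume is full")
            placement[target].append(block)
            raid_groups[target] -= 1
        return placement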
So, the following configurations are safe for multi-adapter volumes:
+ 800-series filers
+ GbE workloads
+ non write-intensive workloads
+ writes directed at multiple volumes simultaneously
+ 700-series implementing the workaround in burt 19290
Anyone left?
WAFL volumes can, and often do, span FCAL adapters. In many cases there is a performance benefit from spreading volumes across multiple FCALs, due to improved load balancing between adapters. Performance will not suffer if you split a RAID group (or volume) across adapters, barring bug 19290. Most of our F840 performance benchmarks are run in this configuration. Follow-up questions are welcome, but may not be answered until Monday. Enjoy!
For the raid group / volume distinction, writes are allocated per volume, and the free space in the given raid groups will determine where the writes go. NVRAM isn't divided on a per-volume or per-raid group basis. The write data, by and large, isn't in NVRAM. The write transaction description is in NVRAM until the write is committed to disk, normally with the data being served from system memory. Teeny tiny writes are the exception to this.
Yes, I understand that writes are logged simultaneously to both RAM and NVRAM and the writes come from RAM and NVRAM is only used in case of a crash. However, the data in RAM has to be written to disk before the NVRAM can be flushed. Normally NVRAM is divided into two sections anyway. The question is, with volumes and groups, is NVRAM divided further? When a CP is triggered, either by timer or log full, does it write out all volumes or all groups or one volume or one group or what?
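The "two sections" mentioned here are usually described as a double buffer; here is a hedged sketch of that idea (illustrative only, not ONTAP internals): one half logs new requests while the CP triggered by the other half filling up is still being written out.

    class DoubleBufferedNVRAM:
        """Illustrative two-half NVRAM log: one half fills while the other flushes."""

        def __init__(self, half_capacity, flush_to_disk):
            self.halves = [[], []]
            self.active = 0                     # half currently accepting new entries
            self.half_capacity = half_capacity
            self.flush_to_disk = flush_to_disk  # callback that performs the CP

        def log(self, entry):
            self.halves[self.active].append(entry)
            if len(self.halves[self.active]) >= self.half_capacity:
                full, self.active = self.active, 1 - self.active
                self.flush_to_disk(self.halves[full])  # CP for the half that filled
                self.halves[full].clear()              # reusable once the CP is done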
If one write has to span two different controllers, then there has to be at least a minimal timing impact. The impact will, of course, be "behind the scenes" and not affect the response time unless writes are sufficiently intensive that you're doing a cp_from_cp.
Bruce
* It was Sat, Jan 13, 2001 at 12:01:52AM -0800 when Bruce Sterling Woodcock wrote:
For the raid group / volume distinction, writes are allocated per volume, and the free space in the given raid groups will determine where the writes go. NVRAM isn't divided on a per-volume or per-raid group basis. The write data, by and large, isn't in NVRAM. The write transaction description is in NVRAM until the write is committed to disk, normally with the data being served from system memory. Teeny tiny writes are the exception to this.
Yes, I understand that writes are logged simultaneously to both RAM and NVRAM and the writes come from RAM and NVRAM is only used in case of a crash. However, the data in RAM has to be written to disk before the NVRAM can be flushed.
Not necessarily. If the filer doesn't crash, yes. The write is reported successful to the client as soon as it hits NVRAM.
The question is, with volumes and groups, is NVRAM divided further?
No, NVRAM isn't divided on a per-volume or per-raid group basis.
When a CP is triggered, either by timer or log full, does it write out all volumes or all groups or one volume or one group or what?
It writes out data from all volumes that had writes to them.
If one write has to span two different controllers, then there has to be at least a minimal timing impact. The impact will, of course, be "behind the scenes" and not affect the response time unless writes are sufficiently intensive that you're doing a cp_from_cp.
But you're sending out data to two controllers, with two different write queues, talking to different disks. The time spent passing along the interrupt, compared to the time spent doing the I/O, is negligible.
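A last sketch of that point about negligible overhead (the figures are invented, purely to show the orders of magnitude): handing a request to a second controller's queue costs on the order of microseconds, while the disk I/O itself costs milliseconds, so splitting a CP's writes across two queues barely moves the flush time.

    # Rough, invented figures purely to illustrate the ratio being argued.
    INTERRUPT_HANDOFF_US = 10   # cost of queueing a request on another controller
    DISK_WRITE_US = 5_000       # cost of the write I/O itself

    def cp_time_us(blocks_per_controller):
        """Approximate time to flush a CP split across controller queues."""
        per_queue = blocks_per_controller * (DISK_WRITE_US + INTERRUPT_HANDOFF_US)
        return per_queue  # queues drain in parallel, so the slowest one dominates

    print(cp_time_us(100))  # handoff overhead adds roughly 0.2% to the flush time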