How fast can I realistically expect an F740 to write out files doing a restore? I have an F740 that is currently doing nothing else, with a 10/100 Mbps Ethernet NIC connected to a Cisco 3548XL switch, and one shelf of 7x36GB drives. I'm dumping from a Sun E420R (4x450-MHz, 50GB Barracudas on LSILogic Ultra2/LVD adapters) connected to the same switch via 1000base-SX. The command line used is:
ufsdump 0bf 128 - /dev/rdsk/c1t0d0s0 | rsh home-tape restore xbfD 128 - /vol/vol0/local
The F740 (home-tape) seems to be the bottleneck in this case:
# rsh home-tape sysstat 5
 CPU    NFS   CIFS   HTTP    Net kB/s   Disk kB/s   Tape kB/s  Cache
                             in    out  read  write  read write   age
100%      0      0      0  6316    104    29   6127     0     0   >60
 99%      0      0      0  6396    106    28   6558     0     0   >60
100%      0      0      0  6356    105    24   7764     0     0   >60
100%      0      0      0  6316    104    40   7189     0     0   >60
 93%      0      0      0  5983     99    28   6112     0     0   >60
100%      0      0      0  6395    106    28   6465     0     0   >60
 99%      0      0      0  6272    104    51   8878     0     0   >60
100%      0      0      0  6506    107    36   6149     0     0   >60
[...]
An 8-hour restore window gives enough time to do about 150GB (which, coincidentally, is how much usable space is on 5x36GB data drives ;-)). Any performance tuning tips for doing a fast-as-possible disaster recovery restore like this? I'm going to try streaming data off tape next, instead of kludging it with a ufsdump, but it seems like the Netapp can't go any faster.
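(As a rough sanity check on that window, and nothing more precise than that: the ~6,300 kB/s that sysstat is showing works out to a bit over 170GB in 8 hours if it holds, so 150GB of usable data fits with a little headroom for slow patches and the finishing pass at the end:

# echo '6300 * 3600 * 8 / 1024 / 1024' | bc
173

so the arithmetic, at least, is not the problem.)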
Is it possible to tell how much of the CPU is used up in rsh overhead, or TCP/IP overhead? Would I gain much if I used NDMP to stream the data to the filer? What about replacing the Fast Ethernet NIC with a Gigabit NIC that can offload packet checksumming from the main CPU? Anyone have numbers?
What about replacing the Fast Ethernet NIC with a Gigabit NIC that can offload packet checksumming from the main CPU?
This was mentioned by someone earlier today as well. AFAICT, checksum offload is only available on the Alteon cards, which are no longer being shipped (in favor of the Intel gigabit cards).
..kg..
taob@risc.org (Brian Tao) writes:
How fast can I realistically expect an F740 to write out files
doing a restore?
[...]
ufsdump 0bf 128 - /dev/rdsk/c1t0d0s0 | rsh home-tape restore xbfD 128 - /vol/vol0/local
The F740 (home-tape) seems to be the bottleneck in this case:
# rsh home-tape sysstat 5
 CPU    NFS   CIFS   HTTP    Net kB/s   Disk kB/s   Tape kB/s  Cache
                             in    out  read  write  read write   age
100%      0      0      0  6316    104    29   6127     0     0   >60
 99%      0      0      0  6396    106    28   6558     0     0   >60
100%      0      0      0  6356    105    24   7764     0     0   >60
100%      0      0      0  6316    104    40   7189     0     0   >60
 93%      0      0      0  5983     99    28   6112     0     0   >60
100%      0      0      0  6395    106    28   6465     0     0   >60
 99%      0      0      0  6272    104    51   8878     0     0   >60
100%      0      0      0  6506    107    36   6149     0     0   >60
[...]
I've achieved somewhat higher disk write rates than that during a restore on an F740 from a locally attached DLT7000, and the limiting factor then was the tape speed (restore was essentially the same speed as dump, on an otherwise fairly idle system). I didn't keep careful notes, but my recollection is that the F740 CPU utilisation was under 50%. The network driving overheads do seem to be implicated in your case.
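(The local-tape equivalent of your pipeline is just the filer's own restore reading straight from the drive, i.e. something of this shape, where the tape device name and blocking factor are only examples and have to match how the dump was written:

rsh home-tape restore xbfD 64 rst0a /vol/vol0/local

or the same restore command run at the filer's console.)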
BTW, the ufsdump 'b' is in 512-byte blocks while the ONTAP restore 'b' is in kilobytes, so specifying the latter as 128 is unnecessarily large. I doubt whether that's anything to do with the performance problem, though.
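If you do want the two sides to line up, 128 x 512-byte ufsdump blocks is 64 KB per record, so (just matching the units, untested) the restore end of your pipeline would become:

ufsdump 0bf 128 - /dev/rdsk/c1t0d0s0 | rsh home-tape restore xbfD 64 - /vol/vol0/local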
Chris Thompson
University of Cambridge Computing Service,    Email: cet1@ucs.cam.ac.uk
New Museums Site, Cambridge CB2 3QG,          Phone: +44 1223 334715
United Kingdom.
On Tue, 11 Jul 2000, Chris Thompson wrote:
taob@risc.org (Brian Tao) writes:
# rsh home-tape sysstat 5
 CPU    NFS   CIFS   HTTP    Net kB/s   Disk kB/s   Tape kB/s  Cache
                             in    out  read  write  read write   age
100%      0      0      0  6316    104    29   6127     0     0   >60
 99%      0      0      0  6396    106    28   6558     0     0   >60
100%      0      0      0  6356    105    24   7764     0     0   >60
100%      0      0      0  6316    104    40   7189     0     0   >60
 93%      0      0      0  5983     99    28   6112     0     0   >60
100%      0      0      0  6395    106    28   6465     0     0   >60
 99%      0      0      0  6272    104    51   8878     0     0   >60
100%      0      0      0  6506    107    36   6149     0     0   >60
[...]
I didn't keep careful notes, but my recollection is that the F740 CPU utilisation was under 50%. The network driving overheads do seem to be implicated in your case.
Well, this is interesting then... I tried a dump of the same local filesystem, but this time the output was piped to a ufsrestore writing to the Netapp via NFS instead of via rsh:
# rsh home-tape sysstat 5
 CPU    NFS   CIFS   HTTP    Net kB/s   Disk kB/s   Tape kB/s  Cache
                             in    out  read  write  read write   age
 39%    223      0      0  6362     41    28   7742     0     0     5
 42%    286      0      0  7014     63    18   7819     0     0     5
 36%    213      0      0  5635     43    23   7815     0     0     5
 39%    669      0      0  5916    133    14   5146     0     0     5
 43%    239      0      0  6912     48    21   9060     0     0     5
 42%    217      0      0  6789     44    30   7981     0     0     5
 46%    220      0      0  7445     44    29   9057     0     0     5
 46%    243      0      0  8306     49    18   7718     0     0     5
 38%    216      0      0  6370     44    16   7712     0     0     5
[...]
I'm getting similar (if not better) throughput, and less than half the CPU usage, this way. I had run the "ufsdump | rsh restore" pipeline again just before this one, and verified that none of the previous results had changed. This is contrary to what I was expecting: surely a restore running natively on the Netapp would be faster than Solaris ufsrestore writing over NFS. Anybody at Netapp care to comment?
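For anyone wanting to try the same thing, the NFS-based pipeline is nothing exotic; roughly this shape (the mount point and options here are only illustrative, not necessarily exactly what I ran):

# mkdir -p /mnt/home-tape
# mount -F nfs -o vers=3,proto=tcp home-tape:/vol/vol0 /mnt/home-tape
# cd /mnt/home-tape/local
# ufsdump 0bf 128 - /dev/rdsk/c1t0d0s0 | ufsrestore rbf 128 -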
Next step is to borrow a gigabit Ethernet NIC and see if I get a boost from that.
This latest restore is still running, and I'm waiting to see if this avoids two problems with the rsh method: premature termination and symlink inode count bug. For reasons I have not yet discovered, ufsdump reports "DUMP: Broken pipe" after restore on the Netapp does the finishing "Setting CIFS names to original values" and "Verifying restore correctness" bit. Also, the bug where a restored symlink counts as two inodes in a quota tree is still there in 5.3.6R1. I can empty out the quota tree completely after the restore, and "quota report" still claims (in this case) that there are 9000+ inodes allocated to it.
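For what it's worth, the check is nothing clever: count what is actually in the tree over NFS and compare it with what the filer thinks is there (mount point again only illustrative):

# find /mnt/home-tape/local -print | wc -l
# rsh home-tape quota report

After emptying the tree, the find comes back with next to nothing, while quota report still shows the 9000+ inodes.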
BTW, the ufsdump 'b' is in 512-byte blocks while the ONTAP restore 'b' is in kilobytes, so specifying the latter as 128 is unnecessarily large. I doubt whether that's anything to do with the performance problem, though.
Nope, no effect, as you suspected.
On Fri, 14 Jul 2000, Brian Tao wrote:
This latest restore is still running, and I'm waiting to see if
this avoids two problems with the rsh method: premature termination and symlink inode count bug. For reasons I have not yet discovered, ufsdump reports "DUMP: Broken pipe" after restore on the Netapp does the finishing "Setting CIFS names to original values" and "Verifying restore correctness" bit. Also, the bug where a restored symlink counts as two inodes in a quota tree is still there in 5.3.6R1. I can empty out the quota tree completely after the restore, and "quota report" still claims (in this case) that there are 9000+ inodes allocated to it.
Yep, both these problems with the local Netapp restore are avoided with Solaris ufsrestore over NFS. Both ufsdump and ufsrestore exit normally after completing the entire filesystem, and there is no double-counting of symlinks on the Netapp side. I'm pretty surprised by this result, since my expectation was that a "ufsdump | rsh restore" would have been the preferred method to restore a filesystem via the network. Instead, it seems to suffer from at least a couple of bugs, chews up more CPU, and offers no performance advantage!