Linux-NFS list recommended I post here -- RE: NetApp versus Linux
All --
The Linux-NFS list recommended I post here. I am looking at either a $35K NetApp F720 or an $18K Linux (w/Ext3+NFS3 patches) solution. I'd greatly appreciate any input anyone has.
My boss favors the NetApp solution on paper (he has never used one himself, nor does he do any admin-level stuff). But IMHO, unless we spend $70K to buy two (2) NetApp F720s and take advantage of the NetApp's superior failover/load-balancing capability, I'm not sure it's the right move. We still have to keep a Linux server around for CVS (and NIS?), which would put a NetApp at the mercy of the Linux box anyway (any way to get CVS on NetApp?).
I break down my limited knowledge below. I currently have a 100GB Linux 2.2 server (Ext2+NFS2) that I didn't set up. I have heavily reconfigured it because the consulting firm didn't know jack crap, but some things, like the 1KB block size on the /home filesystem that makes fsck take forever, I can't change right now.
We're looking at going to 250GB (possibly 500-1000GB in the next 6 months). Most of our clients are Solaris 2.6 (about a dozen serious production boxes with dual- to 6-way processors, 1 to 6GB of RAM, etc...). We are starting to add more and more Linux workstations (both dedicated and dual-booting with NT).
Stability:
----------
+++ NetApp (F720) Boots in under 2 minutes due to the use of NVRAM even after an improper shutdown. Buying two (2) units is an extremely viable failover/load-balancing solution. [ Unfortunately, my boss doesn't want to spend more than $30K. ]
--- Linux (Ext2) Although our Linux systems have averaged >100 days uptime, a routine power outage without a proper shutdown (unable to umount NFS clients, etc...) causes an fsck. We figure 30 minutes per 100GB for Ext2 fscks on 8-16KB blocks (it currently takes 90 minutes with 1KB blocks, something the consultants before me did -- dumb!). Never had a problem with lost data though.
--- Linux (ReiserFS) Meta-data journaling, but has NFS locking issues (except for a SuSE patch to the old knfsd?). Eliminated from consideration.
??? Linux (Ext3) Full data journaling means sync I/O??? What about current stability (I'm using Trond+Higgen 0.21.3 on non-production systems with Solaris 2.6 clients and I'm happy)? Locking with Ext3 (never used Ext3, but many others are now)? I know VA Linux now has Linux 2.2 kernels with Ext3 and NFS3 that are supposedly quite stable.
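As a rough illustration of the Ext2 fsck figure above (30 minutes per 100GB), the following Python sketch scales it to the planned volume sizes. The linear scaling is an assumption made for the sake of the example, not something measured here; real fsck time also depends on inode count, block size, and disk speed.

```python
# Rough Ext2 fsck downtime estimate.
# ASSUMPTION: fsck time grows linearly with filesystem size.

FSCK_MINUTES_PER_100GB = 30  # figure quoted in the post


def fsck_minutes(size_gb, minutes_per_100gb=FSCK_MINUTES_PER_100GB):
    """Return the estimated fsck time in minutes for a volume of size_gb."""
    return size_gb / 100 * minutes_per_100gb


if __name__ == "__main__":
    # Sizes mentioned in the post: current, planned, and possible growth.
    for size_gb in (100, 250, 500, 1000):
        print(f"{size_gb:>5} GB: about {fsck_minutes(size_gb):.0f} minutes")
```

Under that assumption, the 250GB target would mean a bit over an hour of fsck after an unclean shutdown, and a 1000GB volume would mean several hours.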
Performance:
------------
??? NetApp (F720 + 1000Base-SX NIC) While the NVRAM in the NetApp allows almost instant sync acknowledgement to clients, the F720 only comes with 1/4 the NVRAM of the F740/760 (8MB versus 32MB). The result is that the F720 can only handle 1/4 to 1/3 of the NFS requests and throughput of the F740/760. I have seen good marks on the latter, but cannot find much on the former, which makes me wonder if we will run into a performance issue. Transmeta also mentioned that they have had performance issues with the 700-series under high load because the on-board CPU, a single Alpha (usually ~600MHz), cannot even keep up with the number of XORs for just the RAID-4 volume. Of course I only have a dozen NFS clients and Transmeta has many more. The hard drives are of the older 36GB, 7200rpm, Ultra2 (aka Ultra80) SCSI type.
+++ Linux (dual-cpu/memory ServerSet IIIHE chipset + dual-733MHz Pentium III + 233MHz StrongARM-powered, 4-channel, Ultra160 RAID controller on its own PCI 64-bit channel + 1000Base-SX NIC card on its own 66MHz PCI 64-bit channel) These ServerSet III-powered mainboards seem to best both the older 4-way i450HX EDO mainboards and the 2-way i840 in just about every memory benchmark. Going to use 2GB of RAM (which is 8x as much as in the F720), and can expand to 8GB later. The 3 independent PCI channels (of which 2 are 66MHz PCI 64-bit) allow me to put the disk controller and network controller on their own 64-bit channels. The Mylex 233MHz StrongARM-powered RAID controllers are the fastest you can get for just about anything (including Linux, especially when compared to vendors who still use a 100MHz i960 ASIC), and I'm going to pair one with 73GB, 10000rpm, Ultra160 disks with two disks per channel spread over all 4 channels. On the software side, I've had no issues running async with Linux. I'm more worried about Ext3's full-data journaling being the slouch here. Also wondering how long it takes, exactly, to recover after an improper shutdown with full-data journaling Ext3 (I'm assuming <5 minutes for 250GB?).
[ Further insight wanted: Just wondering if the F720 will be a slouch compared to this? Or will the NVRAM really kick Ext3's butt, even if there is only 8MB of it? ]
Other Considerations:
---------------------
NetApp (Failover Option) Features a "ready-to-go" fail-over option via proprietary gigabit links between units. Another unit can even use the disk controller and disks of a unit whose CPU unit has failed. Very nice. Also lowers the admin requirements (like having to keep kernel/apps up-to-date, etc...). Someone on the Linux-NFS list actually mentioned administration wasn't so easy with NetApp (no web-based admin???) and would rather have Linux (I personally couldn't believe this, but she was a seasoned Linux admin like myself).
Linux (Software Support) I still need a Linux box to serve as the CVS server, mail server (including various Sendmail/mail-based programs like AV scanner, HP OpenMail, Majordomo, GNATS, etc...), NIS server (or does NetApp have NIS server capabilities?), etc... This makes me wonder if I really will "save on TCO" with the NetApp, since my Linux admin outside of this has only been about 40-80 hours over the past year (compiling new kernels, installing new NFS, etc...).
Cost:
-----
--- NetApp F720 (7 x 36GB disk = 252GB, 216GB usable RAID-4, 180GB if using hot spare) $35K for the configuration (F720 = 600MHz? Alpha, 256MB RAM, 8MB NVRAM) with both SMB and NFS plus the 1000Base-SX upgrade.
+++ Linux (5 x 73GB disk = 365GB, 292GB usable RAID-5, 219GB if using hot spare) $18K for the configuration as above (dual-733MHz, 2GB RAM, RAID controller, 1000Base-SX NIC, etc...) with SAF-TE rackmount disk chassis, ATX rackmount chassis, cabling and redundant power on both the SAF-TE and ATX case.
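The capacity figures in the two cost entries above follow from simple single-parity arithmetic: one disk's worth of space goes to parity, and one more disk is set aside if a hot spare is used. The Python sketch below reproduces those numbers and adds a cost per usable GB. The function names are made up for this illustration, and the cost-per-GB figure ignores filesystem overhead and any support contracts.

```python
# Usable capacity of a single-parity array (RAID-4 or RAID-5).
# Function names are hypothetical, chosen only for this illustration.


def usable_gb(disks, disk_gb, hot_spare=False):
    """One disk is lost to parity, plus one more if a hot spare is kept."""
    data_disks = disks - 1 - (1 if hot_spare else 0)
    if data_disks < 1:
        raise ValueError("not enough disks for a single-parity array")
    return data_disks * disk_gb


def cost_per_usable_gb(price_usd, capacity_gb):
    """Price divided by usable capacity, in dollars per GB."""
    return price_usd / capacity_gb


if __name__ == "__main__":
    # (label, price in USD, number of disks, disk size in GB) from the post.
    configs = [
        ("NetApp F720, RAID-4", 35000, 7, 36),
        ("Linux, RAID-5", 18000, 5, 73),
    ]
    for label, price, disks, disk_gb in configs:
        for spare in (False, True):
            gb = usable_gb(disks, disk_gb, hot_spare=spare)
            note = "with hot spare" if spare else "no hot spare"
            print(f"{label} ({note}): {gb} GB usable, "
                  f"${cost_per_usable_gb(price, gb):.2f}/GB")
```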
Thanx in advance ...
-- Bryan "TheBS" Smith "Lead Computer Geek" Theseus Logic, Inc.
+++ NetApp (F720) Boots in under 2 minutes due to the use of NVRAM even after an improper shutdown. Buying two (2) units is an extremely viable failover/load-balancing solution. [ Unfortunately, my boss doesn't want to spend more than $30K. ]
2 minutes is really unlikely unless you're just using vanilla NFS and haven't generated a core. Think more like 5 minutes (still impressive).
--- Linux (Ext2) Although our Linux systems have averaged >100 days uptime, a routine power outage without a proper shutdown (unable to umount NFS clients, etc...) causes an fsck. We figure 30 minutes per 100GB for Ext2 fscks on 8-16KB blocks (it currently takes 90 minutes with 1KB blocks, something the consultants before me did -- dumb!). Never had a problem with lost data though.
You'll like Netapp then, since you won't have to fsck due to power outages.
--- Linux (ReiserFS) Meta-data journaling, but has NFS locking issues (except for a SuSE patch to the old knfsd?). Eliminated from consideration
Don't Ext2 and Ext3 have locking issues too?
??? Linux (Ext3) Full data journaling means sync I/O??? What about current stability (I'm using Trond+Higgen 0.21.3 on non-production systems with Solaris 2.6 clients and I'm happy)? Locking with Ext3 (never used Ext3, but many others are now)? I know VA Linux now has Linux 2.2 kernels with Ext3 and NFS3 that are supposedly quite stable.
But Linux's NFS implementation has been notoriously poor over the years. Why would you have any faith in it now? Anyway, you get all that data journalling and so on with Netapp.
??? NetApp (F720 + 1000Base-SX NIC) While the NVRAM in the NetApp allows almost instant sync acknowledgement to clients, the F720 only comes with 1/4 the NVRAM of the F740/760 (8MB versus 32MB). The result is that the F720 can only handle 1/4 to 1/3 of the NFS requests and throughput of the F740/760. I have seen good marks on the latter, but cannot find much on the former, which makes me wonder if we will run into a performance issue.
I think you're a little confused. NVRAM only matters (well, almost only) with disk writes. Your throughput is mainly going to be based on the CPU speed and number of disks, not on NVRAM, because your environment is going to be read-dominated. NVRAM won't be an issue unless you're constantly writing 4-8MB of data every second... are you?
Transmeta also mentioned that they have had performance issues with the 700-series under high load because the on-board CPU, a single Alpha (usually ~600MHz), cannot even keep up with the number of XORs for just the RAID-4 volume.
This doesn't even make any sense to me. Obviously the number of XORs will depend on a lot of things, but name me a server that doesn't have performance issues under high load. I can't even understand what they are saying. If they are saying there's not enough CPU for their environment, they can buy a bigger filer or buy a second one and split the load. How is this any different from Linux?
Of course I only have a dozen NFS clients and Transmeta has many more. The hard drives are of the older 36GB, 7200rpm, Ultra2 (aka Ultra80) SCSI type.
These issues shouldn't matter to you. What matters is the bottom line in performance, not what components it takes to get there.
+++ Linux (dual-cpu/memory ServerSet IIIHE chipset + dual-733MHz Pentium III + 233MHz StrongARM-powered, 4-channel, Ultra160 RAID controller on its own PCI 64-bit channel + 1000Base-SX NIC card on its own 66MHz PCI 64-bit channel)
Yes, and you'll need it, because your Linux server is such a poor file server in comparison.
These ServerSet III-powered mainboards seem to best both the older 4-way i450HX EDO mainboards, and the 2-way i840 in just about every memory benchmark. Going to use 2GB of RAM (which is 8x as much as in the F720), and can expand to 8GB later.
Yes, and you'll need it, because your Linux server is such a poor file server in comparison.
The 3 independent PCI channels (of which 2 are 66MHz PCI 64-bit) allow me to put the disk controller and network controller on their own 64-bit channels. The Mylex 233MHz StrongARM-powered RAID controllers are the fastest you can get for just about anything (including Linux, especially when compared to vendors who still use a 100MHz i960 ASIC), and I'm going to pair one with 73GB, 10000rpm, Ultra160 disks with two disks per channel spread over all 4 channels.
Yes, and you'll need it, because your Linux server is such a poor file server in comparison.
On the software side, I've had no issues running async with Linux. I'm more worried about Ext3's full-data journaling being the slouch here. Also wondering how long it takes, exactly, to recover after an improper shutdown with full-data journaling Ext3 (I'm assuming <5 minutes for 250GB?).
Dunno.
[ Further insight wanted: Just wondering if the F720 will be a slouch compared to this? Or will the NVRAM really kick Ext3's butt, even if there is only 8MB of it? ]
It isn't so much the NVRAM that does it but the fact that the filer has great software and is tuned only for file service. I'm sure the F720 will compare very well to the Linux box.
NetApp (Failover Option) Features a "ready-to-go" fail-over option via proprietary gigabit links between units. Another unit can even use the disk controller and disks of a unit whose CPU unit has failed. Very nice. Also lowers the admin requirements (like having to keep kernel/apps up-to-date, etc...). Someone on the Linux-NFS list actually mentioned administration wasn't so easy with NetApp (no web-based admin???) and would rather have Linux (I personally couldn't believe this, but she was a seasoned Linux admin like myself).
There is web-based admin, but if you're a seasoned Linux admin you'll prefer the prompt. The administration is WAY easier than any UNIX (even Linux) box.
Linux (Software Support) I still need a Linux box to serve as the CVS server, mail server (including various Sendmail/mail-based programs like AV scanner, HP OpenMail, Majordomo, GNATS, etc...), NIS server (or does NetApp have NIS server capabilities?), etc... This makes me wonder if I really will "save on TCO" with the NetApp, since my Linux admin outside of this has only been about 40-80 hours over the past year (compiling new kernels, installing new NFS, etc...).
And this is an even better reason to go with Netapp, because it will be dedicated to doing file service, and thus do it very well. It won't be burdened by all these other tasks better served by a compute server dedicated for that. (PS - Netapp is NIS client only.)
--- NetApp F720 (7 x 36GB disk = 252GB, 216GB usable RAID-4, 180GB if using hot spare) $35K for the configuration (F720 = 600MHz? Alpha, 256MB RAM, 8MB NVRAM) with both SMB and NFS plus the 1000Base-SX upgrade.
You need to reduce the useable space by 10% for filesystem overhead.
+++ Linux (5 x 73GB disk = 365GB, 292GB usable RAID-5, 219GB if using hot spare) $18K for the configuration as above (dual-733MHz, 2GB RAM, RAID controller, 1000Base-SX NIC, etc...) with SAF-TE rackmount disk chassis, ATX rackmount chassis, cabling and redundant power on both the SAF-TE and ATX case.
Again, reduce the useable space by whatever ext3 reserves for filesystem overhead (probably about 10%).
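To see what the suggested 10% reduction does to the usable figures quoted above, here is a small Python sketch. The 10% value is the reply's own estimate (and only a guess for Ext3), the 250GB target comes from the original post, and the helper name is invented for this example.

```python
# Apply an assumed filesystem overhead to the usable RAID capacities.
# ASSUMPTION: a flat 10% overhead for both systems, as suggested in the reply.

OVERHEAD = 0.10
TARGET_GB = 250  # capacity goal stated in the original post

USABLE_GB = {
    "NetApp F720 RAID-4": 216,
    "NetApp F720 RAID-4 + hot spare": 180,
    "Linux RAID-5": 292,
    "Linux RAID-5 + hot spare": 219,
}


def net_gb(usable_gb, overhead=OVERHEAD):
    """Capacity left for data after subtracting the overhead fraction."""
    return usable_gb * (1 - overhead)


if __name__ == "__main__":
    for label, usable in USABLE_GB.items():
        net = net_gb(usable)
        verdict = "meets" if net >= TARGET_GB else "falls short of"
        print(f"{label}: {net:.1f} GB net, {verdict} the {TARGET_GB} GB target")
```

With these assumptions, only the Linux configuration without a hot spare clears 250GB as quoted; either system would need more disks to reach the target with a spare in place.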
The Netapp is also rackmount and you can get redundant power.
There are so many other things that you didn't mention though:
1. Netapp has Snapshots, which you will fall in love with.
2. You can add new disk easily within minutes, just by slapping in a new drive. You can use disks of different sizes in your RAID array. It is way easier to manage and configure than RAID-5.
3. Integrated NFS/CIFS support (if you buy the CIFS option) with better performance and interoperability (NT ACLs) than running Samba.
4. Performance, performance, performance!
Bruce
2000-07-20-16:25:44 Bruce Sterling Woodcock:
+++ NetApp (F720) Boots in under 2 minutes due to the use of NVRAM even after an improper shutdown. [...]
2 minutes is really unlikely unless you're just using vanilla NFS and haven't generated a core. Think more like 5 minutes (still impressive).
That is such a bummer!
I remember only a couple of years back, the proud boast was that you could kick the plug out of a netapp, and no more than 45 seconds after plugging it back in, it'd be live serving NFS data once again. I even remember a salescritter saying (I think on this list) that they'd had customers take that boast and try and feed it to 'em, with a line like ``Ok, let's try it, if it's not back in 45 seconds by my watch, you're outa here'', and stood by with confidence while the box passed the test.
When did the time increase to 2.6-6.6 times that once-proud 45-second figure?
At 5 minutes, that means Netapps are no longer booting significantly faster than generic servers.
-Bennett
First, let me say that replying to you is a real pain. You should reconfigure your email so that it sends plaintext messages, not putting your message in an attachment. It makes it very difficult to respond to.
When did the time increase to 2.6-6.6 times that once-proud 45-second figure?
At about release 3.0. You have lots more that has to initialize now.... more drives (probably), CIFS, NIS, Java, etc. Of course, the less you have, the faster it will boot.
However, I was specifically talking about in the case of a crash, in which case you have a savecore in there. That takes time.
At 5 minutes, that means Netapps are no longer booting significantly faster than generic servers.
I'd call it significant.
Someone told me recently they tested it out and got some exact times for certain configurations, but I can't find the reference to what they were. They were under 5 minutes IIRC.
Bruce
On Fri, 21 Jul 2000, Bennett Todd wrote:
At 5 minutes, that means Netapps are no longer booting significantly faster than generic servers.
I just tried a reboot on one of our pre-production F740's, and the NFS client (a Solaris machine) saw an NFS outage of 1m16s. Granted, this doesn't take into account a savecore, and I'm not sure how to simulate that. It would be nice to have a firmware setting to disable the boot up memory scan and save another 10 or 15 seconds.
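For reference, the boot times mentioned in this thread can be compared against the old 45-second claim with a few lines of Python. The labels are just descriptions of where each number came from in the thread, and the function name is made up for this example; 2 and 5 minutes work out to about 2.67 and 6.67 times the baseline, which is where the 2.6-6.6 range comes from.

```python
# Compare reported filer boot/outage times with the old 45-second claim.

BASELINE_SECONDS = 45

REPORTED_SECONDS = {
    "2 minutes (original post)": 2 * 60,
    "5 minutes (estimate after a crash with savecore)": 5 * 60,
    "1m16s (measured F740 reboot, no savecore)": 60 + 16,
}


def ratio_to_baseline(seconds, baseline=BASELINE_SECONDS):
    """How many times longer than the baseline a given outage is."""
    return seconds / baseline


if __name__ == "__main__":
    for label, seconds in REPORTED_SECONDS.items():
        print(f"{label}: {seconds} s, "
              f"{ratio_to_baseline(seconds):.2f}x the 45 s figure")
```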
Bruce Sterling Woodcock wrote:
Brian, do you have NIS or CIFS enabled?
I am running with NFS2 (RedHat patched kernel 2.2.16-8) and Samba 2.0.7 (in user security mode, i.e. no NT PDC) on my current Linux box.
-- Bryan
On Sat, 22 Jul 2000, Bruce Sterling Woodcock wrote:
Brian, do you have NIS or CIFS enabled?
No, this is a plain jane filer, not doing DNS resolution, no trunking, etc. Probably the fastest you'll get an F740 to boot.