I'm interested in hearing any stories on how Netapps wins or loses your business. I'm currently shopping for storage and a Netapps filer seems to fit the bill. I've talked to other storage system vendors -- that big one, EMC (or is it Data General), MTI, and some others.
They all emphasize that:
- SAN is the future of storage
- Netapps and NFS is a lock in to old technology
- Netapps is JBOD
- that Eurologic is an unknown scrap vendor
- Netapps are impossible to back up
- RAID 4 is unreliable
- WAFL is slow
- WAFL is impossible to recover when it becomes corrupt
- wack takes all day to report that you're still in trouble
- the cluster failover takes too long
- the cluster failover is unreliable at best
- the Netapps performance is terribly slow
- the software is big pile of patches that are impossible to keep track of
They've also given me references of companies that have switched from Netapps to other solutions.
In general three themes came from the references:
- they lost data with Netapps. Why? Because they could not take a backup with their enterprise backup software and then they suffered a multiple disk failure from a set of disks with problem firmware.
I've seen a lot of griping on this list lately about backups. How much of a problem is the backup situation, and does it become a problem when it requires a patch for the Netapps software that may introduce other instabilities?
- the Netapps filer was too slow for database use. I could understand how some of the storage arrays directly attached to a server could be faster. This isn't my application space, so I'm not outright concerned with it.
But, I have seen some of the recent messages on performance and throughput. How would you rate the Netapps for performance for things like home directory storage?
- the clustering did not work. Paid for cluster and the failover did not work as advertised.
This concerns me, because I am willing to pay the extra money for a cluster. As long as I'm shelling out, should I go to EMC for reliability or hope that the Netapps failover can work? I can get more Netapps for the money.
I'm not dumb enough to believe all of the claims that sales droids make. But, I'd like to hear some opinions as to what makes you buy Netapps, what keeps you on Netapps, and what will drive you away from Netapps.
--jeff skelton
"jss" == Jeffrey Skelton jss@NET2PHONE.COM writes:
jss> they lost data with Netapps. Why? Because they could not take a
jss> backup with their enterprise backup software and then they
jss> suffered a multiple disk failure from a set of disks with problem
jss> firmware.
*shrug* Not using anything spiffy here (yet); regular old dump to either a direct attach or over the network. Then again, we don't have umpteen GB of disk either.
jss> But, I have seen some of the recent messages on performance and
jss> throughput. How would you rate the Netapps for performance for
jss> things like home directory storage?
Pretty well, for me. ~800 clients on a F630 and it hardly breaks a sweat. I find I need more disk space than ops. Our toaster(s) in Build and Test, however, need a truckload of ops (we had the 740 hit ~8100 ops/sec the other day) and not much disk.
jss> - the clustering did not work. Paid for cluster and the failover
jss> did not work as advertised.
Heh. I think this is Nathan's cue.
K.
In the immortal words of Kendall Libby (fubar@mathworks.com):
jss> - the clustering did not work. Paid for cluster and the failover
jss> did not work as advertised.
Heh. I think this is Nathan's cue.
Eh. I don't really feel like joining in the pile-on here. By and large, I've liked Netapp's products, and their support has generally ranged from the top-notch to the purple-heart territory. (Hi Puneet!) I did my ranting at the time to our sales reps and my beer buddies. :-)
A couple people did ask me for clarification on my earlier, cryptic comments about failover, so here's the reader's digest version: I got bit by the bad FCAL controller problem on the early F760 motherboards. The controller failed during the reconstruction of a failed disk, and the partner head disabled cf takeover. Bad scene, lots of downtime, hovering executives, swarming EMC salesdroids, the works. Moving from the onboard db9 controller to one of the qlogic (?) pci card controllers resolved the issue.
-n
--
memory@blank.org
"Hiroshima '45, Chernobyl '86, Windows '95"
http://www.blank.org/memory/
On Mon, 9 Aug 1999, Jeffrey Skelton wrote:
- SAN is the future of storage
NetApp uses Fibre Channel, so they're definitely ready for SANs from the hardware perspective. A lot of the vendors you mention still cling to SCSI, which cannot easily be made SANable.
- Netapps and NFS is a lock in to old technology
SCSI is far older. The concept of SANs isn't that new either. Think of NACs with NFS as intelligent SAN. Whether the fabric is Fibre Channel (for SAN) or Gigabit Ethernet (for NAS) the underlying technology is the same. There may be a bit more overhead and latency with NAS, but how many viable SAN solutions are really out there today?
- Netapps is JBOD
It's as many Little Old Disks as you want. What advantage do Little Old Disks give you? Isn't SAN a JBOD also, really?
- that Eurologic is an unknown scrap vendor
Granted that the tolerances on their shelves could be a bit better, but I've had no big problems with them. They're a hell of a lot better than the DEC shelves NAC sold before.
- Netapps are impossible to back up
Funny, we back them up every night.
- RAID 4 is unreliable
Nothing is reliable. If you have any RAID set - including mirroring - you kill two disks and you're cooked.
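To put rough numbers on that: the real exposure is the rebuild window, when a second failure in the same group is fatal. A back-of-the-envelope sketch (all figures below are illustrative assumptions, not NetApp specs):

```shell
# Rough odds of a second disk in the same RAID 4 group failing during
# the reconstruction window. All numbers below are assumptions.
awk 'BEGIN {
    disks = 14            # disks in one RAID group, incl. parity (assumed)
    mtbf_hours = 500000   # per-disk MTBF (assumed)
    rebuild_hours = 4     # rebuild window with a hot spare (assumed)
    # chance any of the remaining disks dies during the rebuild,
    # modeling failures as independent and exponentially distributed
    p = 1 - exp(-(disks - 1) * rebuild_hours / mtbf_hours)
    printf "P(second failure during rebuild) ~ %.6f\n", p   # ~ 0.000104
}'
```

The point of the sketch: the length of the window matters far more than the RAID level, which is why correlated failures (bad firmware, no hot spare, a missed warning) are what actually kill data.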
- WAFL is slow
What faster alternative are the other vendors proposing? Considering the fact that one does not have to write any particular block on disk, it's very fast on writes. Sequential reads, which can - but don't have to - suffer under WAFL, can be optimized using caching to probably be as good as ufs.
- WAFL is impossible to recover when it becomes corrupt
It's really no more impossible than recovering from ufs corruption. It doesn't differ from ufs that much, and with NetApps the opportunity for corrupting a file system is significantly smaller than with ufs.
- wack takes all day to report that you're still in trouble
Other FSes are no better in this respect. If you have to wack you're in really deep doodoo, but in my experience the need for wack is EXTREMELY rare.
- the cluster failover takes too long
It takes 1-2 minutes if I recall correctly. What window of outage is acceptable to you? What numbers were the other vendors giving you?
- the cluster failover is unreliable at best
It worked 100% of the time for me so far.
- the Netapps performance is terribly slow
It doesn't seem that way to me, but I haven't tested them for performance. The performance is pretty adequate for my requirements.
- the software is big pile of patches that are impossible to keep track of
I don't know about this, I'm not a developer. From my experience there are about 2 patch releases before they become rolled into a recommended release.
In general three themes came from the references:
- they lost data with Netapps. Why? Because they could not take a backup with their enterprise backup software and then they suffered a multiple disk failure from a set of disks with problem firmware.
How old are these claims? The NetApp of today is definitely not the NAC of the past.
How much of a problem is the backup situation and does the situation become a problem when it requires a patch for the Netapps software that may introduce other instabilities?
The backup problem I experienced was that Veritas OpenVision had a problem with the newer version of the NetApp OS. The problem was clarified between NetApp and Veritas, and it appears that they've worked it out. We're on an older release, 5.2.1P2, and backups appear to work just fine. Now that Veritas has certified 5.3.2, we will be upgrading to the new version to take advantage of awaited new functionality.
- the Netapps filer was too slow for database use. I could understand how some of the storage arrays directly attached to a server could be faster. This isn't my application space, so I'm not outright concerned with it.
We're not using them for databases currently, so I cannot comment on such use.
But, I have seen some of the recent messages on performance and throughput. How would you rate the Netapps for performance for things like home directory storage?
This is one of our uses. It seems fine.
This concerns me, because I am willing to pay the extra money for a cluster. As long as I'm shelling out, should I go to EMC for reliability or hope that the Netapps failover can work? I can get more Netapps for the money.
NetApp failover does work reliably.
I'm not dumb enough to believe all of the claims that sales droids make. But, I'd like to hear some opinions as to what makes you buy Netapps, what keeps you on Netapps, and what will drive you away from Netapps.
I'll break the answer into several sections.
What I like about NetApps:
- They're simple to configure and administer.
- They allow you to set soft partitions (quotas) to section up large volumes.
- You can resize soft partitions up or down as you please.
- You can grow "hard" volumes by simply adding disks.
- The filesystem design is simple and robust.
- The hardware design is clean and uniform (compare this to Auspex, which is a mess composed of UltraSparcs, Pentium IIs, and 386s).
- Clustering is simple and it works.
- Upgrading the OS is a breeze (one simple reboot).
- There is a ton of documentation available on-line explaining different implementational ideas for various environments as well as explanations of architectural features of NetApps.
What I dislike about NetApps (Some of these might have been fixed in 5.3.x):
- Lack of several - most are there - simple commands to test whether everything outside works (there are some hidden NIS commands, but they don't work anymore).
- The inability to change passwords via rsh or in a more secure yet "automatable" manner.
- The inability to kick off a user from a telnet session with ease.
- Lack of a rudimentary editor to edit configuration files so that an administrative host is largely unnecessary (a simple vi will do).
- The inability to add quotas without stopping and starting the quota system (you can modify sizes of existing quotas, but cannot add quota restrictions without stopping and starting the quota system).
- The lack of an NTP daemon to synchronize the system with the rest of the network.
- The lack of round-robin Ethernet trunking support (both of our switch vendors, Cisco and Cabletron, support it). ***This is the largest NetApp performance issue for us.***
- Poor quality cable connectors. They are functionally great, but the thumbscrews need some help. In the past, the rubber molding on the screws would be stripped when trying to gently tighten them with a small screwdriver. On my newest 760s I broke three screws because of a faulty tap nut. The nut didn't allow the screw to come flush with the connector, and the screws themselves were so brittle that they broke off. I LOVE the new screw design, but they MUST be made from a more durable material.
- Tolerances on the disk shelves could be better. The golden material used as an isolator between the disks and the case should be attached more uniformly. Sometimes it is difficult to insert a disk without it becoming caught on the said isolator. Also, there is a slim piece of plastic protruding from the bottom and against the side of the shelf. I found that piece of plastic to often be bent down or completely broken off. When bent down, the strip prevents the drive from being inserted. In my humble opinion this strip should be removed altogether. It is too thin and too flexible to stay up to spec.
Problems I have with Technical Support:
- Appearance of no clear and methodical path for troubleshooting problems on site. I've had this experience once. Two technicians appeared in my computer room after repetitive nagging for them to show up and take a look at my filer, which was spontaneously rebooting. Initially they appeared to be as stunned as I was. After some time, they proceeded to test the system drive shelf by drive shelf. I should mention that this filer was part of a cluster, and in order to test the drive shelves it had to be disconnected from the clustering partner, thus interrupting my users' access to data. Had they not troubleshot the box, the data would have been accessible throughout the day. Because of the troubleshooting, the data were inaccessible during business hours. In retrospect I regret letting them in when I did.
- The Tech Support lines have ungodly waiting times (with annoying music, no less). I doubt that I am exaggerating when I say that I waited for half an hour last time I called Tech Support. I'm sure NetApp has a better measure of how much time I waited for anyone to answer during business hours. I'd like to have it so I can quote it exactly.
- Tech Support bumps you from tech to tech over several days and poorly communicates their discoveries, i.e. they rarely communicate anything, and when they do I can see my grandchildren watch their grandchildren graduate with PhDs.
- Tech Support sends replacement parts with no explanation. I received parts from NAC on several occasions very timely, but often unnecessarily and with no communication. These parts did not come as a result of requesting such parts, but as the outcome of troubleshooting the problem based on submitted data, without discussion with the customer. You submit a ticket describing a problem and the parts simply appear. I call NAC asking why I should replace the parts, only to hear that what I described is not a hardware problem. Several weeks later I receive reassurance that it is in fact an exotic bug that is fixed in the recent OS release.
- Tech Support doesn't keep good track of materials they receive from customers. It seems like every time I call to find out about my tickets I have to resubmit autosupport and potential cores. Can't they just take one of the boxes from the production lines, dump all that data into it, and then reference the location in the ticket database? I'm sick of uploading up to 80MB core files again and again and again.
Regardless of the above harsh criticism, I think that the NACs are overall very good. They're stable in themselves, and clustering works if you need to perform any maintenance on the head unit itself. The technical support organization needs work, but I think they realize it and are moving in a good direction. My gut feeling is that they're becoming more expert on their product, but you still have to wait 30 minutes to get through to them. I'd recommend them over Auspex for their simple and robust OS, clear-cut hardware design, ease of use, and speed of cleanup after a crash. With NetApps I had no cleanups; yank a cord, plug it back in, and it continues chugging within 2 minutes. Try that with an Auspex. The same is true of many directly attached devices.
Tom
On Mon, Aug 09, 1999 at 10:15:59PM -0500, tkaczma@gryf.net wrote:
What I dislike about NetApps (Some of these might have been fixed in 5.3.x):
- The inability to kick off a user from a telnet session with ease.
This bugs me a lot
- Lack of a rudimentary editor to edit configuration files so that an administrative host is largely unnecessary (a simple vi will do).
This bugs me too!
- The inability to add quotas without stopping and starting the quota system (You can modify sizes of existing quotas, but cannot add quota restrictions without stopping and starting the quota system).
This one I don't get.
I currently add people to our system and don't see the problem you describe.
- The lack of NTP daemon to synchronize the system with the rest of the network.
Another annoyance.
I'd recommend them over Auspex for their simple and robust OS, clearcut hardware design, ease of use, and speed of cleanup after a crash. With NetApps I had no cleanups; yank a cord, plug it back in, and it continues chuging within 2 minutes. Try that with an Auspex. Similar is true with many directly attached devices.
Care to give more detail to this?
I ask because Auspex was just at my office touting their new NS2000 at a price point that I haven't seen before.
On Tue, 10 Aug 1999, Mike Horwath wrote:
- The inability to add quotas without stopping and starting the quota system (You can modify sizes of existing quotas, but cannot add quota restrictions without stopping and starting the quota system).
This one I don't get.
I currently add people to our system and don't see the problem you describe.
Really, what is your procedure for adding new lines to the quotas file? More specifically, what do you do after you add lines to the quotas file? Which rev of DOT do you use?
I'd recommend them over Auspex for their simple and robust OS, clearcut hardware design, ease of use, and speed of cleanup after a crash. With NetApps I had no cleanups; yank a cord, plug it back in, and it continues chuging within 2 minutes. Try that with an Auspex. Similar is true with many directly attached devices.
Care to give more detail to this?
I ask because Auspex was just at my office touting their new NS2000 at a price point that I haven't seen before.
OK, here come the rants.
The NS2000 is composed of at least 4 processors. The Host Processor is built around an UltraSPARC (~300 MHz). In my opinion this piece is completely unnecessary, as it provides marginal functionality to an appliance that should serve NFS. The OS running on this piece of hardware is hacked-up Solaris 2.6. Then there are Network/File System Processors. These are based on 2 Pentium IIs and run whatever OS Auspex managed to write or hack for them. Thirdly, there is the Environmental Monitoring unit based on a 386. Obviously that runs yet another OS, probably very simple and firmware based. What next? Oh, Auspex uses SCSI controllers (this is bad enough in my opinion) that have three SCSI channels. The way they would like to connect their shelves is 1 channel per 7 drives, but their shelf configuration holds 28 drives, so you end up connecting the first channel to two shelf units (14 drives) and the remaining two channels to the remaining 14 at 7 apiece. What a hack, really.
A nasty reboot (power outage) involves rebooting the Host Processor, which then uploads the proprietary OS to the NPFSP (Network P./ Filesystem P.). Then each NPFSP _must_ perform a fsck. Oh, I forgot to mention: their log filesystem only logs metadata, and their snapshots are not done in place like they are on a NetApp but copied out of the filesystem. A NetApp allocates another block on write and updates the inode; an Auspex copies the old block to a separate partition and then overwrites the original block. That's two writes in contrast to one. Sure, Auspex direct disk reads may be faster, but they already cache about a quarter GB of data in memory so you get no benefit from it with volatile filesystems. The NetApp has enough time to read ahead blocks that are not contiguous; they also have an opportunity to place new blocks close to the rest of the file. WAFL means write anywhere, and that includes writing near the old blocks, something I'm certain NetApps attempt to do in their block placement algos. In short, I find the Auspex design messy and kludgy. The benefits, if any, appear to be minimal and are outweighed by increased opportunity for things to break (the more things there are to break in series, the more likely the system is to break).
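The write-count argument above can be sketched with a toy model (purely illustrative; it counts block writes, not measured throughput):

```shell
# Overwriting N in-use blocks while a snapshot exists (assumed model):
#   write-anywhere (WAFL-style): write each new block elsewhere -> 1 write
#   copy-out snapshots: copy the old block aside, then overwrite -> 2 writes
N=1000
echo "write-anywhere: $((N * 1)) block writes"   # 1000 block writes
echo "copy-out:       $((N * 2)) block writes"   # 2000 block writes
```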
Tom
On Tue, Aug 10, 1999 at 11:10:28AM -0500, tkaczma@gryf.net wrote:
On Tue, 10 Aug 1999, Mike Horwath wrote:
- The inability to add quotas without stopping and starting the quota system (You can modify sizes of existing quotas, but cannot add quota restrictions without stopping and starting the quota system).
This one I don't get.
I currently add people to our system and don't see the problem you describe.
Really, what is your procedure for adding new lines to the quotas file? More specifically, what do you do after you add lines to the quotas file?
rsh <servername> quota resize <vol>
Which rev of DOT do you use?
NetApp Release 5.1.2: Sat Sep 26 05:05:59 PDT 1998
Yep, somewhat out of date, but works dandy.
Testing right now!
SO, it fails on butler, which is the same revision.
And checking on jeeves...it also fails Too!
I guess I get to eat crow on this one. :(
A benefit I have, though, is that I don't do day-to-day maintenance and I guess I was wrong all along.
Sorry 'bout that!
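For anyone hitting the same thing, the workaround implied by this exchange, as I understand it, would be roughly the following. This is a hedged sketch; exact syntax varies by DOT release, and FILER and vol0 are placeholders:

```shell
# Assumed workaround on older Data ONTAP: `quota resize` only rereads
# limits for entries already in effect, so after adding a NEW line to
# /etc/quotas (edited from the admin host over NFS/CIFS) you restart
# quotas instead. FILER and vol0 are placeholders.
rsh FILER quota off vol0
rsh FILER quota on vol0
```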
I'd recommend them over Auspex for their simple and robust OS, clearcut hardware design, ease of use, and speed of cleanup after a crash. With NetApps I had no cleanups; yank a cord, plug it back in, and it continues chuging within 2 minutes. Try that with an Auspex. Similar is true with many directly attached devices.
Care to give more detail to this?
I ask because Auspex was just at my office touting their new NS2000 at a price point that I haven't seen before.
OK, here come the rants.
Excellent, thanks!
Got more details?
I will be discussing these aspects with them, and testing it out myself.
The price per GB is looking oh so much better than the NetApp.
On Tue, 10 Aug 1999, Mike Horwath wrote:
I will be discussing these aspects with them, and testing it out myself.
The price per GB is looking oh so much better than the NetApp.
I think they cut their pricing quite significantly in the NS2000. Ideologically I'd probably rather buy directly attached storage than what appears to be a bunch of hardware patches. Initially their architecture had even more components. They decided to consolidate more and more in the new hardware revs, but the NAC is already consolidated. Judging by the architecture, I think that NACs are cheaper to produce than Auspexes and better thought out. I think the difference in pricing is because Auspex is trying to recapture some market, which is divided among far more companies than a reader of this list alone would be led to believe.
Tom
On Tue, 10 Aug 1999, Mike Horwath wrote:
Care to give more detail to this?
I ask because Auspex was just at my office touting their new NS2000 at a price point that I haven't seen before.
One other thing that came to my mind is that Auspex does not support CIFS and HTTP. I believe they do FTP. I'm not interested in CIFS or HTTP myself as I use Filers with NFS exclusively, but I am glad that I have the option of turning on CIFS when my organization decides to use CIFS directly from storage appliances. You may say that the presence of the host processor on an Auspex enables one to do a variety of things like run Samba, but I assure you that Auspex is not optimized to perform any major file operations via its host processor. Also, if you think that you can rdist something quickly to an Auspex box you're in for a disappointment. Both NAC and Auspex are built with a small set of file services (NFS, etc.) in mind. If you don't go through these services, in the case of Auspex you suffer severe performance penalties. With NetApp this is not a problem, as there is no way of circumventing those services aside from their proprietary volume copy mechanism, which is very robust itself.
Tom
On Tue, Aug 10, 1999 at 11:22:08AM -0500, tkaczma@gryf.net wrote:
On Tue, 10 Aug 1999, Mike Horwath wrote:
Care to give more detail to this?
I ask because Auspex was just at my office touting their new NS2000 at a price point that I haven't seen before.
One other thing that came to my mind is that Auspex does not support CIFS and HTTP. I believe they do FTP. I'm not interested in CIFS or HTTP myself as I use Filers with NFS exclusively, but I am glad that I have the option of turning on CIFS when my organization decides to use CIFS directly from storage appliances. You may say that the presence of the host processor on an Auspex enables one to do a variety of things like run Samba, but I assure you that Auspex is not optimized to perform any major file operations via its host processor. Also, if you think that you can rdist something quickly to an Auspex box you're in for a disappointment. Both NAC and Auspex are built with a small set of file services (NFS, etc.) in mind. If you don't go through these services in case of Auspex you suffer severe performance penalties. With NetApp this is not a problem as there is no way circumventing those services aside from their proprietary volume copy mechanism which is very robust itself.
Tom, when was the last time you looked at an Auspex?
CIFS is supported. (I also only use NFS :)
On Tue, 10 Aug 1999, Mike Horwath wrote:
Tom, when was the last time you looked at an Auspex?
CIFS is supported. (I also only use NFS :)
Just a couple minutes ago. ;) We have an NS2000, but it's not in production yet. I don't think the older (7000?) series support CIFS. I just read a press release dated 7/29/99 that states NT support on the NS2000. Sorry for misleading you.
Tom
tkaczma@gryf.net writes:
What I dislike about NetApps (Some of these might have been fixed in 5.3.x): [...]
- The lack of NTP daemon to synchronize the system with the rest of the network.
That was added in 5.3. This was Most Excellent, because my nightly cron job to 'rsh FILER rdate UNIXBOX' was a hack that slightly annoyed me :-)
Here's a couple of commands (with hostnames sanitized).
# date ; rsh FILER date
Wed Aug 11 22:22:45 EST 1999
Wed Aug 11 22:22:45 EST 1999
# rsh FILER options | grep timed
timed.enable     on
timed.log        off
timed.max_skew   30m
timed.proto      ntp
timed.sched      hourly
timed.servers    NTPSERVER1-IP,NTPSERVER2-IP
- The lack of NTP daemon to synchronize the system with the rest of the network.
That was added in 5.3. This was Most Excellent, because my nightly cron
Oh my! After a 3+ year agonizing wait they finally did it!!! Now, if only 5.3 were stable enough for us to move to :-)
---philip thomas
Jeffrey Skelton wrote:
I'm interested in hearing any stories on how Netapps wins or loses your business. I'm currently shopping for storage and a Netapps filer seems to fit the bill. I've talked to other storage system vendors -- that big one, EMC (or is it Data General), MTI, and some others.
They all emphasize that:
- SAN is the future of storage
This may be true but not written in stone, yet.
- Netapps and NFS is a lock in to old technology
But Netapp is more than just NFS. Try running CIFS and NFS on the same volume with an EMC.
- Netapps is JBOD
This person knows nothing of which he speaks and should be duly ignored for such an ignorant statement. If you wish to purchase product from a vendor that would flat out lie to you, be my guest.
- that Eurologic is an unknown scrap vendor
Eurologic joined SNIA this year to help drive the direction of SAN along with Netapp, EMC and Auspex. Eurologic has also partnered with Seagate and DEC. Check www.eurologic.com for more information on this company.
- Netapps are impossible to back up
Not so. There are many ways to back up a filer. The fastest by far is direct-attached SCSI, though. If investing in large filers, one might consider the use of DLT for backups.
- RAID 4 is unreliable
How much more reliable do you want to be? With a dedicated parity disk and hot spares available, the risk of losing two disks on the same volume at the same time is minimal. This has happened. It happened to me. It was fully detectable, but I missed the warning sign because I became busy with other things.
- WAFL is slow
Most vendors pit their numbers for non-RAIDed disk against those of Netapp's RAIDed disks, and often lose. RAID for RAID, Netapp competes with the best of them.
- WAFL is impossible to recover when it becomes corrupt
Corruption is very rare because of the way Netapp "scrubs" the disks. In the event of corruption, the SE does have utilities available to him to try to fix the problem. I screwed up an OS downgrade once and had to call upon Netapp to bail me out. Giving the error messages over the phone to the support center, I was given instructions on what to do to resolve my problem. I was back in business before morning.
- wack takes all day to report that you're still in trouble
Not true. With larger file systems it does take longer, but one can run WACK several times in the same day to resolve these problems.
- the cluster failover takes too long
0:02:45. Timed it.
- the cluster failover is unreliable at best
It is true that there were some issues in the beginning. With the release of DOT 5.3.x, and the use of 18GB drives, I have not seen the same problems re-occur.
- the Netapps performance is terribly slow
We covered this already, I think.
- the software is big pile of patches that are impossible to keep track of
True. DOT is currently a big pile of patches that is difficult, at best, to track. Network Appliance has heard this complaint and is dealing with the situation.
They've also given me references of companies that have switched from Netapps to other solutions.
In general three themes came from the references:
they lost data with Netapps. Why? Because they could not take a backup with their enterprise backup software and then they suffered a multiple disk failure from a set of disks with problem firmware.
I've seen a lot of griping on this least lately about backups. How much of a problem is the backup situation and does the situation become a problem when it requires a patch for the Netapps software that may introduce other instabilities?
the Netapps filer was too slow for database use. I could understand how some of the storage arrays directly attached to a server could be faster. This isn't my application space, so I'm not outright concerned with it.
But, I have seen some of the recent messages on performance and throughput. How would you rate the Netapps for performance for things like home directory storage?
We are using F760s for home directory storage. There is no problem with this as a solution. Depending on the database application and the filer model, there could be a performance issue. I would not recommend using an F330 for large Oracle data structures. One really needs to focus on high-end filers for large or very busy databases.
the clustering did not work. Paid for cluster and the failover did not work as advertised.
This concerns me, because I am willing to pay the extra money for a cluster. As long as I'm shelling out, should I go to EMC for reliability or hope that the Netapps failover can work? I can get more Netapps for the money.
There are/were problems with the cluster failover process in the beginning of DOT 5.2. There were situations where a failover would be initiated and the partner would panic. Most of these issues were addressed. Since installing DOT 5.3.x on our filers, we have not seen a partner panic situation occur.
I'm not dumb enough to believe all of the claims that sales droids make. But, I'd like to hear some opinions as to what makes you buy Netapps, what keeps you on Netapps, and what will drive you away from Netapps.
Network Appliance performs for us. We are an Engineering Design group. We moved to a dataless desktop environment for administrative reasons. All of our current storage is NetApp: home directories, design directories, and the very abusive simulation directories. We have a bank of LFS servers, U2s with multiple CPUs. The sole purpose of these systems is to run simulations of our designs. This bank of LFS servers brought our Auspex servers, running over FDDI, to their knees. We have not seen an F760 filer become bogged down by our LFS servers, even when we were running in fail-over mode. The single head was able to keep up with the demand for both simulation filers.
We also make use of NetApp's snapshots. We use different schedules for different applications. Design data, being our most important, is snapshotted more often than home directories. Simulation directories do not receive snapshots at all.
If there is a problem, a call to 888-4-NETAPP is usually all it takes to get the situation resolved. Sometimes the problem is more complex than can be handled over the phone, and an SE is dispatched. I like to do my own work, and thus I like talking to the engineering staff in California to resolve my own problems. If I have a problem getting support, I can call on my SE or Account Exec. to give support a push. I have had the same good relationship with Auspex as well, though.
Network Appliance is customer oriented and I hope they stay that way. This is why I continue to recommend them to my management. By the way, Jeff, if you buy a NetApp, tell Tom I want part of the commission. :)
-gdg
--jeff skelton
Jeff, Glen,
here at TI Europe we're also using NetApp Filers (740, 630, 330 in Germany, 720 in Northampton, 5xx in Stockholm).
Let me give a few comments:
They all emphasize that:
- SAN is the future of storage
This may be true but not written in stone, yet.
For a distributed computing environment, NFS is still the only state-of-the-art choice for file access. In an engineering environment with many user workstations accessing a few fileservers and some more compute servers, you need a filesystem-oriented protocol like NFS - there is currently no other standardized way to do it.
SAN solutions make sense, e.g., for application and database servers accessing different volumes on one SAN storage system. But in our case it really doesn't make sense to connect each workstation to the storage over a separate line instead of using the existing IP network.
What SAN vendors do is sell separate fileservers on top of the SAN architecture - comparable to EMC's Celerra. But these fileservers will also use NFS.
- Netapps and NFS is a lock in to old technology
But NetApp is more than just NFS. Try running CIFS and NFS on the same volume with an EMC.
With the Celerra it shouldn't be a problem...
- Netapps are impossible to back up
Not so. There are many ways to back up a filer. The fastest by far is direct-attached SCSI, though. If investing in large filers, one might consider the use of DLT for backups.
In a larger environment you probably don't want to back up each fileserver separately with an attached DLT. And as NetApp doesn't support a reasonable NDMP version which allows backup over the net, the only way to do a central backup is through NFS - not a very smart solution.
- RAID 4 is unreliable
How much more reliable do you want to be? With a dedicated parity disk and hot spares available, the risk of losing two disks on the same volume at the same time is minimal. That said, it has happened. It happened to me. It was fully detectable, but I missed the warning signs because I got busy with other things.
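As a rough sanity check on "minimal", here is a back-of-the-envelope sketch of the odds that a second disk in the same volume fails during the reconstruction window of the first. All the numbers are illustrative assumptions, not NetApp specifications.

```python
# Rough estimate of a double-disk failure on one RAID 4 volume: the
# second disk must fail while the first is being rebuilt onto a hot
# spare.  Disk count, failure rate, and rebuild time are assumptions.

def double_failure_odds(disks, annual_failure_rate, rebuild_hours):
    """Probability that one of the surviving disks in the volume
    fails during the rebuild window of the first failed disk."""
    hours_per_year = 24 * 365
    # Chance that a single surviving disk fails within the window.
    per_disk = annual_failure_rate * (rebuild_hours / hours_per_year)
    return (disks - 1) * per_disk

# 14-disk volume, 2% annual failure rate per disk, 8-hour rebuild:
print(double_failure_odds(14, 0.02, 8))
```

With these assumed numbers the result is on the order of a few in ten thousand per rebuild - small, but, as the message above notes, not zero, and firmware problems that take out several disks at once break the independence assumption entirely.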
- WAFL is slow
Most vendors pit their numbers for non-RAIDed disks against those of NetApp's RAIDed disks and often lose. RAID for RAID, NetApp competes with the best of them.
I think RAID 4/RAID 5 cannot be as performant as RAID 1 (mirroring), because the parity needs to be recalculated for every block written. But NetApp did a good job of optimizing their systems for RAID 4, so the F760, for example, is still one of the fastest dedicated file servers - see http://open.specbench.org/osg/sfs/results/ for more detailed information.
If there is a problem, a call to 888-4-NETAPP is usually all it takes to get the situation resolved. Sometimes the problem is more complex than can be handled over the phone, and an SE is dispatched. I like to do my own work, and thus I like talking to the engineering staff in California to resolve my own problems. If I have a problem getting support, I can call on my SE or Account Exec. to give support a push. I have had the same good relationship with Auspex as well, though.
Network Appliance is customer oriented and I hope they stay that way. This is why I continue to recommend them to my management. By the way
I have my problems with the above. At least Network Appliance Germany is NOT very customer oriented, but very sales oriented. It took about 4 months until our F330, which was sporadically rebooting about once a week, was exchanged, and we had to go up to the highest management level to get the exchange. The suggested solutions before the exchange were "buy a new system, the F330 is outdated" or "your network or UPS may be responsible, so get it fixed." In other words, and not only in this case, the European NetApp support is NOT satisfactory - the Auspex support is much better.
And here we're at the second problem. NetApp's servers are not expensive, but you can see that in the hardware. In the last year we had:
- 1 defective SCSI card
- 2 defective FCAL cards
- 4 defective network cards (the last one, a Gigabit card, was dead on arrival)
- a few defective disks
- a defective power supply (fan died)
To be prepared for those problems you'll have to BUY a so-called "spares kit" with all possible spare parts, for a lot of money.
Additionally, in a few cases a NetApp died without rebooting or leaving any diagnostic result - just reboot it manually and it works again... And in a few cases a disk died but no automatic reconstruct started. NetApp didn't find out why - I think they didn't investigate very hard.
So, in conclusion: you can't bypass NFS, and NetApp's filers are performant, inexpensive NFS/CIFS servers, but not very reliable. No chance for 99% uptime.
Best regards,
Alexander Strauss wrote:
Jeff, Glen,
- Netapps are impossible to back up
Not so. There are many ways to back up a filer. The fastest by far is direct-attached SCSI, though. If investing in large filers, one might consider the use of DLT for backups.
In a larger environment you probably don't want to backup each fileserver separately with an attached DLT. And as NetApp doesn't support a reasonable NDMP version which allows backup over the net the only way to do a central backup is through NFS - not a very smart solution.
Backing up large volumes across an NFS mount is highly inefficient. We are using a Storage Tech tape silo and a Veritas backup solution. With the large tape system, we are able to connect DLT 7000 drives directly to the filers. When implementing high-end storage solutions, one must also implement high-end backup solutions.
-gdg
G D Geen wrote:
Alexander Strauss wrote:
Jeff, Glen,
- Netapps are impossible to back up
Not so. There are many ways to back up a filer. The fastest by far is direct-attached SCSI, though. If investing in large filers, one might consider the use of DLT for backups.
In a larger environment you probably don't want to backup each fileserver separately with an attached DLT. And as NetApp doesn't support a reasonable NDMP version which allows backup over the net the only way to do a central backup is through NFS - not a very smart solution.
Backing up large volumes across an NFS mount is highly inefficient. We are using a Storage Tech tape silo and a Veritas backup solution. With the large tape system, we are able to connect DLT 7000 drives directly to the filers. When implementing high-end storage solutions, one must also implement high-end backup solutions.
Correct. We didn't consider attaching our robot directly to the NetApps, for the following reasons:
- central backup (AML/J with 700 DLT slots) of a distributed server environment does not allow direct SCSI cabling.
- hard backup drive assignment (vs. automatic, S/W-controlled drive assignment) is not efficient and thus not cost effective.
- disaster recovery capability (a strategic TI goal) does not allow having the robot in the same room as the fileservers - and SCSI cable lengths are limited.
So we decided to do an NFS backup of the NetApps until (hopefully) a new NDMP client allows a reasonable backup over the network. By the way, in another mail I got the answer that backups to a remote drive are much faster - we want to do this in a similar way once the NDMP client is ready...
Interestingly, there was a brief discussion on toasters a few weeks ago about slow backups to a local SCSI tape; when the drive was made remote, the backups went MUCH faster. It seems that the additional buffering provided by the network allowed the tape drive to stream better than when it was attached locally.
On Tue, 10 Aug 1999, Alexander Strauss wrote:
In a larger environment you probably don't want to backup each fileserver separately with an attached DLT. And as NetApp doesn't support a reasonable NDMP version which allows backup over the net the only way to do a central backup is through NFS - not a very smart solution.
What do you consider a larger environment? How many terabytes of data do you have?
I think RAID 4/RAID 5 cannot be as performant as RAID 1 (mirroring), because the parity needs to be recalculated for every block written.
Addition is cheap, so that is not an issue. What may be an issue is retrieving parity data from the array, but that can easily be cached. There are two writes in RAID 4/5 just as there are in RAID 1.
Tom
I think RAID 4/RAID 5 cannot be as performant as RAID 1 (mirroring), because the parity needs to be recalculated for every block written.
Addition is cheap, so that is not an issue. What may be an issue is retrieving parity data from the array, but that can easily be cached. There are two writes in RAID 4/5 just as there are in RAID 1.
Actually, if you write N data blocks that correspond to the same parity block, you only do N+1 writes for the N blocks.
Steve Losen scl@virginia.edu phone: 804-924-0640
University of Virginia ITC Unix Support
On Tue, 10 Aug 1999, Stephen C. Losen wrote:
Actually, if you write N data blocks that correspond to the same parity block, you only do N+1 writes for the N blocks.
That is certainly an optimization. In the most basic case you write one block and update the parity block, which is two writes. One can also read all the blocks that add up to the parity block and recheck, but I wonder what that would do to performance. BTW, as I hear, Auspex replaces bad parity with a value calculated from the component blocks. I think that is dangerous.
Tom
On Tue, 10 Aug 1999, Stephen C. Losen wrote:
Actually, if you write N data blocks that correspond to the same parity block, you only do N+1 writes for the N blocks.
That is certainly an optimization. In the most basic case you write one block and update the parity block, which is two writes. One can also read all the blocks that add up to the parity block and recheck, but I wonder what that would do to performance. BTW, as I hear, Auspex replaces bad parity with a value calculated from the component blocks. I think that is dangerous.
In WAFL's RAID 4, the parity block is the XOR of all the corresponding data blocks. Assuming the old parity block is correct, you don't need to know the values of all the data blocks to calculate the new parity block: you only need the old and new values of the modified data blocks and the old value of the parity block.
If a bit changes in a data block, the corresponding parity bit must change. If the corresponding bit changes in two data blocks, it does not change in the parity block, etc. So you just XOR the old data block values and the new data block values and the old parity block to get the new parity block.
So if less than half of a parity block's data blocks change, it is faster to "update" the parity block. Otherwise it is faster to recalculate the parity block from the new data blocks. The choice also probably depends on which blocks are currently cached in RAM. And if WAFL zeroes out a block whenever it is freed, then whenever WAFL allocates a free block, the old value is known to be all zeroes, so the block does not need to be read from disk. And since I XOR 0 == I, a null block can be omitted from any parity calculations.
I'm no WAFL expert -- just a customer who's read the white papers. But it is clear that there are a lot of tricks that WAFL can employ to make RAID4 parity calculations fast and efficient.
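The XOR arithmetic described above can be checked with a tiny sketch (small integers standing in for disk blocks; this illustrates the math, not WAFL's actual code):

```python
from functools import reduce
from operator import xor

# RAID 4 parity is the XOR of all data blocks in the stripe.
def parity(blocks):
    return reduce(xor, blocks)

stripe = [0b1010, 0b0110, 0b1111, 0b0001]   # four "data blocks"
p_old = parity(stripe)

# "Update" path: derive the new parity from only the old parity,
# the old value of the modified block, and its new value.
old, new = stripe[2], 0b0101
stripe[2] = new
p_update = p_old ^ old ^ new

# Recalculation path: XOR all the data blocks from scratch.
p_recalc = parity(stripe)

# Both paths agree, as the XOR identities require.
assert p_update == p_recalc
```

The same identity is why a bit that changes in two data blocks leaves the parity bit unchanged: the two flips cancel under XOR.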
Steve Losen scl@virginia.edu phone: 804-924-0640
University of Virginia ITC Unix Support
So if less than half of a parity block's data blocks change, it is faster to "update" the parity block. Otherwise it is faster to recalculate the parity block from the new data blocks.
Yup, we call those "parity by subtraction" and "parity by recalculation"; I don't know if we mention that in any white papers, but we do, in fact, choose one or the other based on which requires fewer disk accesses.
The choice also probably depends on which blocks are currently cached in RAM.
Nope - the RAID code (unless I missed something in my check) doesn't check the WAFL buffer pool to see which blocks are cached but aren't being written.
And if WAFL zeroes out a block whenever it is freed,
It doesn't, so that particular speedup won't work.
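The "fewer disk accesses" choice above could be sketched roughly like this, under the simplifying assumption that every block involved must be read from disk (consistent with the note that the RAID code doesn't consult the buffer cache). The function names and stripe numbers are illustrative, not ONTAP internals.

```python
# Reads required by each parity method when writing some blocks of a
# stripe.  stripe_width counts data blocks; the writes themselves
# (modified blocks plus the parity block) cost the same either way.

def reads_needed(stripe_width, blocks_written):
    # Parity by subtraction: read the old value of each modified
    # block, plus the old parity block.
    subtraction = blocks_written + 1
    # Parity by recalculation: read every unmodified data block.
    recalculation = stripe_width - blocks_written
    return subtraction, recalculation

def choose_method(stripe_width, blocks_written):
    sub, recalc = reads_needed(stripe_width, blocks_written)
    return "subtraction" if sub <= recalc else "recalculation"

print(choose_method(14, 2))    # small writes favor subtraction
print(choose_method(14, 12))   # large writes favor recalculation
```

This matches the intuition earlier in the thread: updating a couple of blocks in a wide stripe reads far less by subtraction, while rewriting most of a stripe makes recalculation cheaper.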
tkaczma@gryf.net wrote:
On Tue, 10 Aug 1999, Alexander Strauss wrote:
In a larger environment you probably don't want to backup each fileserver separately with an attached DLT. And as NetApp doesn't support a reasonable NDMP version which allows backup over the net the only way to do a central backup is through NFS - not a very smart solution.
What do you consider a larger environment? How many terabytes of data do you have?
About 4 TB only here in Germany - about 1/4 on NetApp.
I think RAID 4/RAID 5 cannot be as performant as RAID 1 (mirroring), because the parity needs to be recalculated for every block written.
Addition is cheap, so that is not an issue. What may be an issue is retrieving parity data from the array, but that can easily be cached. There are two writes in RAID 4/5 just as there are in RAID 1.
Tom
On Wed, 11 Aug 1999, Alexander Strauss wrote:
About 4 TB only here in Germany - about 1/4 on NetApp.
I have half that on a single NetApp in production. I don't know of any backup problems due to the amount of data. We use Breece Hill tape libs. I'll have roughly 3 times that on two new NetApps by the end of next week if the networking gods bless my endeavour. I'll see how that goes.
Tom
On Wed, 11 Aug 1999, Alexander Strauss wrote:
About 4 TB only here in Germany - about 1/4 on NetApp.
I have half that on a single NetApp in production. I don't know of any backup problems due to the amount of data. We use Breece Hill tape libs. I'll have roughly 3 times that on two new NetApps by the end of next week if the networking gods bless my endeavour. I'll see how that goes.
Tom
I think the volume of data is straying off the point that Alexander was originally trying to make. I think the point was that if you want less than one tape library per filer, things get difficult.
If that is what was being said I agree. It is a problem that I have found across the board while trying to spec our backup solution. If it was not what was being said, then I am saying it now.
Lewis
On Thu, 12 Aug 1999, Lewis wrote:
If that is what was being said I agree. It is a problem that I have found across the board while trying to spec our backup solution. If it was not what was being said, then I am saying it now.
Yes, I agree. What is the problem with backing up the systems via dump or NFS? I'm neglecting the fact that you lose CIFS and quota trees.
Tom