On 2023-10-17 14:54, Florian Schmid via Toasters wrote:
Hi Johan, thank you very much for your help.
No, we don't have the disks yet for which the flash-pool should be used. Not all SSDs will be used for flash-pool, only some for cache and the rest for fast SSD storage.
So you're thinking of having several different "physical tiers" with different characteristics (performance, inherent latency) for different workloads -- in the same HA-pair? Several different Aggrs with differing performance and behaviour in the same node, a FAS8300? (It's a fairly powerful machine, so it can do this adequately in many smaller workload cases.)
Or do you mean in different 8300 nodes in an X-node cluster (what's X?)
This idea is much harder to make successful than you probably think. It requires you to know a great deal about your workloads and your applications, and what they do, so that you can place the right data in the right place, and you have to be able to keep doing this over time as data volumes grow. Assuming they do... It's very hard indeed to automate, so you need people who can babysit this continuously and move data around. Yes, that's mostly non-disruptive, but it's still quite a lot of work.
For it to be successful in the longer run, it also pretty much assumes that your applications do not change their workload patterns and/or pressure more than very slowly. Is this the case?
All in all, FabricPool is much, much more automatic. It just does the job itself, pretty much without fuss once you've tuned it a bit w.r.t. cool-down period(s) and such. It "just works". You do need an S3 target system, but as has already been pointed out it can be ONTAP with NL-SAS drives; if you already have a bunch of these lying about, you can repurpose them and instead use new Cx00 (or Ax00) nodes in the "front end". The challenge with FabricPool is the network: the connection between the front end and the S3 back end needs to be very good and solid. You have to understand it fully and know every detail of how it's built, so you know you can trust its capacity and latency; traffic can be quite bursty.
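(As a rough illustration only: the per-volume tuning is essentially one command. The SVM and volume names below are placeholders, and the exact options can differ between ONTAP releases.)

   ::> volume modify -vserver svm_prod -volume vol_data01 -tiering-policy auto -tiering-minimum-cooling-days 31
   # "auto" tiers cold blocks from both the active file system and snapshots;
   # the cooling-days window is the main knob for how quickly data becomes eligible to move to S3.
   # (Changing the cooling period may require advanced privilege on some releases.)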
I'm not very positive about your idea here, I'm afraid:
"Not all SSDs will be used for flash-pool, only some for cache and the rest for fast SSD storage."
It's just my (long) experience that this is not very productive in reality, and it costs a lot in operations (manual work, skilled personnel). It also tends to give you various problems when you need to do HW LCM (upgrade your controllers and disk back ends). It inevitably leads to stranded capacity in more than one dimension as time passes.
/M
Hi Michael,
wow, thank you very much for your time writing this very detailed explanation!
It will be one 2-node 8300 cluster, switchless. The cluster will mainly be used for long-term archive storage until the data goes to tape, and for tape restores. For this, we want to use a large number of NL-SAS drives.
To speed the NL-SAS aggregates up a little, I thought we would also use some SSDs as flash-pool, like we have now on our old dev NetApp cluster.
The other SSDs will be used for backup and DR purposes. We have a full production all-flash cluster for our normal workloads.
We are thinking about moving some data to the 8300 cluster in the long term, because not all the volumes we have on SSD now need to be on flash, and they might consume too much "expensive" space there.
I will also have a deeper look at FabricPool. I looked at it in the past, but when I read "S3 storage" I didn't dig any deeper, as we are not using S3 at all at the moment. That was some years ago. As we have always had only one all-flash cluster, I hadn't thought about it.
Shouldn't FabricPool also work on a 2-node cluster? Instead of using some SSDs for flash-pool, we could create one aggregate on SSD and one on NL-SAS, use the NL-SAS one for S3 storage, and then use it for local FabricPool?
Best regards, Florian
OK, so this is a minimal deployment: just one (1) FAS8300 HA-pair. This "archive" storage, is it for pure compliance reasons? (You mention writing it out to tape even...)
To speed the NL-SAS aggregates up a little, I thought we would also use some SSDs as flash-pool, like we have now on our old dev NetApp cluster.
Sure, it is definitely advisable to have flash as a cache in such a system. But you won't need much at all, that's my tip. Do the simulations with AWA over at least 4 weeks (like Johan Gislén wrote) and see for yourself.
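(For anyone who hasn't used it: AWA runs from the nodeshell, roughly like the sketch below. The node and aggregate names are placeholders, and the exact syntax differs a bit between ONTAP releases, so check the command help on your system first.)

   ::> system node run -node fas8300-01
   > wafl awa start aggr_nlsas_01    # start collecting stats on the NL-SAS aggregate
   (let it run over at least 4 weeks of representative workload)
   > wafl awa print                  # projected hit rates and suggested cache size
   > wafl awa stop aggr_nlsas_01     # stop the analysis when done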
Now, if you know your application and use cases well, you will know whether the data written to all these NL-SAS drives will occasionally be read a lot, and whether that will be random R (cache will help, and be necessary) or sequential R. If the latter, the SSDs won't really do much for you; the data will be sucked in from spinning disk in that scenario, and 7.2K rpm drives are VERY VERY slow... you risk spending $$$ on SSD for almost no benefit if your scenario is like that.
Again: if you know your workload, and it is indeed very light (it's "archive" type data in the true sense and it pretty much just sits there on NL-SAS once it's been written), you could just as well skip FlashPool and use an adequate amount of FlashCache. It won't cache W of course, but for archive type use cases that's unlikely to matter.
While it is possible to tune FlashPool a bit (there are quite a few parameters you can change), it's hard to make a real difference IRL. I tried it once with our heavy aggregated NFS workload and just gave up; it wasn't worth the effort. If you know you have a large portion of random overwrites in your workload (>> 10%), then FlashPool will "win" over FlashCache. For READ they're pretty much the same, and I cannot believe you'd ever notice any difference.
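(For reference, the main per-volume knob is the caching policy. A minimal sketch with placeholder names; it needs advanced privilege, and the available policy values vary by release:)

   ::> set -privilege advanced
   ::*> volume modify -vserver svm_archive -volume vol_archive01 -caching-policy random_read_write
   ::*> set -privilege admin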
This is too little info for me to understand:
The other SSDs will be used for backup and DR purposes.
Do you perhaps mean that the 100% SSD Aggrs you plan to put in this FAS8300 node pair are for DR purposes? DR of what? You perhaps plan to sync-mirror data from your AFF based production cluster to this FAS8300? That's fine if the workload is small enough for the FAS8300 to handle it in your DR situation, but if I were you I would think long and hard about how to recover from such a potential state where [part of] your production workload goes to the FAS8300's SSD Aggr... I.e. how do you get back to your normal production state once this has happened? If you cannot do that in any way that makes sense, the cure might be worse than the disease, so to speak.
I would also think through very thoroughly what your definition of "disaster" is (in your specific situation) and which ones exactly this DR you're referring to will protect from. It's always a complex optimisation problem.
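(If what you have in mind is a regular async SnapMirror relationship into the FAS8300's SSD aggregate, the mechanics on the destination side would look roughly like this. Cluster and SVM peering are assumed to already be in place, and all the names are made up:)

   dr::> volume create -vserver svm_dr -volume vol_prod01_dr -aggregate aggr_ssd_01 -type DP -size 10TB
   dr::> snapmirror create -source-path svm_prod:vol_prod01 -destination-path svm_dr:vol_prod01_dr -type XDP -policy MirrorAllSnapshots -schedule hourly
   dr::> snapmirror initialize -destination-path svm_dr:vol_prod01_dr

(The hard part is not these commands; it is the failback plan afterwards, per the point above.)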
We are thinking about moving some data to the 8300 cluster in the long term,
So this data would be "low pressure" production data on your AFF cluster now, I take it. It's not very intense, but it's still not "archive" type data. Putting such data on very slow spinning disk is often dangerous in that it risks performance issues. And the FlashPool might not help as much as you would wish, even if you have lots of it. This is the kind of scenario that will inevitably give you headaches in the long run; moving data back and forth between different clusters isn't even non-disruptive. How can you be sure that data you've moved to this slow FAS8300 doesn't "pick up speed" again later, with the application/data owners starting to complain? How can you know that you will have adequate space at that point in time to migrate it back to your AFF based production cluster? If you know this, then no problem!
The very good thing about AFF (Cx00 and Ax00) is that you don't have to care. You can throw anything and everything at it and all the workloads will just be absorbed w.r.t. the back end -- it's a gift from Flash Land. The limiter will be the CPU utilisation in the node itself. For this type of scenario I strongly recommend you leverage FabricPool (you need an S3 back end). The AFF Ax00 or Cx00 will have all Storage Efficiency features running all the time, and this is preserved when data is sent out into S3 buckets. You can't run the full Storage Efficiency chain on your FAS8300 with slow NL-SAS and FlashPool. (It's supported AFAIK, but it will inevitably bite you.)
I haven't looked deeper into it [FabricPool], as we are not using S3 at all at the moment. That was some years ago. As we have always had only one all-flash cluster, I hadn't thought about it.
Well, if you happen to have NetApp gear (older FAS) including lots of NL-SAS shelves, then you should definitely start running FabricPool on this one AFF based production cluster you have. You still have to have some sort of backup (SnapMirror/-Vault), just as you have now (I assume). If you have lots of NL-SAS shelves already but lack controllers, you can buy some for a small sum of money. FP will automagically move all the "cold" WAFL blocks out to S3 based storage, and ONTAP S3 is *fast*. No problem there ever; the (only) challenge for you is to make sure the network connection between the two clusters is rock solid.
Shouldn't FabricPool also work on a 2-node cluster? Instead of using some SSDs for flash-pool, we could create one aggregate on SSD and one on NL-SAS, use the NL-SAS one for S3 storage, and then use it for local FabricPool?
Yes, this way of doing things (FabricPool internally, inside the same cluster) should work. I'm not sure whether you can do it within the same *node* though; it may be that the S3 bucket has to live on a different node than the aggregate that tiers into it.
Please, anyone, correct me if I'm wrong about this.
I agree that if this type of FP setup you describe is supported with a 2-node FAS8300, it's not a bad idea at all.
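(Assuming this intra-cluster setup is supported on the ONTAP release in question, the flow would look roughly like the sketch below. Every name is a placeholder, the S3 LIF/certificate/bucket-permission setup is left out, and you would also want to make sure the bucket's backing volume lands on the NL-SAS aggregates.)

   # On the SVM that will serve S3 inside the same cluster:
   ::> vserver object-store-server create -vserver svm_s3 -object-store-server s3.archive.local -is-http-enabled true -is-https-enabled false
   ::> vserver object-store-server user create -vserver svm_s3 -user fp_user
   # (note the access key / secret key pair this returns)
   ::> vserver object-store-server bucket create -vserver svm_s3 -bucket fabricpool-tier
   # Register the bucket as a cloud tier and attach it to the SSD aggregate:
   ::> storage aggregate object-store config create -object-store-name local-nlsas-tier -provider-type ONTAP_S3 -server s3.archive.local -container-name fabricpool-tier -access-key <key> -secret-password <secret> -ssl-enabled false
   ::> storage aggregate object-store attach -aggregate aggr_ssd_01 -object-store-name local-nlsas-tier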
/M
FlashPool was almost miraculous in its day, and it's still important. I've seen a bit of a resurgence for FlashPool in the past year for similar reasons to what you seem to have. We see these massive archival systems, and I'm strongly recommending generous FlashPool so whatever random IO might happen will be captured by the SSD layer.
I spent years building database setups, and if I could get 5% of the total dataset size in the form of FlashPool SSD, then virtually all the IOPS would hit that SSD layer. There was often barely any difference between all-flash and Flashpool configurations. There would still be a lot of IO hitting the spinning drives, but it was the sequential IO, which honestly doesn't benefit much from SSD anyway.
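(To make that rule of thumb concrete with made-up numbers: on a 200 TB database dataset, 5% works out to roughly 10 TB of Flash Pool SSD, and in those configurations that was typically enough for virtually all the random IOPS to land on the SSD layer.)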
That approach mostly went out the window because all-flash got affordable. Even if you didn't technically need all-flash at the moment, it was cheap enough and futureproof. A second reason is the size of spinning drives. We used to regularly sell systems with eight SSDs and 500 spinning drives. There was a decent amount of spinning disk IOPS to go around. These days, you're often buying dramatically fewer spinning drives, which means it's easier to push them to their collective IOPS limits. FlashPool can be a nice cushion against IO surges.
I'd also recommend taking a look at C-Series. The whole point of C-Series is all-flash capacity projects. It's the natural upgrade path for hybrid SSD systems. I don't know what price looks like. Some customers are definitely swapping hybrid for C-Series, but there are also some huge capacity projects where that doesn't quite make financial sense.
Someday, though. Someday there will be no hybrid spinning disk systems and it will all be on these capacity-based all-flash systems, but there is still a role for FlashPool at present.
On 2023-10-18 19:28, Jeffrey Steiner wrote:
FlashPool was almost miraculous in its day, and it's still important. I've seen a bit of a resurgence for FlashPool in the past year for similar reasons to what you seem to have. We see these massive archival systems, and I'm strongly recommending generous FlashPool so whatever random IO might happen will be captured by the SSD layer.
For random R IOPS, a big FlashCache will do the job just as well. Arguably even better, depending on the workload (the two cache at different levels: FlashCache is a victim cache underneath the WAFL buffer cache, whereas FlashPool works quite differently).
Sure, you can have a bigger FlashPool than FlashCache, but in reality it will not matter; that's my experience from many years of keeping lots of unstructured, archive-type file data in very large NL-SAS Aggrs with FlashPool.
If you have a lot of random overwrite in your workload (>> 10%), then FlashPool will win when there are 7.2K rpm drives in the back end.
YMMV, but when testing things with AWA and measuring in real production, it's not easy to get a large aggregated file storage workload coming into a 7.2K rpm based ONTAP Aggr to perform well most of the time, even if you have a big FlashPool. No matter how you tune it (I did try), it tends not to be used (filled up) as much as you'd expect, and more IOPS go to spinning disk more often than you'd like. (E.g. our 8 TB FlashPool per Aggr was never filled to more than 20-30%, so there was waste there: stranded capacity.)
It's different of course if you have a well-defined application whose behaviour is known. Ideally the working set size and its temporal locality need to be such that they "suit" how FlashPool works, in order to leverage a large FlashPool size. How to match this is beyond me, to be honest; very few NetApp customers would be even close to knowing any of these things about their workloads.
All this said: it's MUCH MUCH better to have a "too large" FlashPool/-Cache than nothing on a 7.2K rpm based Aggr!
The difficulty is not to overspend on SSDs in this scenario, because NetApp's price model makes SSD shelves very, very expensive.
N.B. I'm not experienced at all with workloads coming from databases. For that stuff you'd all be wise to listen to Jeff ;-)
I agree that looking at Cx00 is a good idea in this use case, and then leveraging FabricPool in a smart way. I also concur here:
"...but there are also some huge capacity projects where that [C-series, large QLC Flash] doesn't quite make financial sense."
It depends on your definition of huge, but let's say PiB scale. Today, and for the foreseeable future (5 years), there's no way flash will be able to compete with large spinning 7.2K rpm NL-SAS in terms of $/(TiB*month). Perhaps not even in the next 10 years.
And the cheapest for true archiving use cases is still tape. To this day. I don't expect this to change soon either.
/M