Achieving Best-in-Class Latency for Composable Infrastructure
Compute/storage disaggregation, and the Composable Infrastructure it enables, is a hot topic, and well it should be: the dramatic benefits to both storage utilization and business agility are becoming well understood. But questions about the performance cost of disaggregation are still of interest and deserve answers.
Recently, Kazan Networks showed that it’s now possible to achieve IOPS and bandwidth equivalent to DAS using our new Fuji ASIC as the NVMe-oF bridge to an Ethernet fabric. This time, let’s take a look at the latency cost of remote-attached SSDs versus DAS.
Intuitively, many assume that DAS (direct attached storage) configurations will outperform a similar remotely attached configuration, and it’s easy to understand that assumption: DAS, with only a PCIe bus between a host processor and its SSDs, adds minimal latency in both directions. Surely inserting an RNIC, an Ethernet fabric, and an NVMe-oF bridge between the processor and the SSDs must slow things down, right? Well, in some cases (one, actually), that’s true, but in real workloads, the penalties are extremely close to zero.
Intrigued? Read on…
Let’s consider two configurations for testing and measuring latency performance:
First, we use an example that many begin with to “benchmark” latency performance: Take a drive with a single queue (job) and send a single command to the drive. In this mode, we get the lowest latency the drive can deliver, since it’s working on a single I/O and there are no other I/Os in flight to add delay.
Using one of the fastest SSDs available, Intel® Optane™ (based on 3D XPoint™ technology), it’s possible to see a single I/O return data in less than 10 µsec, generally about 10x faster than traditional NAND-based SSDs. (This is the drive we use for our latency testing, because measurement differences are less likely to get lost in the longer I/O times associated with NAND-based SSDs.)
Second, let’s instead use a much more practical configuration, with multiple queues to the drive and multiple entries per queue. In other words, a real-life use case. (If you can show me an actual datacenter running a single job per SSD with a queue depth set to one, I’ll buy you dinner!)
So with those two configurations in mind, let’s talk through each – how to set them up and their results.
Single Queue / Single Command
For this “benchmark” configuration, we’ll set up the server as follows: Map all RDMA interrupt vectors to a single core and launch your application (e.g. FIO) on that one core. This minimizes context switching in the server and allows you to optimize round-trip latency for that single command.
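As a rough sketch of what this single-job setup might look like with FIO, the Python snippet below builds an fio command line for one job at queue depth 1, pinned to one core. It is not Kazan's test script (those are in the app note mentioned at the end of this post). The device paths, the `libaio` engine, the random-read pattern, the 60-second runtime, and the helper name `single_command_fio_args` are all assumptions chosen for illustration. Mapping the RDMA interrupt vectors to the same core is a separate, system-specific step that this sketch does not perform.

```python
def single_command_fio_args(device, core=0, rw="randread", runtime_s=60):
    """Build an fio argument list for one job at queue depth 1 on one core.

    All defaults here are illustrative assumptions, not the settings
    used for the results quoted in this post.
    """
    return [
        "fio",
        "--name=qd1-latency",
        f"--filename={device}",      # block device under test (hypothetical path)
        "--ioengine=libaio",         # assumed async I/O engine
        "--direct=1",                # bypass the page cache
        f"--rw={rw}",                # "randread" or "randwrite" (assumed pattern)
        "--bs=4k",                   # 4kB I/Os, as in the results
        "--numjobs=1",               # a single job (queue)
        "--iodepth=1",               # a single command in flight
        f"--cpus_allowed={core}",    # pin the job to one core
        "--time_based",
        f"--runtime={runtime_s}",
        "--output-format=json",      # machine-readable latency statistics
    ]


# Hypothetical device names: one local Optane SSD, one NVMe-oF attached SSD.
local_cmd = single_command_fio_args("/dev/nvme0n1")
remote_cmd = single_command_fio_args("/dev/nvme1n1")

print(" ".join(local_cmd))
print(" ".join(remote_cmd))
```

Running the same arguments against both devices keeps the comparison fair: the only thing that changes between the two runs is the path to the drive.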
Measure “local” SSD performance by running I/O to an Optane SSD installed directly in that server, then compare that to a disaggregated topology that looks like this:
Here at Kazan, we’ve done this testing and in fact showed this publicly at last month’s Flash Memory Summit in Santa Clara. The results:
4kB Read I/O: Incremental latency of 5 µsec
4kB Write I/O: Incremental latency of 8 µsec
We have yet to see any other NVMe-oF target solutions come close to this performance, but let’s keep this in context: This is a benchmark figure with little or no bearing on the real world. So let’s move on to the second configuration.
Multiple Queues / Multiple Commands per Queue
As mentioned previously, real-life applications take advantage of the parallel nature of both modern CPUs and NVMe SSDs by employing multiple “jobs” (queues) to each SSD, with multiple outstanding I/O commands per job.
So what happens to latency as we increase the I/O loads?
To measure this, configure the host server to spread the queues (and interrupts) evenly across all the cores in the CPU. Again, measure “local” SSD performance and compare it to “remote” performance, sweeping both the number of jobs and the number of entries per queue.
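A sweep like this could be scripted along the following lines. This is only a sketch under stated assumptions: the job counts and queue depths below are hypothetical (the exact sweep points are in the app note, not in this post), as are the device path, the `libaio` engine, the random-read pattern, and the helper name `sweep_fio_args`. No CPU pinning is applied here, so the jobs are free to spread across cores.

```python
from itertools import product

# Hypothetical sweep points; the actual values tested are not listed in this post.
JOB_COUNTS = (1, 2, 4, 8)
QUEUE_DEPTHS = (1, 4, 16, 32)


def sweep_fio_args(device, rw="randread", runtime_s=60):
    """Yield ((numjobs, iodepth), fio_args) for every point in the sweep."""
    for numjobs, iodepth in product(JOB_COUNTS, QUEUE_DEPTHS):
        yield (numjobs, iodepth), [
            "fio",
            f"--name=sweep-j{numjobs}-qd{iodepth}",
            f"--filename={device}",    # hypothetical device path
            "--ioengine=libaio",       # assumed async I/O engine
            "--direct=1",
            f"--rw={rw}",
            "--bs=4k",
            f"--numjobs={numjobs}",    # number of jobs (queues)
            f"--iodepth={iodepth}",    # entries per queue
            "--group_reporting",       # one aggregate latency figure per point
            "--time_based",
            f"--runtime={runtime_s}",
            "--output-format=json",
        ]


sweep = dict(sweep_fio_args("/dev/nvme0n1"))
for (numjobs, iodepth), args in sweep.items():
    print(numjobs, iodepth, " ".join(args))
```

The same generator would be run once against the local device and once against the remote one, giving a matched pair of results for each (jobs, queue depth) point.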
Here is a chart comparing the results for 4kB Read I/Os in the case of 8 jobs with varying queue depths:
And the same for 4kB Write I/Os:
Yes, they are virtually identical in all cases.
Looking now at the percentage differences for Read I/Os across various jobs / queue depths, we see that the worst case difference is just 0.15% (not 15%!), with some corners showing an incremental latency as much as -0.24%, meaning that the average latency actually *improves* with disaggregation!
And the equivalent chart for Write I/Os:
For Writes, the worst case difference is slightly higher at about 1.5%, but the other 15 points are all at or below 0.5%.
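For readers who want to reproduce this kind of comparison, the percentage difference at each point is simply the remote latency minus the local latency, divided by the local latency. The sketch below shows that arithmetic. The latency values in it are invented placeholders chosen only to illustrate the calculation, not measured results, and the function and variable names are hypothetical.

```python
def incremental_latency_pct(local_us, remote_us):
    """Percentage by which remote average latency exceeds local average latency.

    A negative result means the remote (disaggregated) path was faster.
    """
    if local_us <= 0:
        raise ValueError("local latency must be positive")
    return (remote_us - local_us) / local_us * 100.0


# Placeholder average latencies in microseconds, keyed by (jobs, queue depth).
# These numbers are made up for illustration; they are NOT measurements.
local_latency = {(8, 1): 100.0, (8, 16): 250.0}
remote_latency = {(8, 1): 100.15, (8, 16): 249.4}

diffs = {
    point: incremental_latency_pct(local_latency[point], remote_latency[point])
    for point in local_latency
}
worst_case = max(diffs.values())

for point, pct in sorted(diffs.items()):
    print(point, f"{pct:+.2f}%")
print("worst case:", f"{worst_case:+.2f}%")
```

Dividing by the local latency (rather than the remote) keeps DAS as the baseline, which matches how the differences are described above.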
How did we achieve this level of scalable, best-in-class latency performance? Through extensive use of hardware acceleration, which we discussed here.
Conclusion: Kazan has previously shown how to achieve essentially identical IOPS / BW performance using NVMe-oF. Now we show that you can achieve essentially identical latency with NVMe-oF, and we’re happy to prove this to you.
The only question remaining is: why aren’t you deploying it today?
(Details and FIO scripts are available in a Fuji performance app note – contact Kazan Networks directly for additional information.)