Achieving 2.8M IOPS with 100Gb NVMe-oF
Looking for high performance NVMe-oF? Here’s how.
At Flash Memory Summit 2018, Kazan Networks publicly demonstrated the world’s fastest NVMe-oF target bridge, codenamed “Fuji”. This new ASIC is capable of pushing more NVMe-oF traffic down a 100Gb Ethernet cable than any other device existing today. Period. And we’re going to share with you how we achieve this.
First, let’s set up the hardware. This will take two mid-performance servers simply to keep up with generating 2.8M I/O requests and ingesting 2.8M data packets per second. Use two of your favorite 100Gb RNICs, such as a Chelsio T62100 or a Mellanox CX5. Note that one runs iWARP and the other RoCE, but that’s okay – Fuji supports both. Install Ubuntu release 16.10 running Linux kernel 4.13, which contains the NVMe-oF Initiator drivers inbox.
Because we’re connecting two servers to a single target, an Ethernet switch is required. We typically use Broadcom- or Mellanox-based switches, but anything that supports 50Gb and 100Gb should work fine. Set the MTU size of the switch to 9216B, enable Global Flow Control, and since 100Gb is involved, turn RS FEC on.
Next, connect Fuji to this switch using a 100Gb cable, and configure those ports (and Fuji) to run 2x50Gb – you’ll achieve about 10% higher performance in this mode. On the PCIe side of Fuji, connect it through a PCIe switch (Broadcom or Microchip) in a 2×8 Gen3 configuration. And then put enough SSDs on the other side of the PCIe switch to be able to keep up with this IOPS rate.
In its FMS18 demo, Kazan used eight “Galant Fox” SSDs from WDC, model HUSPR3216. Assuming you also use 8 SSDs, you’ll need a drive model that can source around 400K IOPS at 4kB, so that eight of them can easily achieve the 2.8M IOPS target performance.
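As a quick sanity check on those numbers, the sketch below works out the per-SSD load and the raw payload rate. It assumes "4kB" means 4096 bytes and counts only the data payload, not Ethernet, RDMA, or NVMe-oF protocol overhead.

```shell
#!/bin/sh
# Back-of-the-envelope budget for the figures quoted above.
target_iops=2800000     # 2.8M IOPS target from the text
num_ssds=8              # SSD count used in the FMS18 demo
io_bytes=4096           # assumption: "4kB" means 4096 bytes

per_ssd_iops=$(( target_iops / num_ssds ))
payload_mbps=$(( target_iops * io_bytes * 8 / 1000000 ))   # megabits per second

echo "IOPS needed per SSD: $per_ssd_iops"
echo "Payload on the wire: $payload_mbps Mb/s"
```

Each drive has to sustain 350K IOPS, which leaves some headroom under a 400K-rated SSD, and the payload alone comes to roughly 91.75 Gb/s of the 100Gb link.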
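You’ll end up with a configuration that looks like this:
[Figure: two host servers, each with a 100Gb RNIC, connected through the Ethernet switch to Fuji, which connects through the PCIe switch to eight SSDs.]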
Now onto configuring the servers. Start by setting the RNICs with an MTU size of 5000B – we want the 4kB I/Os to complete in a single Ethernet frame.
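One way to apply and verify this setting is sketched below. It assumes the RNIC shows up as ens2 (the interface name used in the ethtool example in this article) and that the iproute2 `ip` tool is installed. The `mtu_at_least` function and the `SYSFS_NET` override are hypothetical helpers written for illustration, not part of any Kazan tooling.

```shell
#!/bin/sh
# Hypothetical helper: succeed if an interface's MTU is at least the wanted size.
# SYSFS_NET can be overridden to point at a test directory.
mtu_at_least() {
    iface=$1
    want=$2
    cur=$(cat "${SYSFS_NET:-/sys/class/net}/$iface/mtu") || return 2
    [ "$cur" -ge "$want" ]
}

# To apply the setting (as root), something like:
#   ip link set dev ens2 mtu 5000
# and then confirm it took effect:
#   mtu_at_least ens2 5000 && echo "ens2 MTU is large enough for a 4kB I/O"
```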
Then you will want to optimize a couple of parameters on each host server, namely:
- Enable receive adaptive coalescing
#ethtool -C ens2 adaptive-rx on
- Map the RNIC interrupts evenly across all available CPUs. Here’s a script to accomplish that:
num_cpus=$(grep -c ^processor /proc/cpuinfo)
echo "num_cpus = $num_cpus"
j=0
for i in /sys/bus/pci/drivers/mlx*_core/*/msi_irqs/*
do
  irq=$(basename "$i")
  echo "setting IRQ $irq affinity to core $j"
  echo $j > /proc/irq/$irq/smp_affinity_list
  j=$(( (j + 1) % num_cpus ))
done
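After spreading the interrupts, it can be worth reading the affinities back to confirm the result. The following is a minimal sketch. The `show_irq_affinity` function and the `PROC_IRQ` override are hypothetical names introduced here for illustration.

```shell
#!/bin/sh
# Hypothetical helper: print "IRQ -> CPU list" for each IRQ number given.
# PROC_IRQ can be overridden to point at a test directory.
show_irq_affinity() {
    for irq in "$@"; do
        f="${PROC_IRQ:-/proc/irq}/$irq/smp_affinity_list"
        if [ -r "$f" ]; then
            echo "IRQ $irq -> CPU $(cat "$f")"
        fi
    done
}

# Example: report the affinity of every MSI IRQ of a Mellanox RNIC.
#   for i in /sys/bus/pci/drivers/mlx*_core/*/msi_irqs/*; do
#       show_irq_affinity "$(basename "$i")"
#   done
```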
Next, discover available SSDs using a command like this:
#nvme discover -t rdma -a 10.10.10.167 -s 4420
replacing the IP address with the correct one for your configuration.
You’ll now want to ensure that half of your SSDs get mapped to one server and the balance to the second server to create a symmetrical configuration. In the case of the 8 SSDs Kazan used, the first four connections are established on the first server as:
#nvme connect -t rdma -s 4420 -a 10.10.10.167 -i 8 -n nqn.2015-09.com.kazan-networks:nvme.1
#nvme connect -t rdma -s 4420 -a 10.10.10.167 -i 8 -n nqn.2015-09.com.kazan-networks:nvme.2
#nvme connect -t rdma -s 4420 -a 10.10.10.167 -i 8 -n nqn.2015-09.com.kazan-networks:nvme.3
#nvme connect -t rdma -s 4420 -a 10.10.10.167 -i 8 -n nqn.2015-09.com.kazan-networks:nvme.4
And the balance of SSDs connected to the second server as:
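#nvme connect -t rdma -s 4420 -a 10.10.10.168 -i 16 -n nqn.2015-09.com.kazan-networks:nvme.5
#nvme connect -t rdma -s 4420 -a 10.10.10.168 -i 16 -n nqn.2015-09.com.kazan-networks:nvme.6
#nvme connect -t rdma -s 4420 -a 10.10.10.168 -i 16 -n nqn.2015-09.com.kazan-networks:nvme.7
#nvme connect -t rdma -s 4420 -a 10.10.10.168 -i 16 -n nqn.2015-09.com.kazan-networks:nvme.8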
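Rather than typing each connection by hand, the commands can be generated in a loop. The sketch below assumes the subsystems are numbered consecutively, as the NQNs above suggest. `connect_cmds` is a hypothetical helper that only prints the commands, so you can review them before running anything.

```shell
#!/bin/sh
# Hypothetical helper: print one "nvme connect" command per subsystem number.
# Arguments: target address, first subsystem, last subsystem, I/O queue count.
connect_cmds() {
    addr=$1; first=$2; last=$3; queues=$4
    n=$first
    while [ "$n" -le "$last" ]; do
        echo "nvme connect -t rdma -s 4420 -a $addr -i $queues -n nqn.2015-09.com.kazan-networks:nvme.$n"
        n=$(( n + 1 ))
    done
}

# First server:   connect_cmds 10.10.10.167 1 4 8
# Second server:  connect_cmds 10.10.10.168 5 8 16
# Pipe the output to "sh" (as root) to actually run the commands.
```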
Now fire up FIO to generate 4kB random read I/Os on both servers. You should see about 1.4M IOPS on each server; add those together.
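Kazan's actual FIO scripts are in the app note mentioned below, so the following is only a rough sketch of what such a run might look like. The job name, queue depth, job count, and runtime are placeholder assumptions, and `fio_cmd` is a hypothetical helper that prints a command line rather than running it.

```shell
#!/bin/sh
# Hypothetical helper: print a fio command line for 4kB random reads
# against the given block devices. Queue depth, job count, and runtime
# are placeholder assumptions, not Kazan's published settings.
fio_cmd() {
    devs=$(echo "$*" | tr ' ' ':')
    echo "fio --name=randread4k --filename=$devs --ioengine=libaio --direct=1 --rw=randread --bs=4k --iodepth=32 --numjobs=8 --time_based --runtime=60 --group_reporting"
}

# Example (device names depend on how the namespaces enumerated):
#   fio_cmd /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1

# Summing the two per-server results quoted in the text:
total_iops=$(( 1400000 + 1400000 ))
echo "total IOPS: $total_iops"
```

Expect to tune the queue depth and job count for your own servers; the values here are only a starting point.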
That’s 2.8M IOPS through a single 100Gb Fuji bridge ASIC and it just doesn’t get any faster than that!
(Details and FIO scripts are available in a Fuji performance app note – contact Kazan Networks directly for additional information.)