The Convergence of NVMe Flash Storage and TCP Networking

Last week something huge happened: the NVMe Work Group ratified the NVMe/TCP standard. So you might ask, what’s the big deal? NVMe is high-performance flash storage, the same class of memory as the storage in your smartphone, only a whole lot faster. And what does TCP add to the equation? Our vision with NVMe/TCP was to deliver on the promise of ubiquitous, ultra-high-performance storage. NVMe will replace spinning disks in computers; it’s just a matter of time, given that NVMe drives are over 1,000X faster on the metrics that matter. The cost of NVMe storage can still be a limiting factor, but it is dropping rapidly, following its own variation of Moore’s Law. Now couple that with the Transmission Control Protocol (TCP), the language of the Internet, and you’ve got something compelling.

Before this week, if you wanted to access NVMe storage over a network, you needed to use RDMA over Converged Ethernet (RoCE) or Fibre Channel. While RoCE works over Ethernet, it’s like taking a NASCAR out to the market for groceries: you can do it, but it won’t be an easy trip, and likely not one you’ll want to repeat. As for Fibre Channel, you’re running over the same dedicated storage networking hardware; you’ve just swapped out the software stack, which is not much of a gain. TCP, on the other hand, is the game changer. Nearly every computer on the planet runs TCP, from the Amazon Echo on your kitchen countertop to the Apple Watch on your wrist. Originally designed for the military, TCP is EVERYWHERE; its original design criteria were fault tolerance and automatic routing around outages. TCP means that someday soon your devices will have reliable, seamless access to performance storage ANYWHERE in the world. So how did we get here?

In January of 2017, Solarflare did an informal survey of cloud customers and uncovered an emerging trend toward disaggregated storage over Ethernet. Efforts were already underway to move the new NVMe flash standard onto storage fabrics using both RoCE and Fibre Channel. Solarflare decided its patented TCP kernel bypass acceleration technology, known for half-round-trip performance under 900 nanoseconds, could be applied to NVMe-oF. Solarflare’s TCP latencies compared very favorably against RoCE. Within a month Solarflare delivered sample code which demonstrated that performance NVMe over TCP was not only possible but competitive. The initial code was delivered as an extension to the Linux kernel, and it would, in fact, work on anyone’s Network Interface Card (NIC), not just Solarflare’s performance NICs. By May of 2017, Solarflare had posted this code to GitHub and shared it with the NVMe Work Group. At this point, Solarflare’s newer TCP code was delivering performance within 10% of RoCE. At the Flash Memory Summit in August of this year, Solarflare demonstrated 20-microsecond access to Intel Optane NVMe storage over TCP. This demonstration showed that TCP performance was directly competitive with that of RoCE.

Two weeks ago Solarflare visited the University of New Hampshire’s (UNH) Inter-Operability Laboratory (IOL) to test both its NVMe-oF/TCP initiator (client) and target (server). An engineer from another company, who had crafted that company’s NVMe-oF/RoCE stack, was bringing up their TCP stack, and once it came up he exclaimed, “Wow! Compared to RoCE this is so easy!” At this UNH IOL event, Solarflare was testing its new kernel bypass NVMe-oF/TCP target, the one demonstrated at the Flash Memory Summit, as well as its kernel-mode initiator. The target enables storage vendors who adopt it to tune their solutions with our kernel bypass stack and achieve the best possible TCP performance for their storage appliances.
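To make the target side concrete: the Linux kernel’s own in-box NVMe target framework (nvmet, which gained a TCP transport in mainline Linux 5.0) can export a local NVMe drive over TCP through configfs. Here is a minimal sketch, assuming a spare local device at /dev/nvme0n1, a made-up server address of 10.0.0.1, and an example NQN invented for illustration:

```shell
# Load the NVMe target core and its TCP transport (mainline Linux 5.0+).
modprobe nvmet nvmet_tcp

# Create a subsystem; the NQN below is an example name, not a real one.
SUBSYS=/sys/kernel/config/nvmet/subsystems/nqn.2018-11.org.example:tcp-demo
mkdir -p "$SUBSYS"
echo 1 > "$SUBSYS/attr_allow_any_host"   # demo only: no host access control

# Expose a local NVMe device as namespace 1 of the subsystem.
mkdir "$SUBSYS/namespaces/1"
echo /dev/nvme0n1 > "$SUBSYS/namespaces/1/device_path"
echo 1 > "$SUBSYS/namespaces/1/enable"

# Create a TCP listener on 4420, the standard NVMe/TCP port.
PORT=/sys/kernel/config/nvmet/ports/1
mkdir -p "$PORT"
echo tcp      > "$PORT/addr_trtype"
echo ipv4     > "$PORT/addr_adrfam"
echo 10.0.0.1 > "$PORT/addr_traddr"    # assumed server IP
echo 4420     > "$PORT/addr_trsvcid"

# Link the subsystem to the port to start serving it.
ln -s "$SUBSYS" "$PORT/subsystems/"
```

This is the stock kernel path, not Solarflare’s kernel bypass target, but it illustrates how little machinery an NVMe/TCP target actually needs: a block device, a subsystem name, and an ordinary TCP port.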

Conversely, a kernel-mode initiator means that every server or client can easily attach to this storage without any special hardware. So, what does NVMe-oF/TCP mean? It means that servers no longer require a second dedicated storage network, like Fibre Channel, to attach remote high-performance NVMe storage. As Ethernet networks move from 10GbE to 25GbE, and on to 100GbE and eventually 400GbE, storage traffic can transit these networks without any additional changes or effort. It’s all just Ethernet. TCP means that storage servers can quickly scale to 100GbE connections while compute servers, and other storage clients that use the initiator, can remain unchanged at 25GbE or even 10GbE. Older systems, with a small software upgrade to add the new initiator, can easily connect to performance storage over their existing network connections. Unlike anything else, TCP democratizes performance NVMe storage, making it available to any system that has the initiator built into its kernel. While delivering in-box NVMe-oF/TCP may take another year or two to work its way through the operating system providers and distributions, as a ratified standard it’s now only a matter of time before it is available everywhere. Once NVMe-oF/TCP is in-box from your OS provider, we’ll all have access to high-performance, ubiquitous storage. Mission accomplished!
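On the client side, once the in-box initiator arrives (the nvme_tcp module in mainline Linux 5.0 and later, driven by the standard nvme-cli utility), attaching remote storage is a short exercise. A sketch, assuming a target listening at the hypothetical address 10.0.0.1 and serving an example NQN invented for illustration:

```shell
# Load the in-kernel NVMe/TCP initiator (mainline Linux 5.0+).
modprobe nvme_tcp

# Ask the target what subsystems it offers; 4420 is the standard NVMe/TCP port.
nvme discover -t tcp -a 10.0.0.1 -s 4420

# Connect to one of them; the NQN here is an example name, not a real one.
nvme connect -t tcp -a 10.0.0.1 -s 4420 \
     -n nqn.2018-11.org.example:tcp-demo

# The remote namespace now appears as an ordinary local block device.
nvme list

# Detach when finished.
nvme disconnect -n nqn.2018-11.org.example:tcp-demo
```

Note there is no special NIC, no RDMA configuration, and no second fabric in that sequence: just an IP address, a port, and a name, which is exactly the point.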