“DARE: High-Performance State Machine Replication on RDMA” (2015)
DBLP: https://dblp.org/rec/conf/hpdc/PokeH15
Summary: This systems paper describes an RDMA-based protocol for linearizable state machine replication. During failure-free operation, it replicates data using one-sided RDMA writes. In addition, the authors describe how to exploit RDMA queue pair (QP) state transitions to control local memory access during leader election and failure handling.
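
The QP state-transition trick generalizes beyond DARE: since a replica controls the state of its own queue pairs, it can unilaterally cut off a deposed leader's one-sided write access. Below is a minimal ibverbs sketch of that idea, not DARE's actual code; `qp` is assumed to be an already-connected RC queue pair serving the peer in question.

```cpp
#include <infiniband/verbs.h>
#include <cstring>

// Revoke a (possibly stale) leader's one-sided write access by moving
// the queue pair that serves that peer into the Error state. RDMA
// writes the peer posts afterwards fail on its side instead of
// silently landing in local memory. Error handling elided.
static int revoke_peer_write_access(ibv_qp *qp) {
    ibv_qp_attr attr;
    std::memset(&attr, 0, sizeof(attr));
    attr.qp_state = IBV_QPS_ERR;  // any state -> ERR is a legal transition
    return ibv_modify_qp(qp, &attr, IBV_QP_STATE);
}
```

To restore access once a new leader is established, the QP would be cycled through the Reset state and back up to ready-to-receive/ready-to-send.
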
“FaSST: Fast, Scalable and Simple Distributed Transactions with Two-Sided (RDMA) Datagram RPCs” (2016)
DBLP: https://dblp.org/rec/conf/osdi/KaliaKA16
Summary: In contrast to earlier systems that leverage reliable one-sided RDMA, FaSST uses two-sided RPCs over unreliable datagrams. FaSST provides scalable, serializable, and durable distributed transactions. The rather unusual choice of unreliable datagram transport is justified with several micro-benchmarks: Kalia et al. show that one-sided RDMA has deep-rooted scalability issues, because communication over connected transports overflows the NIC's internal cache, especially in all-to-all (N-to-N) scenarios. Furthermore, the paper demonstrates several advantages of the FaSST RPC abstraction layer. Transactions are implemented on top of that layer, and records are stored in the MICA hash table. The main takeaway is that two-sided RPCs over unreliable datagrams achieve better performance and outperform several one-sided implementations such as Microsoft's FaRM.
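
A rough picture of what "two-sided unreliable datagram" means at the verbs level, sketched below (not FaSST's actual code): the sender posts a plain SEND on a UD queue pair and names the destination per message via an address handle, and the receiver must have posted a matching receive. `qp`, `ah`, the remote QP number/qkey, and the registered buffer are assumed to be set up beforehand.

```cpp
#include <infiniband/verbs.h>
#include <cstdint>
#include <cstring>

// Post one two-sided request over an Unreliable Datagram (UD) queue
// pair. The message must fit into a single MTU; an RPC layer like
// FaSST's handles loss detection and retransmission in software.
static int post_ud_request(ibv_qp *qp, ibv_ah *ah,
                           uint32_t remote_qpn, uint32_t remote_qkey,
                           void *buf, uint32_t len, uint32_t lkey) {
    ibv_sge sge;
    std::memset(&sge, 0, sizeof(sge));
    sge.addr = reinterpret_cast<uint64_t>(buf);
    sge.length = len;
    sge.lkey = lkey;

    ibv_send_wr wr, *bad_wr = nullptr;
    std::memset(&wr, 0, sizeof(wr));
    wr.opcode = IBV_WR_SEND;            // two-sided: consumes a posted recv
    wr.send_flags = IBV_SEND_SIGNALED;
    wr.sg_list = &sge;
    wr.num_sge = 1;
    wr.wr.ud.ah = ah;                   // datagram: destination chosen per send
    wr.wr.ud.remote_qpn = remote_qpn;
    wr.wr.ud.remote_qkey = remote_qkey;
    return ibv_post_send(qp, &wr, &bad_wr);
}
```

Because a UD QP is connectionless, a single QP can reach any number of peers; NIC connection state, and thus its cache footprint, stays constant as the cluster grows, which is exactly the scalability argument the paper makes against connected one-sided transports.
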
“Datacenter RPCs can be General and Fast” (2019)
DBLP: https://dblp.org/rec/conf/nsdi/KaliaKA19
Summary: This paper questions the common belief that datacenter networking software must sacrifice generality to attain high performance. General-purpose stacks such as kernel TCP and gRPC are often too slow for datacenter-grade workloads, whereas DPDK and RDMA are fast but complex and not general. Kalia et al. bridge this gap with eRPC, a general-purpose and fast RPC implementation. eRPC is evaluated against three key metrics: message rate for small messages, bandwidth for large messages, and scalability to a large number of nodes and CPU cores. Furthermore, it handles packet loss, congestion, and background request execution when running on a lossy network. In micro-benchmarks, one CPU core handles up to 10 million small RPCs per second or sends large messages at 75 Gbps. eRPC achieves 5.5 microseconds of replication latency on lossy Ethernet, which is faster than or comparable to specialized replication systems that use programmable switches, FPGAs, or RDMA.
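
For a sense of the programming model, here is a client skeleton loosely following the hello-world example in the eRPC repository (https://github.com/erpc-io/eRPC); the names and signatures are reproduced from memory and may not match the current API exactly, and the URIs and constants are placeholders.

```cpp
#include "rpc.h"  // eRPC's main header

static constexpr const char *kServerUri = "192.168.1.2:31850";  // placeholder
static constexpr uint8_t kReqType = 1;  // request type, matched by a server handler
static constexpr size_t kMsgSize = 16;

erpc::Rpc<erpc::CTransport> *rpc;
erpc::MsgBuffer req, resp;

// Continuation: invoked when the response has been written into `resp`.
void cont_func(void *, void *) { /* consume resp here */ }
// Session-management callback; a real client would check for errors.
void sm_handler(int, erpc::SmEventType, erpc::SmErrType, void *) {}

int main() {
  erpc::Nexus nexus("192.168.1.1:31850");  // this client's own URI (placeholder)
  rpc = new erpc::Rpc<erpc::CTransport>(&nexus, nullptr, 0, sm_handler);

  int session = rpc->create_session(kServerUri, 0 /* remote RPC id */);
  while (!rpc->is_connected(session)) rpc->run_event_loop_once();

  req = rpc->alloc_msg_buffer_or_die(kMsgSize);
  resp = rpc->alloc_msg_buffer_or_die(kMsgSize);
  rpc->enqueue_request(session, kReqType, &req, &resp, cont_func, nullptr);
  rpc->run_event_loop(100);  // drive eRPC's event loop for 100 ms
  delete rpc;
}
```

Note the asynchronous style: requests are enqueued with a continuation, and all progress (transmission, retransmission on loss, congestion control) happens inside the event loop on the caller's thread, which is how eRPC keeps one core busy with millions of small RPCs.
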