description |
---|
An improved RoCE NIC (IRN) design that makes a few simple changes to the RoCE NIC for better handling of packet losses. |
Presented at SIGCOMM 2018.
Authors: Radhika Mittal (UC Berkeley), Alexander Shpiner (Mellanox), Aurojit Panda (NYU), Eitan Zahavi (Mellanox), Arvind Krishnamurthy (UW), Sylvia Ratnasamy, Scott Shenker (UC Berkeley).
- This paper proposes an improved RoCE NIC (IRN) design that makes a few simple changes to the RoCE NIC for better handling of packet losses.
- It shows that PFC is not fundamentally required to support RoCE.
- It shows that IRN (without PFC) outperforms RoCE (with PFC) by 6-83% for typical network scenarios.
- InfiniBand RDMA
- Long used in the HPC community.
- Uses credit-based flow control to make the network lossless.
- Not designed to recover efficiently from packet losses, because packet drops are rare in such clusters.
- Mechanism
- When the receiver receives an out-of-order packet, it simply discards it and sends a negative acknowledgement (NACK) to the sender.
- When the sender sees a NACK, it retransmits all packets that were sent after the last acknowledged packet (i.e., it performs a go-back-N retransmission).
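
The go-back-N behaviour described above can be sketched as a toy simulation. The function name, the "each listed packet is lost once" loss model, and the instantaneous NACK are illustrative assumptions, not details from the paper or from any NIC implementation.

```python
def go_back_n_trace(num_packets, drop_first_copy):
    """Return the sequence numbers a go-back-N sender transmits, in order.

    Assumptions (hypothetical model): every sequence number listed in
    drop_first_copy is lost exactly once, and a NACK reaches the sender
    before it transmits its next packet.
    """
    to_drop = set(drop_first_copy)
    expected = 0        # receiver: next in-order sequence number it will accept
    next_to_send = 0    # sender: next sequence number to transmit
    trace = []
    while expected < num_packets:
        if next_to_send >= num_packets:
            # Nothing left to send but data is still missing: model a timeout.
            next_to_send = expected
        seq = next_to_send
        trace.append(seq)
        next_to_send += 1
        if seq in to_drop:
            to_drop.discard(seq)  # lost in the network; the receiver sees nothing
            continue
        if seq == expected:
            expected += 1         # in-order packet is accepted
        else:
            # Out-of-order packet: the receiver discards it and NACKs,
            # so the sender goes back to the first unacknowledged packet.
            next_to_send = expected
    return trace


print(go_back_n_trace(5, {2}))  # [0, 1, 2, 3, 2, 3, 4]
```

Packet 3 is sent twice even though its first copy arrived intact. With a realistic round-trip delay before the NACK arrives, more correctly delivered packets would be resent.
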
- RoCE enables RDMA over Ethernet, including IP-routed networks.
- Adopts the same InfiniBand transport design.
- Uses PFC to make the network lossless.
- Priority Flow Control (PFC)
- Ethernet’s flow control mechanism.
- A switch sends a pause (or X-OFF) frame to the upstream entity (a switch or a NIC) when its queue exceeds a configured threshold.
- When the queue drains below this threshold, an X-ON frame is sent to resume transmission.
- Limitations: PFC causes various performance issues (such as head-of-line blocking, congestion spreading, and deadlocks) and makes the network harder to understand and manage.
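
The pause/resume behaviour above can be sketched as a small state machine. The class and method names are hypothetical. The sketch follows the single-threshold description in these notes and ignores per-priority queues and buffer headroom.

```python
class PfcQueue:
    """Toy model of one switch queue that emits X-OFF / X-ON frames."""

    def __init__(self, threshold):
        self.threshold = threshold      # configured pause threshold, in packets
        self.depth = 0                  # current queue occupancy
        self.paused_upstream = False    # True after X-OFF, until X-ON is sent

    def enqueue(self):
        """Accept one packet; return "X-OFF" if the upstream must be paused."""
        self.depth += 1
        if not self.paused_upstream and self.depth > self.threshold:
            self.paused_upstream = True
            return "X-OFF"
        return None

    def dequeue(self):
        """Drain one packet; return "X-ON" if the upstream may resume."""
        if self.depth > 0:
            self.depth -= 1
        if self.paused_upstream and self.depth < self.threshold:
            self.paused_upstream = False
            return "X-ON"
        return None


q = PfcQueue(threshold=2)
print([q.enqueue() for _ in range(3)])  # [None, None, 'X-OFF']
print([q.dequeue() for _ in range(2)])  # [None, 'X-ON']
```
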
- iWARP vs RoCE
- iWARP implements the entire TCP stack in hardware and must translate TCP's byte-stream semantics to RDMA segments.
- As a result, iWARP is significantly more complex and expensive than RoCE, with inferior performance.
- Does RDMA require a lossless network (which includes PFC)?
- The answer is no.
- Improve the loss recovery mechanism.
- Selective retransmission (inspired by TCP’s loss recovery).
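
Selective retransmission in general can be sketched as follows. This is a toy model and not the paper's NIC logic. The function name, the "each listed packet is lost once" loss model, and the instantaneous receiver feedback are assumptions made for illustration.

```python
def selective_retransmit_trace(num_packets, drop_first_copy):
    """Return the sequence numbers a selective-retransmit sender transmits.

    Assumptions (hypothetical model): every sequence number listed in
    drop_first_copy is lost exactly once, and receiver feedback reaches
    the sender before it transmits its next packet.
    """
    to_drop = set(drop_first_copy)
    received = set()         # receiver keeps out-of-order packets
    retransmit_queue = []    # sender: packets reported missing
    next_new = 0             # sender: next never-sent sequence number
    trace = []
    while len(received) < num_packets:
        if retransmit_queue:
            seq = retransmit_queue.pop(0)   # lost packets take priority
        elif next_new < num_packets:
            seq = next_new
            next_new += 1
        else:
            # No feedback can arrive any more: model a timeout that
            # queues every packet still missing at the receiver.
            retransmit_queue = [s for s in range(num_packets) if s not in received]
            continue
        trace.append(seq)
        if seq in to_drop:
            to_drop.discard(seq)            # lost in the network
            continue
        received.add(seq)
        # Feedback: every hole below the packet that just arrived is lost.
        for missing in range(seq):
            if missing not in received and missing not in retransmit_queue:
                retransmit_queue.append(missing)
    return trace


print(selective_retransmit_trace(5, {2}))  # [0, 1, 2, 3, 2, 4]
```

Only the lost packet is sent a second time. Packets that arrived out of order are kept by the receiver instead of being discarded.
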
- BDP-FC (BDP-based flow control) mechanism.
- Basic end-to-end packet level flow control, which bounds the number of in-flight packets by the bandwidth-delay product (BDP) of the network (as suggested in pFabric).
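
The BDP bound described above can be illustrated with a short calculation. The function names, the rounding choice, and the example numbers (40 Gbps, 24 µs round trip, 1000-byte packets) are assumptions for illustration, not values taken from these notes.

```python
def bdp_cap_packets(bandwidth_bps, rtt_us, mtu_bytes):
    """Bandwidth-delay product expressed in packets, rounded up.

    Integer arithmetic is used to avoid floating-point rounding surprises.
    """
    bdp_bytes = bandwidth_bps * rtt_us // (8 * 1_000_000)
    return max(1, -(-bdp_bytes // mtu_bytes))  # ceiling division


def can_send(next_seq, acked_up_to, cap):
    """True if sending packet next_seq keeps in-flight packets within cap.

    acked_up_to is the number of packets cumulatively acknowledged so far.
    """
    in_flight = next_seq - acked_up_to
    return in_flight < cap


cap = bdp_cap_packets(40_000_000_000, 24, 1000)
print(cap)                     # 120
print(can_send(119, 0, cap))   # True: 119 packets in flight
print(can_send(120, 0, cap))   # False: the cap of 120 is reached
```

Once the cap is reached, the sender waits for an acknowledgement before transmitting a new packet, which limits how much data a single flow can queue in the network.
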