Introduction
OpenAI has announced a partnership with several leading technology companies, including AMD, Broadcom, Intel, Microsoft, and Nvidia, to develop a new open-standard protocol called Multipath Reliable Connection (MRC). The protocol is meant to make massive AI training clusters more efficient by removing network bottlenecks.
What is MRC?
The Multipath Reliable Connection protocol builds on an existing technology known as RDMA over Converged Ethernet (RoCE). RDMA stands for remote direct memory access: one machine reads or writes another machine's memory over the network without involving the remote CPU in each transfer, which speeds up data movement between hardware components such as Graphics Processing Units (GPUs) and Central Processing Units (CPUs). MRC extends this by dividing network connections into smaller links, which creates many parallel paths for data to travel at the same time. A single data transfer can then be spread across hundreds of routes through the network instead of following just one, which significantly speeds up the transfer.
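The multipath idea described above can be illustrated with a toy model. The following is a minimal sketch and not the MRC wire format: the `Packet` type, the `spray` and `reassemble` functions, and the round-robin path assignment are all hypothetical, chosen only to show how one transfer can be split across several paths and still be reassembled when packets arrive out of order.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Packet:
    offset: int     # byte offset of this payload in the destination buffer
    payload: bytes  # the slice of the message carried by this packet
    path_id: int    # hypothetical identifier of the network path used


def spray(message: bytes, num_paths: int, mtu: int = 4096) -> list[Packet]:
    """Split a message into mtu-sized packets and assign paths round-robin.

    Round-robin is an assumption made for illustration; a real sender
    could pick paths in other ways.
    """
    packets = []
    for i, start in enumerate(range(0, len(message), mtu)):
        packets.append(Packet(start, message[start:start + mtu], i % num_paths))
    return packets


def reassemble(packets: list[Packet], total_len: int) -> bytes:
    """Write each payload at its offset, so arrival order does not matter."""
    buf = bytearray(total_len)
    for p in packets:
        buf[p.offset:p.offset + len(p.payload)] = p.payload
    return bytes(buf)


message = bytes(range(256)) * 64           # a 16 KiB example transfer
packets = spray(message, num_paths=4, mtu=1024)
random.shuffle(packets)                    # simulate out-of-order arrival
restored = reassemble(packets, len(message))
```

Because every packet carries its own destination offset, the receiver does not need packets from different paths to arrive in order, which is what makes spreading a single transfer over many routes practical in this sketch.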
Deployment and Usage
OpenAI has deployed MRC on its large Nvidia GB200 supercomputers, which are essential for training advanced AI models. These systems include the Oracle Cloud Infrastructure site in Abilene, Texas, and Microsoft's Fairwater supercomputers. The protocol has already been used to train several of OpenAI's models, running on hardware from Nvidia and Broadcom.
Advantages of MRC
One of the key features of MRC is its use of IPv6 Segment Routing (SRv6). This lets the sender specify the exact path each data packet takes through the network, rather than relying on traditional hop-by-hop routing, which can be slower. MRC can also detect network failures and reroute data within microseconds, a significant improvement over older systems that can take much longer to respond.
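Sender-side path selection with fast failover, as described above, can be sketched as follows. This is an illustrative model only, not MRC's actual mechanism or any SRv6 library API: the `SourceRouter` class, its methods, and the switch names are hypothetical, and each path is represented simply as an ordered tuple of hops standing in for a segment list.

```python
from dataclasses import dataclass, field


@dataclass
class SourceRouter:
    # Each path is an ordered list of hops (a stand-in for a segment list).
    paths: list[tuple[str, ...]]
    failed: set[int] = field(default_factory=set)
    _next: int = 0

    def pick_path(self) -> tuple[int, tuple[str, ...]]:
        """Return the next healthy path in round-robin order."""
        for _ in range(len(self.paths)):
            idx = self._next
            self._next = (self._next + 1) % len(self.paths)
            if idx not in self.failed:
                return idx, self.paths[idx]
        raise RuntimeError("no healthy path left")

    def report_failure(self, idx: int) -> None:
        """Stop using a path, for example after packet loss is detected."""
        self.failed.add(idx)


router = SourceRouter(paths=[
    ("leaf1", "spine1", "leaf9"),
    ("leaf1", "spine2", "leaf9"),
    ("leaf1", "spine3", "leaf9"),
])
router.report_failure(1)  # e.g. the hypothetical switch spine2 went down
chosen = [router.pick_path()[0] for _ in range(4)]
print(chosen)  # [0, 2, 0, 2]
```

Because the sender holds the list of paths itself, it can skip a failed one on the very next packet without waiting for the network to recompute routes, which is the intuition behind the fast rerouting in this sketch.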
OpenAI reports that more than 900 million people use its ChatGPT service each week, which underscores why efficient training of its AI models matters. During the training of a recent model, OpenAI rebooted critical network switches without disrupting the ongoing training run, a demonstration of the MRC protocol's robustness.
