UCCL is an efficient collective communication library for GPUs.
Existing network transports under NCCL (i.e., kernel TCP and RDMA) use one or a few network paths to stream huge data volumes, making them prone to congestion in datacenter networks. Instead, UCCL employs packet spraying in software to leverage the abundant network paths in the datacenter, avoiding a "single path of congestion". With this design, UCCL provides the following benefits:
- Faster collectives by leveraging multi-path
- Widely available in the public cloud by leveraging legacy NICs and Ethernet fabric
- Evolvable transport designs including multi-path load balancing and congestion control
- Open-source research platform for ML collectives
On two AWS g4dn.8xlarge instances with 50G NICs and T4 GPUs in a cluster placement group, UCCL outperforms NCCL by up to 3.7x for AllReduce.
UCCL currently supports AWS ENA NICs and IBM VirtIO NICs; support for Azure and GCP NICs and for RDMA is on the way. It is implemented as an NCCL plugin library that serves as a drop-in replacement for NCCL applications. Here, we show how to run the standard nccl-tests with UCCL atop two AWS g4dn.8xlarge instances with T4 GPUs.
1. Create two `g4dn.8xlarge` instances, each with a second ENA NIC and a public IP:
   - Log in to the EC2 console in `us-east-1` and click `Launch instances`
   - Enter `Name and tags`
   - Select the `Deep Learning OSS Nvidia Driver AMI GPU PyTorch 2.5 (Ubuntu 22.04)` AMI or the latest version
     - Alternatively, we have prepared an AMI (`ami-07f7062a5d995d7c4`) to simplify the dependency setup in step 2
   - Select `g4dn.8xlarge` for `Instance type` and choose your own `Key pair`
   - Click `Edit` for `Network settings`, then select a random subnet and disable `Auto-assign public IP`
   - Click `Advanced network configuration`, then click `Add network interface`
   - Configure the security group rules to allow any traffic to go through the instances
   - Under `Summary`, enter 2 for `Number of instances`
   - Click `Launch instance`
   - Back on the EC2 console page, click `Elastic IPs`, then `Allocate Elastic IP address` to allocate two public IPs
   - Back on the `Elastic IPs` page, for each public IP, right-click it and choose `Associate Elastic IP address`
     - Click `Network interface`, then enter the first network interface ID of each VM
     - Check `Allow this Elastic IP address to be reassociated`, then click `Associate`
   - Now you should be able to log in to `VM1` and `VM2` via ssh over the public IPs
   - Configure the necessary ssh keys so that `VM1` can ssh to both `VM1` (itself) and `VM2` without a password, as in the sketch after this list
     - Note that we do not support ssh agent forwarding yet: e.g., if you use the `ForwardAgent yes` option in `.ssh/config`, you still need to configure the necessary ssh keys on the VMs rather than relying on the key in the ssh agent
     - E.g., you can run `ssh-keygen` on `VM1` to generate a temporary pub-priv key pair, then copy the pub key to `~/.ssh/authorized_keys` on both `VM1` and `VM2`
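   A minimal sketch of this key setup on `VM1`, assuming the AMI's default `ubuntu` user (`<VM2 public IP>` is a placeholder):

   ```bash
   # On VM1: generate a temporary key pair, then authorize it on both VMs
   ssh-keygen -t ed25519 -f ~/.ssh/id_ed25519 -N ""
   cat ~/.ssh/id_ed25519.pub >> ~/.ssh/authorized_keys          # VM1 -> VM1 (itself)
   ssh-copy-id -i ~/.ssh/id_ed25519.pub ubuntu@<VM2 public IP>  # VM1 -> VM2
   ```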
2. Configure the two VM instances for UCCL tests as follows. Note: if you have used our provided AMI, you can skip this step.
   - Build `uccl`:
     - `git clone https://github.com/uccl-project/uccl.git`
     - `export UCCL_HOME=$(pwd)/uccl`
     - Install dependencies:
       ```bash
       sudo apt update
       sudo apt install -y clang llvm libelf-dev libpcap-dev build-essential \
           libc6-dev-i386 linux-tools-$(uname -r) libgoogle-glog-dev libgtest-dev \
           byobu net-tools iperf iperf3 cmake m4
       wget https://repo.anaconda.com/archive/Anaconda3-2024.10-1-Linux-x86_64.sh
       bash ./Anaconda3-2024.10-1-Linux-x86_64.sh
       source ~/.bashrc
       conda init
       ```
     - Build UCCL (ignore "config.h: No such file or directory" at the end):
       ```bash
       cd $UCCL_HOME
       make
       ```
     - On Amazon VMs (skip this step in other environments): update the AWS ENA driver to support zero-copy AF_XDP
       ```bash
       # Install the latest ENA driver, persistent across reboots
       sudo apt-get install dkms
       git clone https://github.com/amzn/amzn-drivers.git -b ena_linux_2.13.0
       sudo mv amzn-drivers /usr/src/amzn-drivers-2.13.0
       sudo vi /usr/src/amzn-drivers-2.13.0/dkms.conf
       # Paste the following into dkms.conf and save the file:
       #   PACKAGE_NAME="ena"
       #   PACKAGE_VERSION="2.13.0"
       #   CLEAN="make -C kernel/linux/ena clean"
       #   MAKE="make -C kernel/linux/ena/ BUILD_KERNEL=${kernelver}"
       #   BUILT_MODULE_NAME[0]="ena"
       #   BUILT_MODULE_LOCATION="kernel/linux/ena"
       #   DEST_MODULE_LOCATION[0]="/updates"
       #   DEST_MODULE_NAME[0]="ena"
       #   REMAKE_INITRD="yes"
       #   AUTOINSTALL="yes"
       sudo dkms add -m amzn-drivers -v 2.13.0
       sudo dkms build -m amzn-drivers -v 2.13.0
       sudo dkms install -m amzn-drivers -v 2.13.0
       sudo modprobe -r ena; sudo modprobe ena
       ```
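       To confirm the updated driver is active (a sketch; `ens5` is an assumed interface name, check yours with `ip -br link`):
       ```bash
       modinfo ena | grep -i ^version   # expect 2.13.0
       ethtool -i ens5 | grep driver    # the NIC should report the ena driver
       ```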
     - On IBM VMs: upgrade the kernel to the latest (>6.2) to support AF_XDP. For example, on the Ubuntu 22.04 image:
       ```bash
       sudo apt update
       sudo apt install linux-image-generic-hwe-22.04
       sudo apt install -y linux-headers-$(uname -r) build-essential
       ```
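       The new kernel only takes effect after a reboot; a quick check afterwards (sketch):
       ```bash
       sudo reboot
       # After reconnecting:
       uname -r    # expect a kernel version > 6.2
       ```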
   - Build `nccl` and `nccl-tests`:
     ```bash
     cd $UCCL_HOME/nccl
     make src.build -j
     cp src/include/nccl_common.h build/include/
     cd ..
     # Consider "conda deactivate" when hitting dependency errors
     cd $UCCL_HOME/nccl-tests
     make MPI=1 MPI_HOME=/usr/lib/x86_64-linux-gnu/openmpi CUDA_HOME=/usr/local/cuda NCCL_HOME=$UCCL_HOME/nccl/build -j
     cd ..
     ```
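     To sanity-check both builds (a sketch; the paths assume NCCL's standard `src.build` layout and nccl-tests' default output directory):
     ```bash
     ls $UCCL_HOME/nccl/build/lib/libnccl.so          # NCCL core library
     ls $UCCL_HOME/nccl-tests/build/all_reduce_perf   # one of the nccl-tests binaries
     ```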
3. Run UCCL transport tests on `VM1`:
   - `cd $UCCL_HOME && git pull`
   - Edit `nodes.txt` to include only the two IPs of the VMs (see the sketch after this list)
   - Build UCCL: `python setup_all.py --target aws_g4_afxdp`
     - Keep `setup_all.py` running. Note: this will build and set up UCCL on both VMs
   - Run UCCL tests: `cd $UCCL_HOME/afxdp/`
     - [`VM1`] `./transport_test --logtostderr=1 --clientip=<VM2 IP> --test=bimq`
     - [`VM2`] `./transport_test --logtostderr=1 --client --serverip=<VM1 IP> --test=bimq`
     - [`VM2`] You should see something like `Sent 10000 messages, med rtt: 1033 us, tail rtt: 1484 us, link bw 98.3371 Gbps, app bw 95.3775 Gbps`.
     - If you hit `[util_afxdp.cc:30] Check failed: receive_fd(afxdp_ctl.client_sock_, &afxdp_ctl.umem_fd_) == 0`, try `make -C afxdp/ clean`, then `python setup_all.py --target aws_g4_afxdp` again.
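   A hypothetical `nodes.txt`, assuming it sits at the repository root with one VM IP per line (the addresses are placeholders):
   ```bash
   cat > $UCCL_HOME/nodes.txt <<EOF
   172.31.0.11
   172.31.0.12
   EOF
   ```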
4. Run `nccl-tests` on `VM1`:
   ```bash
   python setup_all.py --target aws_g4_afxdp
   cd $UCCL_HOME/afxdp/
   ./run_nccl_test.sh afxdp 2 <nic>
   ```
   - You should see `nccl-tests` results.
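   For example, a hypothetical invocation assuming the NIC shows up as `ens6` (check yours with `ip -br link`):
   ```bash
   ip -br link                      # list interfaces to find the NIC name
   ./run_nccl_test.sh afxdp 2 ens6  # <nic> replaced with the assumed interface
   ```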
Please refer to `README_dev.md` for development setup and testing.

UCCL is being actively developed at the UC Berkeley Sky Computing Lab. We welcome contributions from open-source developers.