samuelbohl/RelaySGD

RelaySGD

Implementation of the decentralized learning algorithm RelaySGD [1] in Bagua [2].

With Bagua installed, you can run the benchmark with:

python3 -m bagua.distributed.launch --nproc_per_node=<number of gpus> benchmark.py --algorithm relay

You can also set the learning rate, the data heterogeneity parameter, and the relay topology:

python3 -m bagua.distributed.launch --nproc_per_node=<number of gpus> benchmark.py --algorithm relay --lr <learning rate> --alpha <data heterogeneity parameter> --topology <relay topology, e.g. chain>
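
To give an intuition for what a chain topology means for relaying, here is a minimal toy sketch of the relay idea described in [1]. It is not the code in this repository: the function names `chain_neighbors` and `relay_average` are hypothetical, the node values are kept fixed (no gradient steps), and plain Python floats stand in for model parameters. Each node forwards its own value plus everything it has received from its other neighbors, together with a count of how many values the message contains.

```python
def chain_neighbors(n):
    """Neighbors of each node on a chain 0 - 1 - ... - (n-1)."""
    return {i: [j for j in (i - 1, i + 1) if 0 <= j < n] for i in range(n)}


def relay_average(values, rounds):
    """Toy relay on a chain with fixed values (assumption: no training steps).

    msg[(i, j)] is the sum node i sends to neighbor j, and cnt[(i, j)]
    is the number of node values contained in that sum.
    """
    n = len(values)
    nbrs = chain_neighbors(n)
    msg = {(i, j): 0.0 for i in nbrs for j in nbrs[i]}
    cnt = {(i, j): 0 for i in nbrs for j in nbrs[i]}

    for _ in range(rounds):
        new_msg, new_cnt = {}, {}
        for (i, j) in msg:
            # Relay to j: own value plus what arrived from all neighbors except j.
            new_msg[(i, j)] = values[i] + sum(msg[(k, i)] for k in nbrs[i] if k != j)
            new_cnt[(i, j)] = 1 + sum(cnt[(k, i)] for k in nbrs[i] if k != j)
        msg, cnt = new_msg, new_cnt

    # Each node averages its own value with everything relayed to it so far.
    return [
        (values[i] + sum(msg[(k, i)] for k in nbrs[i]))
        / (1 + sum(cnt[(k, i)] for k in nbrs[i]))
        for i in range(n)
    ]
```

In this toy setting, every node on a chain of n nodes holds the exact average after n - 1 rounds, since that is the longest distance a value has to travel. With fewer rounds, each node only averages over the nodes it has heard from.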

Experiment Evaluation

The logs folder contains the output of all the runs.

To tune the hyperparameters, modify and run the scripts hpt_relay.sh and hpt_rest.sh. Their output is saved in the logs folder as summary*.txt. The final_run.sh script runs the experiment shown below on 8 GPUs, using the best learning rates found.

The second experiment evaluates the throughput of the different algorithms and is run with synth_benchmark_run.sh.
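
As a rough illustration of what a learning-rate sweep over the benchmark could look like, the sketch below builds one launch command per learning rate. It does not reproduce the contents of hpt_relay.sh or hpt_rest.sh: the helper name `sweep_commands` and the example learning rates are assumptions, and only flags shown in the commands above are used.

```python
def sweep_commands(learning_rates, nproc_per_node, algorithm="relay"):
    """Build one benchmark launch command (as an argv list) per learning rate.

    Hypothetical helper for illustration; the actual tuning scripts in this
    repository may be organized differently.
    """
    return [
        [
            "python3", "-m", "bagua.distributed.launch",
            f"--nproc_per_node={nproc_per_node}",
            "benchmark.py",
            "--algorithm", algorithm,
            "--lr", str(lr),
        ]
        for lr in learning_rates
    ]


if __name__ == "__main__":
    # Example values only; pick the grid that suits your experiment.
    for cmd in sweep_commands([0.1, 0.01], nproc_per_node=8):
        print(" ".join(cmd))
```

Each returned list can be passed to `subprocess.run(cmd, check=True)` to launch the runs one after another.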

CIFAR10 - VGG11

Comparing the decentralized algorithm in Bagua with RelaySGD

RelaySGD vs Allreduce

Comparing different topologies of RelaySGD

Footnotes

  1. https://doi.org/10.48550/arXiv.2110.04175

  2. https://github.com/BaguaSys/bagua/tree/master
