This repository has been archived by the owner on Feb 25, 2024. It is now read-only.

Distributed Training Examples & Scalability Benchmarks #11

Open
avik-pal opened this issue Feb 2, 2022 · 3 comments


avik-pal (Owner) commented Feb 2, 2022

Currently, FluxMPI has only one example. It would be good to showcase training of more image models from Metalhead -- ViT (FluxML/Metalhead.jl#105), ResNets, etc. -- and also to benchmark their scaling across multiple GPUs.

@dnabanita7

I am not sure if this is the right place to raise it, and it is only vaguely related to this issue, but for benchmarks, can I suggest something along the lines of the mlpack benchmarks? I really like how they use valgrind for memory benchmarking and profiling, SQLite to store results, etc. Comparing against other ML libraries would give a better picture of Flux and why to use it over alternatives.
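
For illustration, the SQLite part could look something like the sketch below (using SQLite.jl and DBInterface.jl); the database file, table schema, and recorded values are invented for this example, not an agreed-upon format.

```julia
# Hypothetical sketch: append one benchmark result per run to a local
# SQLite database. Schema and values are placeholders.
using SQLite, DBInterface, Dates

db = SQLite.DB("benchmarks.db")
DBInterface.execute(db, """
    CREATE TABLE IF NOT EXISTS results (
        model        TEXT,
        workers      INTEGER,
        imgs_per_sec REAL,
        timestamp    TEXT
    )
""")

# Record one (made-up) run.
DBInterface.execute(db,
    "INSERT INTO results VALUES (?, ?, ?, ?)",
    ("ResNet18", 4, 1234.5, string(now())))
```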

@avik-pal changed the title from "Distributed Training Examples/Benchmarks" to "Distributed Training Examples & Scalability Benchmarks" on Feb 2, 2022

avik-pal (Owner, Author) commented Feb 2, 2022

I think that might be more relevant for FluxBench. Here I mainly want to test scalability across GPUs, along the lines of the Horovod benchmarks: https://horovod.readthedocs.io/en/stable/benchmarks.html
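
A benchmark in that style mostly amounts to timing synchronized training steps and reporting aggregate throughput as workers are added. A rough sketch with plain MPI.jl (the batch size, step count, and `step!` stand-in are placeholders, not an actual FluxMPI harness):

```julia
# Hedged sketch of a Horovod-style throughput measurement.
using MPI, Printf

MPI.Init()
comm = MPI.COMM_WORLD
rank = MPI.Comm_rank(comm)
nworkers = MPI.Comm_size(comm)

batch_size = 32          # placeholder values
nbatches = 100
step!() = sleep(0.01)    # stand-in for one forward/backward/update step

step!()                  # warm-up before timing
MPI.Barrier(comm)
t = @elapsed for _ in 1:nbatches
    step!()
end
imgs_per_sec = batch_size * nbatches / t

# Sum per-worker throughput on rank 0; scaling efficiency is this total
# divided by (single-worker throughput * nworkers).
total = MPI.Reduce(imgs_per_sec, +, 0, comm)
rank == 0 && @printf("%d workers: %.1f imgs/sec total\n", nworkers, total)
```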

@CarloLucibello

Can I ask for a minimal example without FastAI.jl? E.g., I'd like to see how this script would need to change for distributed training:
https://github.com/FluxML/model-zoo/blob/master/vision/vgg_cifar10/vgg_cifar10.jl
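
Not a definitive answer, but the changes usually reduce to three things: initialize MPI, shard the data by rank, and average gradients every step. A minimal sketch with plain MPI.jl and Flux; the model and data below are small stand-ins for the VGG/CIFAR-10 pieces of that script, and only the MPI-related lines are the point:

```julia
using Flux, MPI

MPI.Init()
comm = MPI.COMM_WORLD
rank = MPI.Comm_rank(comm)
nworkers = MPI.Comm_size(comm)

# Stand-ins for the real dataset and model.
X = rand(Float32, 32, 32, 3, 1024)
Y = Flux.onehotbatch(rand(0:9, 1024), 0:9)
model = Chain(Flux.flatten, Dense(32 * 32 * 3 => 10))

# 1. Shard the data: each rank trains on every `nworkers`-th sample.
idx = (rank + 1):nworkers:size(X, 4)
loader = Flux.DataLoader((X[:, :, :, idx], Y[:, idx]); batchsize = 64)

# 2. Start all workers from identical parameters.
ps = Flux.params(model)
foreach(p -> MPI.Bcast!(p, 0, comm), ps)

# A common heuristic is to scale the learning rate with the worker count.
opt = Flux.ADAM(3.0f-4 * nworkers)

# 3. Average gradients across workers before each update.
for (x, y) in loader
    gs = gradient(() -> Flux.logitcrossentropy(model(x), y), ps)
    for p in ps
        g = gs[p]
        g === nothing && continue
        MPI.Allreduce!(g, +, comm)   # in-place sum across workers
        g ./= nworkers               # turn the sum into an average
    end
    Flux.Optimise.update!(opt, ps, gs)
end
```

Launched with `mpiexec -n 4 julia script.jl` (or MPI.jl's `mpiexecjl` wrapper), each rank then trains on its own shard while keeping parameters in sync.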
