Adding Imagenet Example #680

Merged: 11 commits, Nov 8, 2023

Conversation

PareesaMS
Contributor

This example enables DeepSpeed for training a set of popular model architectures on the ImageNet dataset. The models include ResNet, AlexNet, and VGG; the
baseline implementation can be found in the PyTorch examples GitHub repository. Enabling DeepSpeed makes it easy to
run the code in a distributed manner and to apply fp16 mixed-precision training together with ZeRO stage 1 memory reduction.
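
For readers unfamiliar with how these two features are switched on, here is a minimal sketch of a DeepSpeed-style configuration with fp16 and ZeRO stage 1 enabled. The keys `train_batch_size`, `fp16.enabled`, and `zero_optimization.stage` are standard DeepSpeed configuration keys; the helper name `build_ds_config` and the batch size of 256 are assumptions made for illustration and are not taken from this PR.

```python
import json


def build_ds_config(train_batch_size=256, use_fp16=True, zero_stage=1):
    """Build a DeepSpeed-style config dict.

    The default values are illustrative assumptions, not the settings
    used in this PR.
    """
    return {
        # Global batch size across all GPUs (assumed value).
        "train_batch_size": train_batch_size,
        # Turn on fp16 mixed-precision training.
        "fp16": {"enabled": use_fp16},
        # ZeRO stage 1 partitions optimizer states across data-parallel ranks.
        "zero_optimization": {"stage": zero_stage},
    }


ds_config = build_ds_config()
print(json.dumps(ds_config, indent=2))
```

A dict like this can typically be saved as a JSON file for the DeepSpeed launcher or passed through the `config` argument of `deepspeed.initialize`; how the example in this PR actually wires it up may differ.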

## DeepSpeed Optimizations

By applying fp16 mixed precision and ZeRO stage 1 memory optimization, we were able to reduce the required memory. The table below summarizes the results of running ResNet-50 on one
node with 16 V100 GPUs:
Contributor

on a DGX-1 node (with 16 V100 GPUs)

Contributor Author

Fixed it

------------------|-------------------

Furthermore, the memory optimization had no adverse impact on accuracy, a point illustrated by the graph below.
![resnet-plot](C:\Users\pagolnar\OneDrive - Microsoft\Reports-presentations\Resnet-plot)
Contributor

the image link is wrong.

Contributor Author

Fixed it

Baseline | ? | -
Baseline with DS activated | 1.66 | -
DS + fp16 | 1.04 | ?
DS + fp16 + ZeRO 1 | 0.81 | ?
Contributor

besides memory, how about the training speed

Contributor Author

Fixed the table. Did not measure the training speed. Should I repeat the experiments?

The ImageNet dataset is large and time-consuming to download. To get started quickly, run `main.py` on dummy data by passing `--dummy`. This is also useful for benchmarking training speed. Note that the loss and accuracy values are meaningless in this case.

```bash
python main.py -a resnet18 --dummy
```
Contributor

where is deepspeed?

Contributor Author

Fixed it

@@ -0,0 +1,2 @@
torch
Contributor

deepspeed is also a requirement?

Contributor Author

Definitely. Fixed the issue

Baseline | ? | -
Baseline with DS activated | 1.66 | -
DS + fp16 | 1.04 | ?
DS + fp16 + ZeRO 1 | 0.81 | ?
Contributor

table format is not correct. take a look at rendered website

Contributor Author

Fixed it

@mrwyattii mrwyattii merged commit ccb2a34 into master Nov 8, 2023