🚀 Add Multi-GPU Training Support #2435

Merged

Conversation

ashwinvaidya17
Collaborator

@ashwinvaidya17 ashwinvaidya17 commented Nov 25, 2024

📝 Description

Testing script

#!/bin/bash

models=("Cfa" "Cflow" "Csflow" "Dfkde" "Dfm" "Draem" "Dsr" "EfficientAd" "Fastflow" "Fre" "Ganomaly" "Padim" "Patchcore" "ReverseDistillation" "Rkde" "Stfpm" "Uflow" "VlmAd" "WinClip" "AiVad")

# Loop through each model and run the anomalib train command
for model in "${models[@]}"; do
    anomalib train --model "$model" --data MVTec --trainer.max_epochs 2 --trainer.devices 2 --trainer.strategy='ddp_find_unused_parameters_true'
done
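
The Works / Not Working lists below were filled in from runs of this script. A minimal sketch of how the same loop could record the outcome per model automatically is shown here. The `DRY_RUN` switch, the `train_model` helper, and the three-model subset are assumptions added for illustration and are not part of the PR; the `anomalib train` flags are copied from the script above.

```shell
#!/bin/bash
# Sketch only: wraps the PR's training command and sorts models by exit status.
# DRY_RUN is a hypothetical switch (not part of anomalib). It defaults to 1 so the
# script only prints each command; set DRY_RUN=0 to actually launch training.
DRY_RUN="${DRY_RUN:-1}"

# Subset of the full model list, kept short for illustration.
models=("Padim" "Patchcore" "Stfpm")
works=()
failed=()

# Hypothetical helper: build the command from the testing script and run (or print) it.
train_model() {
    local cmd=(anomalib train --model "$1" --data MVTec
               --trainer.max_epochs 2 --trainer.devices 2
               --trainer.strategy=ddp_find_unused_parameters_true)
    if [ "$DRY_RUN" = "1" ]; then
        echo "[dry-run] ${cmd[*]}"
    else
        "${cmd[@]}"
    fi
}

# A non-zero exit status from the command puts the model in the failed list.
for model in "${models[@]}"; do
    if train_model "$model"; then
        works+=("$model")
    else
        failed+=("$model")
    fi
done

echo "Works: ${works[*]}"
echo "Not working: ${failed[*]}"
```

In dry-run mode every command "succeeds", so all three models land in the Works list. With `DRY_RUN=0`, the split reflects the real exit status of each training run, which would not catch partial failures such as a broken visualization stage that does not change the exit code.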

Works

  1. CFA
  2. CFlow
  3. CSFlow
  4. Dfkde
  5. Dfm
  6. Dsr
  7. Fastflow
  8. Ganomaly
  9. Padim
  10. Patchcore
  11. ReverseDistillation
  12. Stfpm
  13. Uflow
  14. WinCLIP
  15. EfficientAd
  16. VlmAd
  17. AiVad (training works, but the visualization stage does not)
  18. Draem
  19. Fre

Not Working

✨ Changes

Select what type of change your PR is:

  • 🐞 Bug fix (non-breaking change which fixes an issue)
  • 🔨 Refactor (non-breaking change which refactors the code base)
  • 🚀 New feature (non-breaking change which adds functionality)
  • 💥 Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • 📚 Documentation update
  • 🔒 Security update

✅ Checklist

Before you submit your pull request, please make sure you have completed the following steps:

  • 📋 I have summarized my changes in the CHANGELOG and followed the guidelines for my type of change (skip for minor changes, documentation updates, and test enhancements).
  • 📚 I have made the necessary updates to the documentation (if applicable).
  • 🧪 I have written tests that support my changes and prove that my fix is effective or my feature works (if applicable).

For more information about code review checklists, see the Code Review Checklist.

Signed-off-by: Ashwin Vaidya <[email protected]>
@@ -600,17 +600,18 @@ def validate_gt_mask(mask: torch.Tensor | None) -> Mask | None:
if mask is None:
Collaborator Author

We need to revisit the docstrings for this method

@ashwinvaidya17 ashwinvaidya17 marked this pull request as ready for review December 6, 2024 13:20
@ashwinvaidya17 ashwinvaidya17 marked this pull request as draft December 6, 2024 15:23
Signed-off-by: Ashwin Vaidya <[email protected]>
@ashwinvaidya17 ashwinvaidya17 changed the title from "[WIP] Multi-GPU fixes" to "Multi-GPU fixes" Dec 9, 2024
@ashwinvaidya17 ashwinvaidya17 marked this pull request as ready for review December 9, 2024 15:12

codecov bot commented Dec 10, 2024

Codecov Report

Attention: Patch coverage is 70.58824% with 10 lines in your changes missing coverage. Please review.

Please upload report for BASE (feature/v2@c73e411). Learn more about missing BASE report.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/anomalib/data/validators/torch/video.py | 77.77% | 2 Missing ⚠️ |
| src/anomalib/metrics/evaluator.py | 77.77% | 2 Missing ⚠️ |
| ...rc/anomalib/models/video/ai_vad/lightning_model.py | 33.33% | 2 Missing ⚠️ |
| ...malib/models/components/base/memory_bank_module.py | 66.66% | 1 Missing ⚠️ |
| ...models/components/classification/kde_classifier.py | 0.00% | 1 Missing ⚠️ |
| src/anomalib/models/image/dfm/torch_model.py | 0.00% | 1 Missing ⚠️ |
| src/anomalib/models/image/dsr/anomaly_generator.py | 0.00% | 1 Missing ⚠️ |

Additional details and impacted files
@@              Coverage Diff              @@
##             feature/v2    #2435   +/-   ##
=============================================
  Coverage              ?   78.38%           
=============================================
  Files                 ?      302           
  Lines                 ?    12940           
  Branches              ?        0           
=============================================
  Hits                  ?    10143           
  Misses                ?     2797           
  Partials              ?        0           

☔ View full report in Codecov by Sentry.

Contributor

@samet-akcay samet-akcay left a comment

looking good, thanks 🔥

@ashwinvaidya17 ashwinvaidya17 merged commit 8bd06a9 into openvinotoolkit:feature/v2 Dec 10, 2024
7 checks passed
@samet-akcay samet-akcay changed the title from "Multi-GPU fixes" to "🚀 Add Multi-GPU Training Support" Dec 10, 2024
Signed-off-by: Ashwin Vaidya <[email protected]>
Signed-off-by: Ashwin Vaidya <[email protected]>
@haimat

haimat commented Dec 10, 2024

Awesome, this is great news, thanks a lot!
When do you plan to release an official update with this change?

@samet-akcay
Contributor

Our plan is to pre-release before the 20th

@haimat

haimat commented Jan 8, 2025

Hello, we are very much looking forward to multi-GPU training :)
Do you know when you can release this change?

@samet-akcay
Contributor

samet-akcay commented Jan 8, 2025

Hi, as soon as this CI passes :)
#2465

We didn't release it on the 20th of December mainly because we thought the documentation was not sufficient. We worked on the documentation during Christmas.

@blaz-r kindly added a new algorithm to v2 as well, but one of the tests is failing now. As soon as we fix the test, which we hope to sort out by the end of today, we'll release.

@haimat

haimat commented Jan 8, 2025

> Hi, as soon as this CI passes :) #2465

Awesome, thanks!

@blaz-r
Contributor

blaz-r commented Jan 8, 2025

> Hi, as soon as this CI passes :) #2465
>
> We didn't release it on the 20th of December mainly because we thought the documentation was not sufficient. We worked on the documentation during Christmas.
>
> @blaz-r kindly added a new algorithm to v2 as well, but one of the tests is failing now. As soon as we fix the test, which we hope to sort out by the end of today, we'll release.

I think I found the issue, I'm going to open a PR in a few mins. Apologies about that.

@samet-akcay
Contributor

samet-akcay commented Jan 8, 2025

> > Hi, as soon as this CI passes :) #2465
> > We didn't release it on the 20th of December mainly because we thought the documentation was not sufficient. We worked on the documentation during Christmas.
> > @blaz-r kindly added a new algorithm to v2 as well, but one of the tests is failing now. As soon as we fix the test, which we hope to sort out by the end of today, we'll release.
>
> I think I found the issue, I'm going to open a PR in a few mins. Apologies about that.

No worries @blaz-r, I think I've fixed it already. Let's wait for the test results.
76dd186

@blaz-r
Contributor

blaz-r commented Jan 8, 2025

Good @samet-akcay, that might fix it, but in case it doesn't, I also opened #2490, a PR that adds device= to those same lines.

@samet-akcay
Contributor

> Good @samet-akcay, that might fix it, but in case it doesn't, I also opened #2490, a PR that adds device= to those same lines.

Looks like it is passing
https://github.com/openvinotoolkit/anomalib/actions/runs/12670086114/job/35308989508?pr=2465

@blaz-r
Contributor

blaz-r commented Jan 8, 2025

Great! 😄

@haimat

haimat commented Jan 24, 2025

@samet-akcay Heyho, just wanted to ask about the current status of multi-GPU training.
Are you still working on it, and is there a planned release date?

@samet-akcay
Contributor

You could try it with pip install anomalib==2.0.0b2

@haimat

haimat commented Jan 24, 2025

Thanks, that is great news - I will give it a try 👍
Do you have any docs on the breaking API changes?

@haimat

haimat commented Jan 27, 2025

@samet-akcay I tried to get beta 2 up and running, but I am not sure how to update our workflow.
For example, neither the Folder nor the Engine class accepts the task argument any more.
How else do I specify the task?
Could you provide a short example of what a basic classification training run with a given folder would look like in version 2?
