Switch to (Re)TestItems #262
We can guarantee these test images will always be available, which is not the case for the current sample image.
Still needs a bit of work for GPU CI on 1.6 and nightly (possibly disabling the latter for now), but this is mostly good to go. Some timings:
It appears we spend a lot of time compiling, as evidenced by the large time savings when similar models are run one after another. ViTs are an outlier despite their relative runtime slowness because they use the (type unstable under AD) Vector
During my GSoC, we explored this and I noticed that, when training, the Vector Chain gave me extremely bumpy loss curves, which is one of the reasons we removed them going from 0.7 to 0.8. I think a lot of this can come back gradually if we do more training runs to isolate the exact problem.
With the renewed interest in #198 (comment), now may be the time to revisit what's causing these mysterious instabilities during training. Shall we continue the discussion there?
Co-authored-by: Kyle Daruwalla <[email protected]>
`reclaim` to load the CUDA driver and fails otherwise
50% per worker so we avoid
Ok, Buildkite is happy and so am I. This should be good to go. We should now have a pretty good picture of what works and what doesn't on GPU too!
The impetus for this PR was twofold:
Along the way, I found some additional changes which could either be tackled here or in a follow-up PR:
WideResNet
on GHA (fixed)
My feeling is that we'd want to set aside a subset of faster tests for 1.6/nightly/GPU CI, perhaps the smallest variant of each model. Then we can decrease our overall runtime while expanding our version matrix to cover everything we probably should've been covering.
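As a rough illustration of the "smallest variant of each model" idea, here is a minimal sketch using only Julia's `Test` standard library. The `MODEL_VARIANTS` table, the `METALHEAD_TEST_FAST` environment variable, and the `variants_to_test` helper are all hypothetical names made up for this example; they are not part of this PR or of the package's test suite.

```julia
using Test

# Hypothetical registry: model family => variants to test, smallest first.
# The entries are illustrative and not taken from the actual test suite.
const MODEL_VARIANTS = Dict(
    "ResNet" => [18, 34, 50, 101, 152],
    "WideResNet" => [50, 101],
)

# Hypothetical switch: CI jobs for 1.6/nightly/GPU would set this to "true".
fast_only() = get(ENV, "METALHEAD_TEST_FAST", "false") == "true"

# Keep only the smallest variant per family in fast mode, everything otherwise.
variants_to_test(variants; fast = fast_only()) =
    Dict(name => (fast ? [minimum(v)] : v) for (name, v) in variants)

@testset "variant selection" begin
    sel = variants_to_test(MODEL_VARIANTS; fast = true)
    @test sel["ResNet"] == [18]
    @test sel["WideResNet"] == [50]
    @test variants_to_test(MODEL_VARIANTS; fast = false) == MODEL_VARIANTS
end
```

The full matrix would then run only on the primary CI job, while the other jobs iterate over the reduced selection. The same split could presumably be expressed with test-item tags instead of an environment variable.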
PR Checklist