Optimize run_spin_excitation! for GPU #462
Codecov Report
All modified and coverable lines are covered by tests ✅
Additional details and impacted files
@@ Coverage Diff @@
## master #462 +/- ##
==========================================
+ Coverage 90.91% 90.94% +0.03%
==========================================
Files 53 53
Lines 2916 2926 +10
==========================================
+ Hits 2651 2661 +10
Misses 265 265
KomaMRI Benchmarks
Benchmark suite | Current: 18bcaee | Previous: 1457a4c | Ratio |
---|---|---|---|
MRI Lab/Bloch/CPU/2 thread(s) | 243027991 ns | 227517325.5 ns | 1.07 |
MRI Lab/Bloch/CPU/4 thread(s) | 135522120 ns | 135033124 ns | 1.00 |
MRI Lab/Bloch/CPU/8 thread(s) | 144774394.5 ns | 171880824 ns | 0.84 |
MRI Lab/Bloch/CPU/1 thread(s) | 408151458 ns | 396561930.5 ns | 1.03 |
MRI Lab/Bloch/GPU/CUDA | 57005066.5 ns | 138134905 ns | 0.41 |
MRI Lab/Bloch/GPU/oneAPI | 527703085 ns | 14155999496.5 ns | 0.037 |
MRI Lab/Bloch/GPU/Metal | 543126542 ns | 3171338479 ns | 0.17 |
MRI Lab/Bloch/GPU/AMDGPU | 36831327 ns | 75482754 ns | 0.49 |
Slice Selection 3D/Bloch/CPU/2 thread(s) | 1016083368 ns | 1168211452 ns | 0.87 |
Slice Selection 3D/Bloch/CPU/4 thread(s) | 619287647 ns | 612565463 ns | 1.01 |
Slice Selection 3D/Bloch/CPU/8 thread(s) | 385912553 ns | 495427593 ns | 0.78 |
Slice Selection 3D/Bloch/CPU/1 thread(s) | 2252333331 ns | 2245843835 ns | 1.00 |
Slice Selection 3D/Bloch/GPU/CUDA | 101397562.5 ns | 108701927 ns | 0.93 |
Slice Selection 3D/Bloch/GPU/oneAPI | 662296008 ns | 776956866 ns | 0.85 |
Slice Selection 3D/Bloch/GPU/Metal | 564139375 ns | 769082459 ns | 0.73 |
Slice Selection 3D/Bloch/GPU/AMDGPU | 60677723 ns | 64232156 ns | 0.94 |
This comment was automatically generated by workflow using github-action-benchmark.
This pull request adds a GPU-optimized implementation of run_spin_excitation! to BlochGPU.jl. Unlike the function in BlochSimple.jl, it precomputes every quantity that can be calculated ahead of time for all time points and stores the results in the preallocated matrices Bz, B, φ, ΔT1, and ΔT2. The remaining sequential calculations run inside a kernel, apply_excitation!, which uses shared memory for all repeated memory accesses. The kernel also carries out the real / imaginary number math from Magnetization.jl directly, so that the shared memory arrays store successive 32-bit values, which I think is ideal for avoiding bank conflicts (https://developer.nvidia.com/blog/using-shared-memory-cuda-cc/).
The tests pass for Metal on my computer, and I've seen the excitation-heavy MRI Lab benchmark speed up by about 5x. Hopefully the same will be true for the other backends!
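For readers who want a feel for the two-phase structure described above, here is a minimal CPU-only sketch in plain Julia. It is not the code from this PR. The function names `precompute` and `excite!` are hypothetical, the definitions of φ, ΔT1, and ΔT2 (including the sign of φ) are assumptions, and the rotation uses Rodrigues' formula on real components rather than KomaMRI's spinor math. There is no GPU kernel or shared memory here; the sketch only illustrates precomputing per-time-step matrices and then running the order-dependent update on separate 32-bit real values.

```julia
# Hypothetical, simplified sketch. Names and formulas are illustrative assumptions.

# Phase 1: quantities that do not depend on the evolving magnetization are
# computed up front for every (spin, time step) pair.
function precompute(B1x, B1y, Bz, Δt, T1, T2; γ = 42.58f6)
    B   = sqrt.(B1x' .^ 2 .+ B1y' .^ 2 .+ Bz .^ 2)  # field magnitude, Ns x Nt
    φ   = Float32(-2π) .* γ .* B .* Δt'             # rotation angle (sign assumed)
    ΔT1 = exp.(-Δt' ./ T1)                          # assumed T1 recovery factor
    ΔT2 = exp.(-Δt' ./ T2)                          # assumed T2 decay factor
    return B, φ, ΔT1, ΔT2
end

# Phase 2: the time loop must run in order, but spins are independent.
# On a GPU the outer loop would map to one thread per spin.
function excite!(Mx, My, Mz, M0, B1x, B1y, Bz, B, φ, ΔT1, ΔT2)
    Ns, Nt = size(Bz)
    for i in 1:Ns
        mx, my, mz = Mx[i], My[i], Mz[i]
        for t in 1:Nt
            if B[i, t] > 0
                # Unit rotation axis along the effective field.
                nx, ny, nz = B1x[t] / B[i, t], B1y[t] / B[i, t], Bz[i, t] / B[i, t]
                s, c = sincos(φ[i, t])
                d = (nx * mx + ny * my + nz * mz) * (1 - c)
                # Rodrigues' rotation: M*cos + (n x M)*sin + n*(n . M)*(1 - cos)
                rx = mx * c + (ny * mz - nz * my) * s + nx * d
                ry = my * c + (nz * mx - nx * mz) * s + ny * d
                rz = mz * c + (nx * my - ny * mx) * s + nz * d
                mx, my, mz = rx, ry, rz
            end
            # Relaxation using the precomputed factors.
            mx *= ΔT2[i, t]
            my *= ΔT2[i, t]
            mz = mz * ΔT1[i, t] + M0[i] * (1 - ΔT1[i, t])
        end
        Mx[i], My[i], Mz[i] = mx, my, mz
    end
    return nothing
end

# Usage: one spin, a 1 ms pulse along x sized for a 90-degree flip,
# with relaxation times long enough to be negligible.
Nt  = 100
Δt  = fill(1f-5, Nt)
B1x = fill(Float32(1 / (4 * 42.58e6 * 1e-3)), Nt)
B1y = zeros(Float32, Nt)
Bz  = zeros(Float32, 1, Nt)
T1, T2 = [1f9], [1f9]
B, φ, ΔT1, ΔT2 = precompute(B1x, B1y, Bz, Δt, T1, T2)
Mx, My, Mz, M0 = [0f0], [0f0], [1f0], [1f0]
excite!(Mx, My, Mz, M0, B1x, B1y, Bz, B, φ, ΔT1, ΔT2)
```

With these inputs the magnetization should end up almost entirely along y. Keeping Mx, My, and Mz as separate Float32 arrays mirrors the idea of storing successive 32-bit values instead of interleaved complex numbers.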