Features/560 reshape speedup #1125
Conversation
Codecov Report
@@ Coverage Diff @@
## main #1125 +/- ##
==========================================
+ Coverage 91.79% 91.81% +0.01%
==========================================
Files 72 72
Lines 10497 10490 -7
==========================================
- Hits 9636 9631 -5
+ Misses 861 859 -2
These changes look fine to me.
Thank you for the PR!
Review for PR#560 "reshape speedup"
This PR contains some improvements to the performance of `reshape`. The plots indicate a significant improvement of "new" compared to "main" on up to 2 cores.
In essence, the changes of this PR can be divided into two categories: first, the `int` datatype is replaced by the more appropriate `int64` in some places; second, the actual `reshape` is modified. As far as I can judge, the code is correct and well commented. However, I would recommend further numerical experiments on more cores (if not already done) in order to ensure that the proposed idea "resplit -> reshape in place -> resplit" (as far as I understand it) also scales better than, or at least as well as, the implementation in "main".
I can run the additional scaling tests.
Hi @mrfh92, thanks for the review, I actually ran the scaling tests of this implementation. I will run comparison tests with the released
It looks like the "old" reshape is significantly slower than the "new" one (in particular for a small number of processes), but the "old" reshape scales well when increasing the number of processes; the "new" reshape does not seem to scale well (I believe that is due to the usage of resplit, which will always have a runtime proportional to the number of processes...). Nevertheless, this does not necessarily imply that the "new" reshape is not an improvement over the "old" one; on the contrary, the "new" version is so much faster for a small number of processes that it still outperforms the "old" one there. However, we should determine whether there is some break-even point at which the "old" one becomes better.
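A minimal sketch of what such a break-even scan could look like (illustrative only, not part of this PR; it assumes the same script is run once against the released Heat and once against this branch, at increasing process counts):

```python
# Hypothetical break-even scan: run with increasing process counts, e.g.
# `mpirun -n <procs> python reshape_scaling.py`, once per Heat version, and compare.
import time

from mpi4py import MPI
import heat as ht

comm = MPI.COMM_WORLD
A = ht.zeros((1000, 500000), split=1)  # distributed along a non-zero axis

comm.Barrier()  # synchronize before timing
start = time.perf_counter()
B = ht.reshape(A, (10000000, -1), new_split=1)
comm.Barrier()
elapsed = time.perf_counter() - start

if comm.rank == 0:
    print(f"heat={ht.__version__} procs={comm.size} reshape_time={elapsed:.4f}s")
```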
Thanks a lot @mrfh92 for all the tests. I went back and finished the tests I had started on the cluster, adding test runs on the release ( |
The new plots in the PR description show that my worries about possible performance degradation w.r.t. scalability were actually unjustified. Therefore, I now recommend merging if the CI runs through.
Description
While testing Heat+torch 2.0 on Apple MPS (#1053) I noticed a major bottleneck in `ht.reshape`, even on 1 process. We had been discussing this some time ago (#560, #874). The main slow-down arises in setting up the `Alltoallv` communication necessary when the input DNDarray is distributed along non-zero axes (`ht.reshape(x, new_shape, new_split)` with `x.split > 0` and potentially `new_split > 0`).
I ran some experiments and realized that resplitting in place (input split to 0, and then, if necessary, 0 to `new_split`), while inelegant, still performs much better than the current implementation. This PR implements a pragmatic solution based on `dndarray.resplit_` and `dndarray.redistribute_`.
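In rough pseudo-code, the idea boils down to something like the sketch below (simplified for illustration only; the actual implementation additionally uses `dndarray.redistribute_` and takes care of shapes and corner cases):

```python
# Simplified sketch of the "resplit -> reshape -> resplit" idea; not the actual code.
import heat as ht


def reshape_via_resplit(x, new_shape, new_split=None):
    # Move the distribution to axis 0 first (in place, i.e. the input is modified),
    # since reshaping a split-0 array avoids the expensive Alltoallv setup.
    if x.split is not None and x.split != 0:
        x.resplit_(0)
    # Reshape the split-0 array (the cheap path).
    y = ht.reshape(x, new_shape)
    # Finally move the result to the requested output distribution, if any.
    if new_split is not None and new_split != y.split:
        y.resplit_(new_split)
    return y
```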
Below are some plots from tests on my laptop - the matrices are still relatively small and 1-process performance is still better than 2 processes; I will replace these with scaling tests on HDF ML later.
UPDATED: Below are some tests run on HDF ML, CPU only, 2 CPUs per node, 12 threads per CPU.
Initial matrix: `A = ht.zeros((1000, size), split=key[0])`
Timed operation: `B = ht.reshape(A, (10000000, -1), new_split=key[1])`
Sampled sizes: `sizes = [500000*n for n in [1, 2, 4, 8, 16, 32, 64, 128, 256]]`
Sampled split combinations: `[[0, 0], [1, 1], [1, 0], [0, 1]]` (only showing plots for split combinations 0-0 and 1-1)
Minimum speed-up: ~8.5 for split 0-0, ~3 for split 1-1, which requires more communication.
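For completeness, a minimal reconstruction of the benchmark loop (only the array setup and the reshape call follow the description above; the timing and output handling are assumptions):

```python
# Reconstruction of the benchmark sweep; only the array setup and the reshape call
# are taken from the description above, the timing/printing is an assumption.
import time

from mpi4py import MPI
import heat as ht

comm = MPI.COMM_WORLD
sizes = [500000 * n for n in [1, 2, 4, 8, 16, 32, 64, 128, 256]]
split_combinations = [[0, 0], [1, 1], [1, 0], [0, 1]]

for size in sizes:
    for key in split_combinations:
        A = ht.zeros((1000, size), split=key[0])
        comm.Barrier()
        start = time.perf_counter()
        B = ht.reshape(A, (10000000, -1), new_split=key[1])
        comm.Barrier()
        if comm.rank == 0:
            print(f"size={size} splits={key} time={time.perf_counter() - start:.4f}s")
```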
Issue/s resolved: #560, #874
Changes proposed:
- replace the `Alltoallv` call with resplitting in place when necessary
Type of change
Performance enhancement
Memory requirements
In progress
Performance
In progress, also see plots
Due Diligence
Does this change modify the behaviour of other functions? If so, which?
no