
Features/560 reshape speedup #1125

Merged
merged 14 commits into main, May 22, 2023

Conversation

@ClaudiaComito (Contributor) commented Mar 20, 2023

Description

While testing Heat with torch 2.0 on Apple MPS (#1053), I noticed a major bottleneck in ht.reshape, even on 1 process. We had discussed this some time ago in #560 and #874.

The main slowdown arises in setting up the Alltoallv communication needed when the input DNDarray is distributed along a non-zero axis (ht.reshape(x, new_shape, new_split) with x.split > 0 and potentially new_split > 0).

I ran some experiments and found that resplitting in place (input split to 0, then, if necessary, 0 to new_split), while inelegant, still performs much better than the current implementation. This PR implements a pragmatic solution based on dndarray.resplit_ and dndarray.redistribute_.
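
As a rough illustration, here is a minimal sketch of that strategy (not the PR's actual code, which lives inside ht.reshape and additionally uses dndarray.redistribute_ to balance local chunks; the function name and structure below are hypothetical):

```python
import heat as ht

# Minimal sketch of the resplit-based strategy described above; an
# illustration only, not the PR's actual implementation.
def reshape_via_resplit(x: ht.DNDarray, new_shape, new_split) -> ht.DNDarray:
    # Move the input to split=0 first: reshaping a split-0 array avoids the
    # expensive Alltoallv setup. Note that resplit_ modifies x in place.
    if x.split is not None and x.split != 0:
        x.resplit_(0)
    # Reshape along split 0, then move the result to the requested split.
    y = ht.reshape(x, new_shape)
    if new_split is not None and y.split != new_split:
        y.resplit_(new_split)
    return y
```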

Below are some plots from tests on my laptop. The matrices are still relatively small and 1-process performance still beats 2 processes; I will replace these with scaling tests on HDF ML later.

UPDATED

Below are tests run on HDF ML, CPU only, 2 CPUs per node, 12 threads per CPU.

  • Initial matrix: A = ht.zeros((1000, size), split=key[0])
  • Timed operation: B = ht.reshape(A, (10000000, -1), new_split=key[1])
  • Sampled sizes: sizes = [500000*n for n in [1, 2, 4, 8, 16, 32, 64, 128, 256]]
  • Sampled split combinations [key[0], key[1]]: [[0, 0], [1, 1], [1, 0], [0, 1]] (only showing plots for split combinations 0-0 and 1-1)
  • Minimum speed-up: ~8.5 for split 0-0; ~3 for split 1-1, which requires more communication.
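
For context, a minimal sketch of how one such timing could be taken (a hypothetical script, not the one used for the plots shown below; assumes heat and mpi4py, launched under MPI with e.g. srun or mpirun):

```python
import time
import heat as ht
from mpi4py import MPI

# Hypothetical timing script for a single (size, split-combination) sample;
# not the script used to produce the plots below.
size = 500000 * 8   # one of the sampled sizes
key = [0, 0]        # [input split, new_split]

A = ht.zeros((1000, size), split=key[0])

MPI.COMM_WORLD.Barrier()            # synchronize before starting the clock
t0 = time.perf_counter()
B = ht.reshape(A, (10000000, -1), new_split=key[1])
MPI.COMM_WORLD.Barrier()            # wait until all processes are done
elapsed = time.perf_counter() - t0

if MPI.COMM_WORLD.rank == 0:
    print(f"reshape, size={size}, splits={key}: {elapsed:.4f} s")
```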

(Plots: reshape_weak, reshape_strong_8x, reshape_strong_64x)

Issue/s resolved: #560, #874

Changes proposed:

  • replaced the Alltoallv call with in-place resplitting where necessary
  • updated documentation

Type of change

Performance enhancement

Memory requirements

In progress

Performance

In progress, also see plots

Due Diligence

  • All split configurations tested
  • Multiple dtypes tested in relevant functions
  • Documentation updated (if needed)
  • Title of PR is suitable for corresponding CHANGELOG entry

Does this change modify the behaviour of other functions? If so, which?

no

codecov bot commented Mar 20, 2023

Codecov Report

Merging #1125 (9bf8442) into main (9f6b2eb) will increase coverage by 0.01%.
The diff coverage is 100.00%.

❗ Current head 9bf8442 differs from pull request most recent head d4a5ede. Consider uploading reports for the commit d4a5ede to get more accurate results.

@@            Coverage Diff             @@
##             main    #1125      +/-   ##
==========================================
+ Coverage   91.79%   91.81%   +0.01%     
==========================================
  Files          72       72              
  Lines       10497    10490       -7     
==========================================
- Hits         9636     9631       -5     
+ Misses        861      859       -2     
Flag   Coverage Δ
unit   91.81% <100.00%> (+0.01%) ⬆️

Flags with carried forward coverage won't be shown.

Impacted Files               Coverage Δ
heat/core/dndarray.py        96.91% <ø> (ø)
heat/core/manipulations.py   99.01% <100.00%> (+0.14%) ⬆️



@github-advanced-security

You have successfully added a new CodeQL configuration /language:python. As part of the setup process, we have scanned this repository and found no existing alerts. In the future, you will see all code scanning alerts on the repository Security tab.

@mrfh92 (Collaborator) previously approved these changes Mar 23, 2023

These changes look fine to me.

@ClaudiaComito ClaudiaComito added this to the 1.3.0 milestone Mar 29, 2023
@ClaudiaComito ClaudiaComito changed the base branch from release/1.2.x to main March 29, 2023 10:13
@ClaudiaComito ClaudiaComito dismissed mrfh92’s stale review March 29, 2023 10:13

The base branch was changed.


@mrfh92 mrfh92 self-assigned this Apr 17, 2023

@mrfh92 (Collaborator) left a comment

Review for PR #1125 "Features/560 reshape speedup"

This PR contains some performance improvements for reshape. The plots indicate a significant improvement of "new" over "main" on up to 2 cores.

In essence, the changes in this PR can be divided into two categories: first, the int datatype is replaced by the more appropriate int64 in some places; second, the actual reshape is modified. As far as I can judge, the code is correct and well commented. However, I would recommend further numerical experiments on more cores (if not already done) to ensure that the proposed idea "resplit -> reshape in place -> resplit" (as far as I understand it) also scales better than, or at least as well as, the implementation in "main".
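
(For illustration only — a hypothetical example of the kind of int → int64 issue mentioned above, not the actual diff: element offsets into large distributed arrays can exceed the 32-bit integer range, so they should be computed in int64.)

```python
import torch

# Hypothetical illustration (not the PR's actual diff): element offsets of a
# large chunk can exceed 2**31 - 1, so index arithmetic needs 64-bit integers.
lshape = (10_000_000, 512)  # local chunk shape
row_offsets = torch.arange(lshape[0], dtype=torch.int64) * lshape[1]
# The last offset is ~5.1e9, well beyond the int32 range:
assert row_offsets[-1] == (lshape[0] - 1) * lshape[1]
```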

@mrfh92 (Collaborator) commented Apr 18, 2023

I can run the additional scaling tests.

@ClaudiaComito (Contributor, Author) commented Apr 18, 2023

Hi @mrfh92, thanks for the review. I actually already ran the scaling tests for this implementation; I will run comparison tests against the released reshape (main branch) as well.

@mrfh92 (Collaborator) commented Apr 18, 2023

Strong scaling for a short-fat matrix and split=0,1

(Plots: reshape_main_cpu, reshape_main_gpu, reshape_new_cpu, reshape_new_gpu)

Strong scaling for different shapes from short-fat to tall-skinny (split=1 only)

(Plots: reshape_main_cpu, reshape_main_gpu, reshape_new_cpu, reshape_new_gpu)

@mrfh92 (Collaborator) commented Apr 18, 2023

It looks like the "old" reshape is significantly slower than the "new" one (in particular for a small number of processes), but the "old" reshape scales well as the number of processes increases; the "new" reshape does not seem to scale as well (I believe that is due to the use of resplit, which will always have runtime proportional to the number of processes...)

Nevertheless, this does not necessarily imply that the "new" reshape is not an improvement over the "old" one; on the contrary, the "new" version is so much faster for small numbers of processes that it still outperforms the "old" one there. However, we should determine whether there is a break-even point at which the "old" one becomes better.

@mrfh92 mrfh92 assigned ClaudiaComito and unassigned mrfh92 Apr 19, 2023
@ClaudiaComito (Contributor, Author) replied to @mrfh92's comment above:

Thanks a lot @mrfh92 for all the tests. I went back and finished the tests I had started on the cluster, adding test runs with the released (main branch) version of reshape. I've updated the PR description with plots and more details on the experiments.



@mrfh92 mrfh92 self-requested a review May 8, 2023 08:36

@mrfh92 (Collaborator) left a comment

The new plots in the PR description show that my worries about possible performance degradation w.r.t. scalability were unjustified. Therefore, I now recommend merging once the CI passes.

@ClaudiaComito ClaudiaComito merged commit 8abbc16 into main May 22, 2023
Development: successfully merging this pull request may close issue #560 ("reshape speedup").

3 participants