-
Notifications
You must be signed in to change notification settings - Fork 655
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
FEAT-#5367: Introduce new API for repartitioning Modin objects #5366
FEAT-#5367: Introduce new API for repartitioning Modin objects #5366
Conversation
Signed-off-by: Myachev <[email protected]>
Signed-off-by: Myachev <[email protected]>
Signed-off-by: Myachev <[email protected]>
@dchigarev could you take a look? |
Signed-off-by: Myachev <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall looks good!
The only thing that I would propose to change is to move the function to modin.distributes.pandas
module, it seems that the module is more suitable for partition-like stuff.
Signed-off-by: Myachev <[email protected]>
af3ca93
to
66d97c9
Compare
Signed-off-by: Myachev <[email protected]>
Signed-off-by: Myachev <[email protected]>
Signed-off-by: Myachev <[email protected]>
Signed-off-by: Myachev <[email protected]>
_ax, lambda df: df, keep_partitioning=False | ||
) | ||
) | ||
return df.__constructor__(query_compiler=new_query_compiler) # type:ignore |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
mypy error: error: Call to untyped function "__constructor__" in typed context [no-untyped-call]
@dchigarev ready for review |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me.
Although, want to hear more opinions from @modin-project/modin-core about this new API
Returns | ||
------- | ||
DataFrame or Series | ||
The repartitioned dataframe or series, depending on the original type. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we add a section with some examples (what is bad partitioning, when one should use this function, what speed-up can re-partitioning bring) probably with a reference to according page in modin docs
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In order to be in time for the next new release, I would prefer to do it separately.
Users sometimes ask why the performance is bad, first we can ask the user how many internal partitions and offer to use this feature if necessary. By improving the performance guide, they will be able to do it without us.
A couple of comments from my side.
|
I guess I can make this function as a private method for dataframe and series, just like |
I would say that the introduced API is some kind of a temporary hack for a user's code while we're dealing with an actual issue that caused problematic partitioning in their code. P.S. another concrete case where this API helps is #5296 (compare time measurements with "default" and "proper" partitioning).
Sure, this API doesn't imply that users have to master partitioning mechanism and play it on their own to fix their performance. As I said earlier, it's supposed to be a one-line hack that improves performance significantly right now, not weeks later when we merge an actual fix for the partitioning problem that a user has faced. |
Okay, I am rather for @modin-project/modin-core, other thoughts? |
56a8f07
to
6d06e94
Compare
Signed-off-by: Myachev <[email protected]>
6d06e94
to
0b81388
Compare
@dchigarev @YarShev ready for review |
Signed-off-by: Myachev <[email protected]>
1e9017e
to
7284fe8
Compare
Signed-off-by: Myachev <[email protected]>
7284fe8
to
1ff556f
Compare
@YarShev ready for review |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We have PandasDataframePartitionManager.rebalance_partitions()
which is much more advanced, I wonder why aren't we using it here and what is the difference in this new API and that function?
It is not implemented for columns. |
Signed-off-by: Anatoly Myachev <[email protected]>
Co-authored-by: Iaroslav Igoshev <[email protected]>
Co-authored-by: Vasily Litvinov <[email protected]>
Signed-off-by: Anatoly Myachev <[email protected]>
Signed-off-by: Anatoly Myachev <[email protected]>
- run: MODIN_BENCHMARK_MODE=True pytest modin/pandas/test/internals/test_benchmark_mode.py | ||
- run: pytest modin/pandas/test/internals/test_repartition.py |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This has not been tested before when pushing, which affects the stability of codecov
results.
Signed-off-by: Anatoly Myachev <[email protected]>
Signed-off-by: Anatoly Myachev <[email protected]>
Co-authored-by: Iaroslav Igoshev <[email protected]>
Signed-off-by: Anatoly Myachev <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Signed-off-by: Myachev [email protected]
What do these changes do?
flake8 modin/ asv_bench/benchmarks scripts/doc_checker.py
black --check modin/ asv_bench/benchmarks scripts/doc_checker.py
git commit -s
docs/development/architecture.rst
is up-to-date