FEAT-#5367: Introduce new API for repartitioning Modin objects #5366

Merged (21 commits) on Dec 10, 2022

Collaborator

@anmyachev anmyachev commented Dec 6, 2022

Signed-off-by: Myachev [email protected]

What do these changes do?

  • first commit message and PR title follow format outlined here

    NOTE: If you edit the PR title to match this format, you need to add another commit (even if it's empty) or amend your last commit for the CI job that checks the PR title to pick up the new PR title.

  • passes flake8 modin/ asv_bench/benchmarks scripts/doc_checker.py
  • passes black --check modin/ asv_bench/benchmarks scripts/doc_checker.py
  • signed commit with git commit -s
  • Resolves FEAT: create API for repartitioning Modin objects #5367
  • tests added and passing
  • module layout described at docs/development/architecture.rst is up-to-date

@anmyachev anmyachev changed the title FEAT-#0000: introduce repartition api for modin.utils FEAT-#5367: introduce new API for repartitioning Modin objects Dec 6, 2022
Signed-off-by: Myachev <[email protected]>
@anmyachev
Collaborator Author

@dchigarev could you take a look?

Signed-off-by: Myachev <[email protected]>
Collaborator

@dchigarev dchigarev left a comment

Overall looks good!

The only thing I would propose changing is to move the function to the modin.distributed.dataframe.pandas module; that module seems more suitable for partition-related functionality.

@anmyachev anmyachev force-pushed the add-repartition-api branch from af3ca93 to 66d97c9 Compare December 6, 2022 20:33
Signed-off-by: Myachev <[email protected]>
Signed-off-by: Myachev <[email protected]>
Signed-off-by: Myachev <[email protected]>
Signed-off-by: Myachev <[email protected]>
_ax, lambda df: df, keep_partitioning=False
)
)
return df.__constructor__(query_compiler=new_query_compiler) # type:ignore
Collaborator Author

mypy error: error: Call to untyped function "__constructor__" in typed context [no-untyped-call]
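
For context, here is a minimal sketch of how this class of mypy error arises and how an ignore can be scoped to a single error code. The names untyped_constructor and rebuild are hypothetical stand-ins, not Modin code, and the error is only reported when mypy runs with --disallow-untyped-calls (which --strict enables).

```python
from typing import Any


def untyped_constructor(query_compiler):  # no annotations: mypy treats this as untyped
    return {"query_compiler": query_compiler}


def rebuild(query_compiler: Any) -> Any:
    # Calling an unannotated function from an annotated one is reported as
    # [no-untyped-call]. Naming the error code in the ignore keeps every other
    # check on this line active.
    return untyped_constructor(query_compiler)  # type: ignore[no-untyped-call]


result = rebuild("qc")
print(result)
```

A bare `# type: ignore` also silences the error, but it hides any other problem mypy might find on the same line.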

@anmyachev anmyachev marked this pull request as ready for review December 6, 2022 21:51
@anmyachev anmyachev requested a review from a team as a code owner December 6, 2022 21:51
@anmyachev
Collaborator Author

@dchigarev ready for review

dchigarev previously approved these changes Dec 7, 2022
Collaborator

@dchigarev dchigarev left a comment

Looks good to me.

Although I would like to hear more opinions from @modin-project/modin-core about this new API.

Comment on lines 267 to 270
Returns
-------
DataFrame or Series
The repartitioned dataframe or series, depending on the original type.
Collaborator

Can we add a section with some examples (what bad partitioning is, when one should use this function, what speed-up repartitioning can bring), probably with a reference to the corresponding page in the Modin docs?

Collaborator Author

To be in time for the next release, I would prefer to do that separately.

Users sometimes ask why performance is bad. For now, we can first ask the user how many internal partitions their object has and suggest this feature if necessary. Once the performance guide is improved, they will be able to do this without us.
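
To make "how many internal partitions" concrete, the following is a toy, pure-Python sketch of the kind of check one might run on a list of partition row counts. It does not call Modin; the helper names, the example layout, and the choice of 8 ideal partitions are all assumptions made for illustration. Both helpers assume a non-empty frame.

```python
from typing import List


def balanced_lengths(total_rows: int, n_parts: int) -> List[int]:
    """Split total_rows into n_parts sizes that differ by at most one row."""
    base, extra = divmod(total_rows, n_parts)
    return [base + 1 if i < extra else base for i in range(n_parts)]


def imbalance_ratio(lengths: List[int], n_ideal: int) -> float:
    """Largest actual partition divided by the largest ideal partition."""
    ideal = balanced_lengths(sum(lengths), n_ideal)
    return max(lengths) / max(ideal)


# Hypothetical layout: one large partition followed by many tiny ones.
lengths = [9_000] + [10] * 100
n_partitions = len(lengths)
ratio = imbalance_ratio(lengths, n_ideal=8)
print(n_partitions, ratio)
```

A partition count far from the ideal one, or a ratio well above 1, suggests that work is unevenly spread and that repartitioning may be worth trying.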

@YarShev
Collaborator

YarShev commented Dec 7, 2022

A couple of comments from my side.

  1. Is there an example where we would benefit with this new API?
  2. from_partitions and unwrap_partitions were originally added for those developing ML stuff, so that they can retrieve the underlying partitions and pass them to the ML parts. I am not sure that the new repartition API is nice for those who use the pandas API and want to move to Modin. That would make it difficult to move from pandas to Modin.
  3. I suppose repartitioning should happen transparently to users.

@anmyachev
Collaborator Author

anmyachev commented Dec 7, 2022

  1. This should help in the case of "MultiIndex takes up a huge amount of storage space" (#5247).
  2. Are you suggesting moving it to another location? Any suggestions where?
  3. We don't have implicit repartitioning along the column axis yet. Also, if we implement it the same way as for the row axis, I'm not sure it will help in the case described in #5247.

I guess I can make this function a private method of DataFrame and Series, just like _to_pandas.
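
As a toy illustration of what repartitioning along the column axis means, the sketch below merges existing column partitions and re-splits them evenly while keeping column order. It works on plain lists of column names rather than Modin partitions; split_evenly and repartition_columns are hypothetical helpers, not the PR's implementation.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def split_evenly(items: Sequence[T], n_parts: int) -> List[List[T]]:
    """Cut items into n_parts contiguous chunks whose sizes differ by at most one."""
    base, extra = divmod(len(items), n_parts)
    chunks: List[List[T]] = []
    start = 0
    for i in range(n_parts):
        stop = start + base + (1 if i < extra else 0)
        chunks.append(list(items[start:stop]))
        start = stop
    return chunks


def repartition_columns(column_parts: List[List[str]], n_parts: int) -> List[List[str]]:
    """Merge existing column partitions, then re-split them evenly, keeping order."""
    all_columns = [name for part in column_parts for name in part]
    return split_evenly(all_columns, n_parts)


# Hypothetical layout: six columns, each alone in its own partition.
narrow = [["a"], ["b"], ["c"], ["d"], ["e"], ["f"]]
wide = repartition_columns(narrow, 2)
print(wide)
```

The same merge-then-split idea applies to the row axis; only the direction in which the blocks are concatenated changes.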

@dchigarev
Collaborator

dchigarev commented Dec 7, 2022

@YarShev

  1. Is there an example where we would benefit with this new API?

I would say that the introduced API is some kind of a temporary hack for a user's code while we're dealing with an actual issue that caused problematic partitioning in their code.

P.S. another concrete case where this API helps is #5296 (compare time measurements with "default" and "proper" partitioning).

  3. I suppose repartitioning should happen transparently to users.

Sure, this API doesn't imply that users have to master the partitioning mechanism and tune it on their own to fix their performance. As I said earlier, it's supposed to be a one-line hack that improves performance significantly right now, not weeks later when we merge an actual fix for the partitioning problem that the user has faced.

@YarShev
Collaborator

YarShev commented Dec 7, 2022

Okay, I am rather in favor of <df/s>._repartition() as a temporary solution for the mentioned problems. Can you also create an issue for the partitioning problem, so that repartitioning happens transparently to users where necessary?

@modin-project/modin-core, other thoughts?

@anmyachev anmyachev force-pushed the add-repartition-api branch 3 times, most recently from 56a8f07 to 6d06e94 Compare December 7, 2022 13:52
@anmyachev anmyachev force-pushed the add-repartition-api branch from 6d06e94 to 0b81388 Compare December 7, 2022 13:53
@anmyachev
Collaborator Author

@dchigarev @YarShev ready for review

@anmyachev anmyachev force-pushed the add-repartition-api branch from 1e9017e to 7284fe8 Compare December 7, 2022 15:05
@anmyachev anmyachev force-pushed the add-repartition-api branch from 7284fe8 to 1ff556f Compare December 7, 2022 15:06
@anmyachev
Collaborator Author

@YarShev ready for review

Collaborator

@vnlitvinov vnlitvinov left a comment

We have PandasDataframePartitionManager.rebalance_partitions(), which is much more advanced. I wonder why we aren't using it here, and what the difference is between this new API and that function?

@anmyachev
Collaborator Author

It is not implemented for columns.
In addition, the repartition condition is not suitable for the workflow that we are speeding up (repartitioning occurs only when there are 1.5 times more partitions). Because of the lack of such functionality, I have to temporarily use the following hack: pd.DataFrame(try_cast_to_pandas(df)).
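
A threshold-based condition of the kind described above can be sketched as follows. The 1.5 factor comes from the comment; the function name, the comparison against an ideal partition count, and the exact inequality are assumptions, not the actual rebalance_partitions() logic.

```python
def should_rebalance(n_partitions: int, n_ideal: int, threshold: float = 1.5) -> bool:
    """Return True only when there are more than threshold * n_ideal partitions."""
    return n_partitions > threshold * n_ideal


# With 8 ideal partitions, the cutoff is 1.5 * 8 = 12 partitions.
decisions = [should_rebalance(n, 8) for n in (13, 12, 1)]
print(decisions)
```

A check of this shape only fires when there are too many partitions. A frame with too few partitions, or one that sits just under the cutoff, is left as it is, which is one reason an explicit repartitioning call can still be useful.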

anmyachev and others added 5 commits December 8, 2022 18:45
Signed-off-by: Anatoly Myachev <[email protected]>
Co-authored-by: Iaroslav Igoshev <[email protected]>
Co-authored-by: Vasily Litvinov <[email protected]>
Signed-off-by: Anatoly Myachev <[email protected]>
Signed-off-by: Anatoly Myachev <[email protected]>
Comment on lines +282 to +283
- run: MODIN_BENCHMARK_MODE=True pytest modin/pandas/test/internals/test_benchmark_mode.py
- run: pytest modin/pandas/test/internals/test_repartition.py
Collaborator Author

This was not previously run on push, which affected the stability of the codecov results.

anmyachev and others added 2 commits December 9, 2022 12:48
Co-authored-by: Iaroslav Igoshev <[email protected]>
Signed-off-by: Anatoly Myachev <[email protected]>
@YarShev YarShev changed the title FEAT-#5367: introduce new API for repartitioning Modin objects FEAT-#5367: Introduce new API for repartitioning Modin objects Dec 9, 2022
Collaborator

@vnlitvinov vnlitvinov left a comment

LGTM

@vnlitvinov vnlitvinov merged commit 704ded9 into modin-project:master Dec 10, 2022
@anmyachev anmyachev deleted the add-repartition-api branch March 24, 2023 12:11