FEAT-#5367: Introduce new API for repartitioning Modin objects #5366

anmyachev · 2022-12-06T18:35:55Z

Signed-off-by: Myachev [email protected]

What do these changes do?

first commit message and PR title follow format outlined here

NOTE: If you edit the PR title to match this format, you need to add another commit (even if it's empty) or amend your last commit for the CI job that checks the PR title to pick up the new PR title.
passes flake8 modin/ asv_bench/benchmarks scripts/doc_checker.py
passes black --check modin/ asv_bench/benchmarks scripts/doc_checker.py
signed commit with git commit -s
Resolves FEAT: create API for repartitioning Modin objects #5367
tests added and passing
module layout described at docs/development/architecture.rst is up-to-date

Signed-off-by: Myachev <[email protected]>

anmyachev · 2022-12-06T19:51:36Z

@dchigarev could you take a look?

Signed-off-by: Myachev <[email protected]>

dchigarev

Overall looks good!

The only change I would propose is to move the function to the modin.distributed.dataframe.pandas module, which seems more suitable for partition-related functionality.

modin/utils.py

Signed-off-by: Myachev <[email protected]>

anmyachev · 2022-12-06T21:51:08Z

modin/distributed/dataframe/pandas/partitions.py

+                _ax, lambda df: df, keep_partitioning=False
+            )
+        )
+    return df.__constructor__(query_compiler=new_query_compiler)  # type:ignore


mypy error: error: Call to untyped function "__constructor__" in typed context [no-untyped-call]
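
For context, mypy reports `no-untyped-call` when an annotated function calls an unannotated one and the `disallow_untyped_calls` option (enabled by `--strict`) is on. The following is a minimal sketch of that situation and of a suppression scoped to a single error code. `Frame` and `rebuild` are hypothetical stand-ins for illustration, not Modin code, and the method form of `__constructor__` here is an assumption made only to reproduce the error.

```python
# Hypothetical stand-ins; none of these definitions come from Modin.
class Frame:
    def __init__(self, payload):  # no annotations, so mypy treats it as untyped
        self.payload = payload

    def __constructor__(self, payload):  # untyped, like the call flagged above
        return type(self)(payload)


def rebuild(frame: Frame, payload: int) -> Frame:
    # With disallow_untyped_calls enabled, mypy flags the call below.
    # Naming the error code silences only that check on this line,
    # whereas a bare "# type: ignore" would hide every error on it.
    result: Frame = frame.__constructor__(payload)  # type: ignore[no-untyped-call]
    return result


print(rebuild(Frame(1), 2).payload)  # prints 2
```

Scoping the ignore to one error code keeps other checks on that line active, which may be preferable if the call site changes later.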

anmyachev · 2022-12-07T10:02:36Z

@dchigarev ready for review

dchigarev

Looks good to me.

That said, I would like to hear more opinions from @modin-project/modin-core about this new API.

dchigarev · 2022-12-07T10:12:12Z

modin/distributed/dataframe/pandas/partitions.py

+    Returns
+    -------
+    DataFrame or Series
+        The repartitioned dataframe or series, depending on the original type.


Can we add a section with some examples (what bad partitioning is, when one should use this function, what speed-up repartitioning can bring), probably with a reference to the corresponding page in the Modin docs?

In order to be in time for the next release, I would prefer to do it separately.

Users sometimes ask why performance is bad. First, we can ask the user how many internal partitions their object has and suggest this feature if necessary. Once the performance guide is improved, they will be able to do it without us.

YarShev · 2022-12-07T12:09:08Z

A couple of comments from my side.

Is there an example where we would benefit from this new API?
from_partitions and unwrap_partitions were originally added for those developing ML tooling, so that they could retrieve the underlying partitions and pass them to ML components. I am not sure the new repartition API suits those who use the pandas API and want to move to Modin; it would make the move from pandas to Modin more difficult.
I think repartitioning should happen transparently to users.

anmyachev · 2022-12-07T12:27:09Z

> A couple of comments from my side.
>
> Is there an example where we would benefit from this new API?
>
> from_partitions and unwrap_partitions were originally added for those developing ML tooling, so that they could retrieve the underlying partitions and pass them to ML components. I am not sure the new repartition API suits those who use the pandas API and want to move to Modin; it would make the move from pandas to Modin more difficult.
>
> I think repartitioning should happen transparently to users.

This should help in the case of "MultiIndex takes up a huge amount of storage space" (#5247).
Are you suggesting moving it to another location? If so, where?
We don't have implicit repartitioning along the column axis yet. Also, if we implement repartitioning in the same way as for the row axis, I'm not sure it will help in the case described in #5247.

I guess I can make this function a private method of DataFrame and Series, just like _to_pandas.
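
To make the idea of repartitioning along the row axis concrete, here is a minimal pure-Python sketch that models partitions as lists of rows, concatenates them, and splits the result into evenly sized chunks. It only illustrates the concept under discussion; the `repartition` function and its `n_parts` parameter are hypothetical and do not reflect how Modin implements the feature.

```python
from typing import List, Sequence


def repartition(partitions: Sequence[Sequence[int]], n_parts: int) -> List[List[int]]:
    """Concatenate row partitions and split them into n_parts balanced chunks."""
    if n_parts <= 0:
        raise ValueError("n_parts must be positive")
    rows = [row for part in partitions for row in part]
    # The first `extra` chunks get one additional row each.
    base, extra = divmod(len(rows), n_parts)
    result: List[List[int]] = []
    start = 0
    for i in range(n_parts):
        size = base + (1 if i < extra else 0)
        result.append(rows[start:start + size])
        start += size
    return result


# Skewed layout: one large partition followed by several tiny ones.
skewed = [list(range(8)), [8], [9], [10], [11]]
balanced = repartition(skewed, n_parts=4)
print([len(p) for p in skewed])    # prints [8, 1, 1, 1, 1]
print([len(p) for p in balanced])  # prints [3, 3, 3, 3]
```

Row order is preserved; only the chunk boundaries change, which is the property a repartitioning step would be expected to keep.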

dchigarev · 2022-12-07T12:36:00Z

@YarShev

> Is there an example where we would benefit from this new API?

I would say that the introduced API is a kind of temporary hack for users' code while we deal with the actual issue that caused the problematic partitioning.

P.S. Another concrete case where this API helps is #5296 (compare the time measurements with "default" and "proper" partitioning).

> I think repartitioning should happen transparently to users.

Sure, this API doesn't imply that users have to master the partitioning mechanism and tune it themselves to fix performance. As I said earlier, it's meant to be a one-line hack that improves performance significantly right now, not weeks later when we merge an actual fix for the partitioning problem the user has faced.

YarShev · 2022-12-07T13:18:01Z

Okay, I am in favor of <df/s>._repartition() as a temporary solution for the problems mentioned. Can you also create an issue for the partitioning problem, so that repartitioning happens transparently to users where necessary?

@modin-project/modin-core, other thoughts?

Signed-off-by: Myachev <[email protected]>

anmyachev · 2022-12-07T13:56:56Z

@dchigarev @YarShev ready for review

modin/pandas/base.py

Signed-off-by: Myachev <[email protected]>

anmyachev · 2022-12-07T15:11:52Z

@YarShev ready for review

vnlitvinov

We have PandasDataframePartitionManager.rebalance_partitions(), which is much more advanced. I wonder why we aren't using it here, and what the difference is between this new API and that function?

modin/pandas/base.py

anmyachev · 2022-12-07T22:32:17Z

> We have PandasDataframePartitionManager.rebalance_partitions(), which is much more advanced. I wonder why we aren't using it here, and what the difference is between this new API and that function?

It is not implemented for columns.
In addition, its repartition condition does not suit the workflow we are speeding up (repartitioning occurs only when there are 1.5 times more partitions). Due to the lack of such functionality, I have to temporarily use the following hack: pd.DataFrame(try_cast_to_pandas(df)).
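
The condition mentioned above can be pictured as a simple threshold check. The sketch below reads "1.5 times more partitions" as "more than 1.5 times a target partition count"; that reading, the strict inequality, and the names `should_rebalance`, `target_partitions`, and `factor` are all assumptions for illustration, not the actual logic of rebalance_partitions().

```python
def should_rebalance(num_partitions: int, target_partitions: int, factor: float = 1.5) -> bool:
    """Return True when the partition count exceeds factor * target (assumed rule)."""
    return num_partitions > factor * target_partitions


# With a hypothetical target of 16 partitions:
print(should_rebalance(20, 16))  # prints False (20 is not above 24)
print(should_rebalance(30, 16))  # prints True (30 is above 24)
print(should_rebalance(4, 16))   # prints False (too few partitions never triggers)
```

Under this reading, an object with fewer partitions than the target never triggers the check, so an explicit repartitioning call is the only way to change its layout.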

modin/pandas/test/internals/test_repartition.py

modin/pandas/base.py

modin/core/storage_formats/base/query_compiler.py

Signed-off-by: Anatoly Myachev <[email protected]>

Co-authored-by: Iaroslav Igoshev <[email protected]>

Co-authored-by: Vasily Litvinov <[email protected]>

Signed-off-by: Anatoly Myachev <[email protected]>

anmyachev · 2022-12-08T17:56:10Z

.github/workflows/push.yml

+      - run: MODIN_BENCHMARK_MODE=True pytest modin/pandas/test/internals/test_benchmark_mode.py
+      - run: pytest modin/pandas/test/internals/test_repartition.py


This was not previously tested on push, which affects the stability of the codecov results.

Signed-off-by: Anatoly Myachev <[email protected]>

modin/core/dataframe/pandas/dataframe/dataframe.py

modin/core/storage_formats/base/query_compiler.py

modin/pandas/base.py

modin/pandas/test/internals/test_repartition.py

modin/core/storage_formats/base/query_compiler.py

Co-authored-by: Iaroslav Igoshev <[email protected]>

Signed-off-by: Anatoly Myachev <[email protected]>

modin/core/dataframe/pandas/dataframe/dataframe.py

vnlitvinov

LGTM

anmyachev added 2 commits December 6, 2022 19:35

FEAT-#0000: introduce repartition api for modin.utils

ed9ebe5

Signed-off-by: Myachev <[email protected]>

add test and improve func

a8ed713

Signed-off-by: Myachev <[email protected]>

anmyachev changed the title ~~FEAT-#0000: introduce repartition api for modin.utils~~ FEAT-#5367: introduce new API for repartitioning Modin objects Dec 6, 2022

fixes

b08fcdd

Signed-off-by: Myachev <[email protected]>

mute mypy

a58a7f3

Signed-off-by: Myachev <[email protected]>

dchigarev reviewed Dec 6, 2022

modin/utils.py (outdated, resolved)

modin/utils.py (outdated, resolved)

move repartition into modin.distributed.dataframe.pandas module

66d97c9

Signed-off-by: Myachev <[email protected]>

anmyachev force-pushed the add-repartition-api branch from af3ca93 to 66d97c9 Compare December 6, 2022 20:33

anmyachev added 4 commits December 6, 2022 21:51

fix

103cc24

Signed-off-by: Myachev <[email protected]>

fix

54c0597

Signed-off-by: Myachev <[email protected]>

add fast path for hdk

09e99e9

Signed-off-by: Myachev <[email protected]>

add note into docs

77e5199

Signed-off-by: Myachev <[email protected]>

anmyachev commented Dec 6, 2022

anmyachev marked this pull request as ready for review December 6, 2022 21:51

anmyachev requested a review from a team as a code owner December 6, 2022 21:51

anmyachev added the Ready for review label Dec 6, 2022

dchigarev previously approved these changes Dec 7, 2022

anmyachev dismissed dchigarev’s stale review via 865b746 December 7, 2022 13:47

anmyachev force-pushed the add-repartition-api branch 3 times, most recently from 56a8f07 to 6d06e94 Compare December 7, 2022 13:52

make repartition as internal method of DataFrame and Series

0b81388

Signed-off-by: Myachev <[email protected]>

anmyachev force-pushed the add-repartition-api branch from 6d06e94 to 0b81388 Compare December 7, 2022 13:53

YarShev reviewed Dec 7, 2022

modin/pandas/base.py (resolved)

modin/pandas/base.py (outdated, resolved)

modin/pandas/base.py (outdated, resolved)

address review comments

6ba2dea

Signed-off-by: Myachev <[email protected]>

anmyachev force-pushed the add-repartition-api branch from 1e9017e to 7284fe8 Compare December 7, 2022 15:05

address review comments

1ff556f

Signed-off-by: Myachev <[email protected]>

anmyachev force-pushed the add-repartition-api branch from 7284fe8 to 1ff556f Compare December 7, 2022 15:06

vnlitvinov reviewed Dec 7, 2022

modin/pandas/base.py (outdated, resolved)

modin/pandas/base.py (outdated, resolved)

modin/pandas/base.py (outdated, resolved)

YarShev reviewed Dec 8, 2022

anmyachev and others added 5 commits December 8, 2022 18:45

use test_repartition.py in CI

050749a

Signed-off-by: Anatoly Myachev <[email protected]>

Apply suggestions from code review

830fa08

Co-authored-by: Iaroslav Igoshev <[email protected]>

Update modin/pandas/base.py

e3ae362

Co-authored-by: Vasily Litvinov <[email protected]>

address review comments

8f67b13

Signed-off-by: Anatoly Myachev <[email protected]>

fixes

66b0e12

Signed-off-by: Anatoly Myachev <[email protected]>

anmyachev commented Dec 8, 2022

anmyachev added 2 commits December 8, 2022 18:57

doc fixes

1092b3e

Signed-off-by: Anatoly Myachev <[email protected]>

align amount of partitions with the rest of test files

f101c6b

Signed-off-by: Anatoly Myachev <[email protected]>

YarShev reviewed Dec 9, 2022

anmyachev commented Dec 9, 2022

modin/core/storage_formats/base/query_compiler.py (outdated, resolved)

anmyachev and others added 2 commits December 9, 2022 12:48

Apply suggestions from code review

37c7db6

Co-authored-by: Iaroslav Igoshev <[email protected]>

add explaining of axis parameter

72552df

Signed-off-by: Anatoly Myachev <[email protected]>

anmyachev commented Dec 9, 2022

modin/core/dataframe/pandas/dataframe/dataframe.py (resolved)

YarShev changed the title ~~FEAT-#5367: introduce new API for repartitioning Modin objects~~ FEAT-#5367: Introduce new API for repartitioning Modin objects Dec 9, 2022

YarShev approved these changes Dec 9, 2022

vnlitvinov approved these changes Dec 10, 2022

vnlitvinov merged commit 704ded9 into modin-project:master Dec 10, 2022

anmyachev deleted the add-repartition-api branch March 24, 2023 12:11
