feat(datasets) Add VerticalEvenPartitioner #4692

Merged: 21 commits, Dec 20, 2024


adam-narozniak
Contributor

@adam-narozniak adam-narozniak commented Dec 12, 2024

Adds a new class that produces partitions with an even, or approximately even, number of columns.

This class can't be a subclass of VerticalSizePartitioner because the partition sizes cannot be specified accurately at init time, either as fractions or as counts. For example:

import pandas as pd

n_cols = 104
columns = [f"{i}" for i in range(n_cols)]
dataset = _create_dummy_dataset(columns, num_rows=50)
# Result when VerticalEvenPartitioner is implemented as a subclass
partitioner = VerticalEvenPartitioner(num_partitions=100, shuffle=False)
partitioner.dataset = dataset
# Count how many partitions have each number of columns
pd.Series(
    [partitioner.load_partition(i).num_columns for i in range(100)]
).value_counts()
# 1    99
# 5     1
# Name: count, dtype: int64

With the new class, the same check gives:

1    96
2     4
Name: count, dtype: int64

The per-partition column counts (index is the partition ID) are:

0     2
1     2
2     2
3     2
4     1
     ..
95    1
96    1
97    1

So why is it not possible to inherit?
The operation that creates either the counts or the fractions needs to happen at init time (to pass the args to VerticalSizePartitioner). However, because the dataset is assigned late, the number of columns is not known at that point. So it's not possible to use counts, nor to assign fractions in any better way than [1/num_partitions] * num_partitions, which then leaves the remainder columns to be assigned to a single partition.
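The arithmetic behind the two results above can be sketched as follows. This is only an illustration, not the code from this PR: `even_column_counts` and `fraction_column_counts` are hypothetical helpers, and the second one assumes that equal fractions are floored per partition and the leftover columns go to the last partition, which is consistent with the `5     1` row in the first output.

```python
from collections import Counter


def even_column_counts(n_cols: int, num_partitions: int) -> list[int]:
    """Hypothetical helper: split n_cols as evenly as possible."""
    base, remainder = divmod(n_cols, num_partitions)
    # The first `remainder` partitions receive one extra column.
    return [base + 1 if i < remainder else base for i in range(num_partitions)]


def fraction_column_counts(n_cols: int, num_partitions: int) -> list[int]:
    """Hypothetical helper: equal fractions, leftover goes to the last partition.

    Assumption: each fraction 1/num_partitions is floored to a column count.
    """
    counts = [n_cols // num_partitions] * num_partitions
    counts[-1] += n_cols - sum(counts)
    return counts


even = even_column_counts(104, 100)
fraction_based = fraction_column_counts(104, 100)

print(Counter(even))            # Counter({1: 96, 2: 4})
print(Counter(fraction_based))  # Counter({1: 99, 5: 1})
```

The even split needs `n_cols`, which is only known once the dataset is assigned. That is why it cannot be computed in `__init__` and passed on to a parent class.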

@adam-narozniak adam-narozniak changed the title feat(datasets) Add EvenVerticalPartitioner feat(datasets) Add VerticalEvenPartitioner Dec 13, 2024
jafermarq
jafermarq previously approved these changes Dec 19, 2024
@jafermarq jafermarq enabled auto-merge (squash) December 19, 2024 16:20
@jafermarq jafermarq disabled auto-merge December 20, 2024 06:51
@jafermarq jafermarq merged commit 8ec601e into main Dec 20, 2024
61 checks passed
@jafermarq jafermarq deleted the fds-add-vertical-even-partitioner branch December 20, 2024 09:46