MultiIndex takes up a huge amount of storage space #5247
Comments
+1. Ideally in pandas we'll refactor the Manager classes to not have the axes at all (xref pandas #48126), and subsequently modin partitions can hold Managers instead of DataFrames.
Signed-off-by: Dmitry Chigarev <[email protected]>
As a temporary solution, we could try calling remove_unused_levels() on each slice of the index. I've made the following changes to the script above and measured:
  for idx in range(count_cpus):
      refs[idx] = ray.put(
-         index[idx * one_part : (idx + 1) * one_part]
+         index[idx * one_part : (idx + 1) * one_part].remove_unused_levels()
      )  # it takes ~ 3210 MiB in storage
P.S. It also obviously works much faster, because lighter objects are put into the plasma store.
I'm wondering what are the disadvantages of using
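To illustrate why trimming helps, here is a minimal sketch that uses only pandas and pickle. It assumes that the pickled size is a reasonable stand-in for what ends up in the object store; the sizes `n_rows` and `n_parts` are hypothetical and chosen only to keep the example fast. It does not reproduce the measurements from the script above.

```python
import pickle

import numpy as np
import pandas as pd

# Hypothetical sizes, chosen only to keep the example fast.
n_rows = 100_000
n_parts = 4

# A two-level MultiIndex with one unique label per row on each level.
index = pd.MultiIndex.from_arrays(
    [np.arange(n_rows), np.arange(n_rows, 2 * n_rows)],
    names=["a", "b"],
)

one_part = n_rows // n_parts
raw_slice = index[:one_part]
trimmed_slice = raw_slice.remove_unused_levels()

# The raw slice keeps every label of the full index in its levels;
# the trimmed slice keeps only the labels it actually uses.
print(len(raw_slice.levels[0]), len(trimmed_slice.levels[0]))

# Pickled size is used here as a rough proxy for serialized size (assumption).
raw_bytes = len(pickle.dumps(raw_slice))
trimmed_bytes = len(pickle.dumps(trimmed_slice))
print(raw_bytes, trimmed_bytes)
```

Both slices contain the same entries; only the level arrays they carry differ.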
Real reason is most likely that it hasn't been suggested. Only downside that comes to mind is if you split a MultiIndex, then drop_unused_categories, then want to compare/concat/setop the results, it is more efficient to have known-matching levels.
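The trade-off mentioned here can be seen in a small pandas-only sketch. The index contents are made up for illustration. Untrimmed slices share identical level arrays, while trimmed slices each carry their own, so the pieces no longer have known-matching levels when they are recombined.

```python
import pandas as pd

# A small illustrative index (hypothetical data).
index = pd.MultiIndex.from_product([["x", "y", "z"], [1, 2]])

left = index[:3]
right = index[3:]

# Untrimmed slices still carry the full, identical level arrays.
shared = left.levels[0].equals(right.levels[0])

# After trimming, each slice keeps only the labels it uses,
# so the level arrays of the two pieces no longer match.
left_trimmed = left.remove_unused_levels()
right_trimmed = right.remove_unused_levels()
matching = left_trimmed.levels[0].equals(right_trimmed.levels[0])

# Recombining still gives the same entries, but the levels
# have to be reconciled rather than reused as-is.
rebuilt = left_trimmed.append(right_trimmed)

print(shared, matching, rebuilt.equals(index))
```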
The situation worsens when the workflow code contains a large number of single-column inserts, which produce a large number of column partitions, each of which stores its own copy of the MultiIndex.
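The effect of per-partition index copies can be approximated without modin. The sketch below is an assumption-laden stand-in: single-column pandas DataFrames play the role of column partitions, pickled size plays the role of storage size, and `n_rows` and `n_cols` are hypothetical.

```python
import pickle

import numpy as np
import pandas as pd

# Hypothetical sizes for illustration only.
n_rows, n_cols = 10_000, 8

index = pd.MultiIndex.from_arrays([np.arange(n_rows), np.arange(n_rows)])
frame = pd.DataFrame(
    np.zeros((n_rows, n_cols)),
    index=index,
    columns=[f"c{i}" for i in range(n_cols)],
)

# One object holding all columns: the index is serialized once.
whole_bytes = len(pickle.dumps(frame))

# Stand-in for column partitions: one single-column frame per column,
# each serialized separately and therefore carrying its own index copy.
column_parts = [frame[[name]] for name in frame.columns]
split_bytes = sum(len(pickle.dumps(part)) for part in column_parts)

index_bytes = len(pickle.dumps(index))
print(whole_bytes, split_bytes, index_bytes)
```

The gap between `split_bytes` and `whole_bytes` grows with the number of column partitions, since each one adds another serialized copy of the index.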
Solutions to this problem can be:
Code to reproduce: