Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Python] Don't initiate the ScalarMemoTable when it isn't needed #40316

Closed
anjakefala opened this issue Mar 1, 2024 · 1 comment
Closed

[Python] Don't initiate the ScalarMemoTable when it isn't needed #40316

anjakefala opened this issue Mar 1, 2024 · 1 comment

Comments

@anjakefala
Copy link
Collaborator

Describe the bug, including details regarding any error messages, version, and platform.

This is the use-case for why this is important: #40301

The memo table is only used if deduplicate_objects=True, but it gets instantiated regardless.

Component(s)

Python

@anjakefala anjakefala changed the title Don't initiate the ScalarMemoTable when it isn't needed [Python] Don't initiate the ScalarMemoTable when it isn't needed Mar 1, 2024
@anjakefala anjakefala self-assigned this Mar 8, 2024
pitrou added a commit that referenced this issue Mar 20, 2024
### Rationale for this change

Mimalloc and jemalloc can allocate a [relatively large amount of memory for the ScalarMemoTable](#40301). For this reason, the ScalarMemoTable should only be allocated when it is used (when `options.deduplicate_objects=True`).

I tested this change, and for small tables it does improve memory allocation.

`options.deduplicate_objects=False`

After this change:

📦 Total memory allocated:
        174.422MB

📊 Histogram of allocation size:
        min: 1.000B
        --------------------------------------------
        < 6.000B   :  3064 ▇▇
        < 36.000B  :  7533 ▇▇▇▇
        < 222.000B :  9974 ▇▇▇▇▇
        < 1.319KB  : 53264 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇
        < 7.999KB  :  5188 ▇▇▇
        < 48.503KB :   742 ▇
        < 294.066KB:   102 ▇
        < 1.741MB  :    22 ▇
        < 10.556MB :     1 ▇
        <=64.000MB :     1 ▇
        --------------------------------------------
        max: 64.000MB

Before this change:

📦 Total memory allocated:
        1.295GB

📊 Histogram of allocation size:
        min: 1.000B
        --------------------------------------------
        < 6.000B   :  3064 ▇▇
        < 36.000B  :  7543 ▇▇▇▇
        < 222.000B : 10009 ▇▇▇▇▇
        < 1.319KB  : 53269 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇
        < 7.999KB  :  5192 ▇▇▇
        < 48.503KB :   761 ▇
        < 294.066KB:   102 ▇
        < 1.741MB  :    22 ▇
        < 10.556MB :     1 ▇
        <=64.000MB :    19 ▇
        --------------------------------------------
        max: 64.000MB

### What changes are included in this PR?

The allocation of `memo_table` and `unique_values` have been moved underneath an `if (options.deduplicate_objects)` block. Since they are used within a lambda, they have been changed to shared pointers, so that their values exist for the lifetime needed.

### Are these changes tested?

`deduplicate_objects` has extensive existing tests: https://github.com/apache/arrow/blob/b235f83ed10bcad174b267113479a24ca045def5/python/pyarrow/tests/test_pandas.py#L3211 and https://github.com/apache/arrow/blob/b235f83ed10bcad174b267113479a24ca045def5/python/benchmarks/convert_pandas.py#L71

### Are there any user-facing changes?

Nope.
* GitHub Issue: #40316

Lead-authored-by: anjakefala <[email protected]>
Co-authored-by: Antoine Pitrou <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>
@pitrou pitrou added this to the 16.0.0 milestone Mar 20, 2024
@pitrou
Copy link
Member

pitrou commented Mar 20, 2024

Issue resolved by pull request 40565
#40565

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants