feat(rust): POC metadata reading and writing #17112

Merged: 1 commit, Jun 24, 2024

coastalwhite
Collaborator

This PR illustrates a basic workflow for keeping track of and utilizing metadata. Currently, this is still all behind `POLARS_METADATA_USE=experimental`, but now we can start implementing a lot of places where metadata can be inferred or used.

In this PR, I implemented all the infrastructure necessary to write the metadata from `chunked_array::ops` and to read the metadata from `polars-expr`.

The writing is done using interior mutability. I attempted several approaches, and this is the one I settled on as the easiest to work with while not adding too much overhead.
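As a rough illustration of the interior-mutability idea (not the code from this PR), the sketch below caches a minimum behind a lock so that a method taking `&self` can still record it. The `Column` and `Metadata` types, the `cached_min` helper, and the choice of `RwLock` are all assumptions made for the example.

```rust
use std::sync::RwLock;

/// Hypothetical metadata record; the real structure may hold more fields.
#[derive(Debug, Default)]
struct Metadata {
    min: Option<i64>,
}

/// Hypothetical column type that keeps its metadata behind a lock.
struct Column {
    values: Vec<i64>,
    md: RwLock<Metadata>,
}

impl Column {
    fn new(values: Vec<i64>) -> Self {
        Self { values, md: RwLock::new(Metadata::default()) }
    }

    /// Takes `&self`, yet records the result for later calls.
    fn min(&self) -> Option<i64> {
        // Fast path: reuse a previously recorded minimum.
        if let Some(m) = self.md.read().unwrap().min {
            return Some(m);
        }
        // Slow path: scan the data, then write the result through the lock.
        let m = self.values.iter().copied().min();
        self.md.write().unwrap().min = m;
        m
    }

    /// Returns the recorded minimum without scanning the data.
    fn cached_min(&self) -> Option<i64> {
        self.md.read().unwrap().min
    }
}
```

The first call to `min` scans the values and stores the result; later calls return the stored value without touching the data.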

The reading is done through a trait object over the `MetadataTrait`, which makes all data available as `Scalar`s.
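To show why a trait object is convenient here, the following sketch lets a reader fetch a minimum without knowing the element type. Everything in it is hypothetical: the `Scalar` enum is a stand-in for the real type, the `min_value` method and the `IntMetadata`/`FloatMetadata` structs are invented for the example, and the actual `MetadataTrait` may look different.

```rust
/// Hypothetical dynamically typed value, standing in for a `Scalar`.
#[derive(Debug, Clone, PartialEq)]
enum Scalar {
    Int(i64),
    Float(f64),
    Null,
}

/// Hypothetical object-safe trait; the real `MetadataTrait` may differ.
trait MetadataTrait {
    fn min_value(&self) -> Scalar;
}

struct IntMetadata {
    min: Option<i64>,
}

struct FloatMetadata {
    min: Option<f64>,
}

impl MetadataTrait for IntMetadata {
    fn min_value(&self) -> Scalar {
        // An unknown minimum is reported as `Null`.
        self.min.map_or(Scalar::Null, Scalar::Int)
    }
}

impl MetadataTrait for FloatMetadata {
    fn min_value(&self) -> Scalar {
        self.min.map_or(Scalar::Null, Scalar::Float)
    }
}

/// A reader that does not need to know the concrete element type.
fn read_min(md: &dyn MetadataTrait) -> Scalar {
    md.min_value()
}
```

Because every implementation returns the same `Scalar` type, the reading side can stay non-generic.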

The following snippet illustrates that it works.

```python
import polars as pl
import numpy as np
import os
from timeit import timeit

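def many_mins():
    df = pl.DataFrame({ 'a': pl.Series(np.arange(500000)) })

    # Request the minimum of the same column repeatedly.
    for _ in range(10):
        df = df.with_columns(amin = pl.col.a.min())

# Time the run with the experimental metadata flag set.
os.environ["POLARS_METADATA_USE"] = "experimental"
with_md = timeit(many_mins, number=100)
# Time it again with the flag set to a non-experimental value.
os.environ["POLARS_METADATA_USE"] = "1"
without_md = timeit(many_mins, number=100)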

print(f'without md = {without_md}')
print(f'with md    = {with_md}')
```

This outputs:

```
without md = 4.232960837998689
with md    = 0.6040126650004822
```
@github-actions github-actions bot added enhancement New feature or an improvement of an existing feature rust Related to Rust Polars labels Jun 21, 2024
@ritchie46 ritchie46 merged commit b60788d into pola-rs:main Jun 24, 2024
22 checks passed
@coastalwhite coastalwhite deleted the md-proof-of-concept branch June 24, 2024 11:12
@c-peters c-peters added the accepted Ready for implementation label Jul 1, 2024
3 participants