Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use Narwhals to support Polars and Modin DataFrames #341

Open
wants to merge 9 commits into
base: main
Choose a base branch
from

Conversation

mwouts
Copy link
Owner

@mwouts mwouts commented Dec 1, 2024

This PR add tests for Modin dataframes. It is on top of @DeaMariaLeon's PR to add support for more DataFrame types using Narwhals at #339 .

From this quick experiment I would say that the nbytes function (estimate the size of the dataframe in bytes) and the downsample functions (keep left and right most columns, top and bottom rows) could benefit a bit more from Narwhals.

Copy link

github-actions bot commented Dec 1, 2024

Thank you for making this pull request.

Did you know? You can try it on Binder: Binder:lab.

Also, the version of ITables developed in this PR can be installed with pip:

pip install git+https://github.com/mwouts/itables.git@use_narwhals

(this requires nodejs, see more at Developing ITables)

@codecov-commenter
Copy link

codecov-commenter commented Dec 1, 2024

Codecov Report

Attention: Patch coverage is 84.47205% with 25 lines in your changes missing coverage. Please review.

Project coverage is 91.95%. Comparing base (a3dfd99) to head (f3d81a2).

Files with missing lines Patch % Lines
tests/test_ibis.py 27.77% 13 Missing ⚠️
src/itables/sample_dfs.py 79.31% 12 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #341      +/-   ##
==========================================
- Coverage   94.76%   91.95%   -2.81%     
==========================================
  Files          24       26       +2     
  Lines        1280     1343      +63     
==========================================
+ Hits         1213     1235      +22     
- Misses         67      108      +41     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Remove support for Python 3.7
@mwouts mwouts changed the title WIP: Add test for Modin dataframes Use Narwhals to support Polars and Modin DataFrames Dec 15, 2024
@mwouts mwouts added this to the 2.4 milestone Dec 15, 2024
@mwouts mwouts marked this pull request as ready for review December 15, 2024 21:30
@mwouts mwouts force-pushed the use_narwhals branch 2 times, most recently from d226035 to ae00927 Compare December 15, 2024 22:31
@mwouts
Copy link
Owner Author

mwouts commented Dec 26, 2024

@DeaMariaLeon @MarcoGorelli I hope to finalize this PR in the coming days (sorry for not being able to find time for that earlier!)

I plan to

  • update the docs to mention the fact that we use narwhals to support the other types of dataframes
  • test a few more types of dataframes
  • see if I can do something about the performance1 when displaying Modin dataframes

Footnotes

  1. Running the tests on the modin dataframes takes twice longer, and some dataframes take a long time to display, see e.g. the "wide" dataframe in the documentation that takes 318ms when using Pandas, 68ms when using Polars (through Narwhals), and 3.54s when using Modin!!
    image

@mwouts
Copy link
Owner Author

mwouts commented Dec 26, 2024

I can't seem to be able to use Narwhals for rendering Ibis dataframes - at least I can't call len(nw.from_native(...)) on either

t = ibis.table(dict(one="string", two="float", three="int32"), name="my_data")

or

t = ibis.memtable(pd.DataFrame(
    [["a", 1, 2], ["b", 3, 4]],
    columns=["one", "two", "three"],
    index=[5, 6],
), name="t")

The error I get is the following:

NotImplementedError: Attribute __len__ is not supported for metadata-only dataframes.

If you would like to see this kind of object better supported in Narwhals, please open a feature request at https://github.com/narwhals-dev/narwhals/issues.

@MarcoGorelli
Copy link

MarcoGorelli commented Dec 26, 2024

hey @mwouts - at the moment, our support for Ibis is only "interchange-level" (https://narwhals-dev.github.io/narwhals/extending/#interchange-only-support). This means you can only do the following:

  • .schema, hence column names via .schema.names() and column types via .schema.dtypes()
    -.columns
  • .to_pandas() and .to_arrow(), for converting to Pandas and Arrow, respectively.
  • .select(names) (Ibis and DuckDB), where names is a list of (string) column names. This is useful for selecting columns before converting to another library.

The plan is to implement lazy support for them relatively soon in early 2025: narwhals-dev/narwhals#1657. We've got something cooking for PySpark and DuckDB, but we'll do the same for Ibis too. However, even once we have Ibis supported, calling len on a narwhals.LazyFrame won't be supported. Indeed, it's not even supported for ibis.Table, as they follow a deferred execution model

What you could do now (and this is what Marimo do) is:

df = nw.from_native(df_native)
if nw.get_level(df) == 'interchange':
    df = df.to_pandas()  # or df.to_arrow(), if you support PyArrow

and then continue with the rest of the logic

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants