[CODE REVIEW ONLY] Host tree algorithms #12

Open
wants to merge 105 commits into
base: enh-json_code_reorg1
from

Conversation

karthikeyann
Owner

@karthikeyann karthikeyann commented Sep 19, 2024

DO NOT MERGE.
For review purposes only.

shrshi and others added 29 commits August 13, 2024 15:37
These are the changes that will be in the cudf-polars point release.

---------

Co-authored-by: Thomas Li <[email protected]>
Co-authored-by: David Wendt <[email protected]>
Co-authored-by: brandon-b-miller <[email protected]>
Co-authored-by: Vyas Ramasubramani <[email protected]>
Co-authored-by: brandon-b-miller <[email protected]>
Co-authored-by: Bradley Dice <[email protected]>
Co-authored-by: Manas Singh <[email protected]>
Co-authored-by: Manas Singh <[email protected]>
@github-actions github-actions bot added the Java label Sep 24, 2024
karthikeyann and others added 28 commits September 24, 2024 16:03
This PR updates the update-version.sh script to use the packaging
library, given that setuptools is no longer included by default in
Python 3.12.
We recently pinned our `dask-expr` version to `1.1.14` (rapidsai/rapids-dask-dependency#64). That pin, combined with the latest `dask`, appears to require `pyarrow` `14.0.1` or newer, which causes failures in our CI matrix when running tests with the oldest dependencies. This PR bumps the minimum `pyarrow` version in our oldest dependencies.

Authors:
  - GALI PREM SAGAR (https://github.com/galipremsagar)

Approvers:
  - Vyas Ramasubramani (https://github.com/vyasr)

URL: rapidsai#16883
…#16842)

Temporary workaround for dask/dask#11017 in Dask cuDF (when query-planning is enabled).
I will try to move this fix upstream soon. However, the next dask release will probably not be used by 24.10, and it's still unclear whether the same fix works for all CPU cases.

Authors:
  - Richard (Rick) Zamora (https://github.com/rjzamora)
  - GALI PREM SAGAR (https://github.com/galipremsagar)

Approvers:
  - Lawrence Mitchell (https://github.com/wence-)

URL: rapidsai#16842
This CMake option was removed by rapidsai#15483.

Authors:
  - Vyas Ramasubramani (https://github.com/vyasr)

Approvers:
  - James Lamb (https://github.com/jameslamb)

URL: rapidsai#16879
rapidsai#16562)

This PR makes more progress on rapidsai#14975 by adding an environment variable that raises an error when fallback occurs in cudf.pandas. It also adds some tests that do __not__ fall back.
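
The general pattern of an environment variable that turns a silent fallback into a hard failure can be sketched as follows. This is only an illustration and not the cudf.pandas implementation: the variable name `DEMO_FAIL_ON_FALLBACK`, the `FallbackError` class, and the `run_with_fallback` helper are all hypothetical.

```python
import os


class FallbackError(RuntimeError):
    """Raised when the fast path fails and fallback is disallowed."""


def run_with_fallback(fast, slow, *args, env_var="DEMO_FAIL_ON_FALLBACK"):
    """Try `fast`; fall back to `slow` unless the environment forbids it."""
    try:
        return fast(*args)
    except NotImplementedError as exc:
        # Hypothetical switch: any value other than "" or "0" means "fail loudly".
        if os.environ.get(env_var, "0") not in ("", "0"):
            raise FallbackError(f"fallback attempted for {fast.__name__}") from exc
        return slow(*args)


def fast_sum(values):
    # Stand-in for an accelerated path that does not support this input.
    raise NotImplementedError("accelerated path unavailable in this sketch")


def slow_sum(values):
    # Stand-in for the slower reference path.
    return sum(values)
```

With the variable unset, `run_with_fallback(fast_sum, slow_sum, [1, 2, 3])` quietly uses the slow path; with it set to `1`, the same call raises, which is what makes "tests that must not fall back" enforceable.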

Authors:
  - Matthew Murray (https://github.com/Matt711)

Approvers:
  - GALI PREM SAGAR (https://github.com/galipremsagar)

URL: rapidsai#16562
…apidsai#16899)

See rapidsai#16895
Closes rapidsai#16892

Dask-expr uses `rename_axis`, which is not supported by cudf yet. This is a temporary workaround until rapidsai#16895 is resolved.

Authors:
  - Richard (Rick) Zamora (https://github.com/rjzamora)
  - GALI PREM SAGAR (https://github.com/galipremsagar)

Approvers:
  - Mads R. B. Kristensen (https://github.com/madsbk)
  - GALI PREM SAGAR (https://github.com/galipremsagar)

URL: rapidsai#16899
…i#16886)

For releases, since the polars release cadence is quite a lot faster than rapids, we propose to hard-pin to a known good version. In this case, 1.8.x.

At the same time, remove pin in CI scripts and update list of xfailing tests in the polars test suite.

Authors:
  - Lawrence Mitchell (https://github.com/wence-)

Approvers:
  - James Lamb (https://github.com/jameslamb)
  - GALI PREM SAGAR (https://github.com/galipremsagar)

URL: rapidsai#16886
This PR adds `cudf-polars` to the top level build script.

Authors:
  - https://github.com/brandon-b-miller
  - GALI PREM SAGAR (https://github.com/galipremsagar)

Approvers:
  - GALI PREM SAGAR (https://github.com/galipremsagar)
  - Jake Awe (https://github.com/AyodeAwe)

URL: rapidsai#16898
Previously, when `columns=` was a `cudf.Series` or `cudf.Index`, we would call `return array.unique.to_pandas()`. However, `.unique` is a method, not a property, so this would have raised an error.

Also took the time to refactor the helper methods here and push down the `errors=` keyword to `Frame._drop_column`
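
The failure mode described above (chaining off a method without calling it) can be reproduced without cudf. In this sketch `FakeColumn` is a hypothetical stand-in class, and `to_list` stands in for `to_pandas` so that the example needs only the standard library.

```python
class FakeColumn:
    """Stand-in for a Series/Index-like object; not a cudf class."""

    def __init__(self, values):
        self._values = list(values)

    def unique(self):
        # dict.fromkeys drops duplicates while preserving first-seen order.
        return FakeColumn(dict.fromkeys(self._values))

    def to_list(self):
        return list(self._values)


col = FakeColumn(["a", "b", "a"])

try:
    # Missing call parentheses: this looks up `to_list` on a bound method.
    col.unique.to_list()
except AttributeError as exc:
    error = exc

# Calling the method first gives the intended result.
labels = col.unique().to_list()
```

The bound method object has no `to_list` attribute, so the first form fails with `AttributeError` every time that code path is reached.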

Authors:
  - Matthew Roeschke (https://github.com/mroeschke)

Approvers:
  - Bradley Dice (https://github.com/bdice)

URL: rapidsai#16712
This PR is a first pass at rapidsai#15937. We will close rapidsai#15937 after rapidsai#15162 is closed.

Authors:
  - Matthew Murray (https://github.com/Matt711)

Approvers:
  - GALI PREM SAGAR (https://github.com/galipremsagar)

URL: rapidsai#16810
Fixes rapidsai#16625

This PR fixes a slow implementation of the centroid merging step during the tdigest merge aggregation. Previously, it marched linearly over the individual tdigests in each group and merged them one by one, which led to terrible performance for large numbers of groups. In principle, though, all this was really doing was a segmented sort of centroid values, so that is what this PR changes it to. The speedup for 1,000,000 input tdigests with 1,000,000 individual groups is ~1000x:

```
Old
---------------------------------------------------------------------------------------------------------------
Benchmark                                                                     Time             CPU   Iterations
---------------------------------------------------------------------------------------------------------------
TDigest/many_tiny_groups/1000000/1/1/10000/iterations:8/manual_time        7473 ms         7472 ms            8
TDigest/many_tiny_groups2/1000000/1/1/1000/iterations:8/manual_time        7433 ms         7431 ms            8
```


```
New
---------------------------------------------------------------------------------------------------------------
Benchmark                                                                     Time             CPU   Iterations
---------------------------------------------------------------------------------------------------------------
TDigest/many_tiny_groups/1000000/1/1/10000/iterations:8/manual_time        6.72 ms         6.79 ms            8
TDigest/many_tiny_groups2/1000000/1/1/1000/iterations:8/manual_time        1.24 ms         1.32 ms            8
```
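
To illustrate why the one-by-one merge and a segmented sort are equivalent, here is a small CPU-only sketch. It is not the CUDA code from the PR: the data layout (a list of `(group_id, centroids)` pairs) and both function names are assumptions made for the example, and the compression step that follows merging in a real tdigest is omitted.

```python
from heapq import merge
from itertools import groupby

# Each input tdigest: (group_id, [(mean, weight), ...]), centroids sorted by mean.
digests = [
    (0, [(1.0, 2.0), (5.0, 1.0)]),
    (1, [(2.0, 1.0)]),
    (0, [(3.0, 1.0)]),
    (1, [(0.5, 4.0), (9.0, 1.0)]),
]


def merge_one_by_one(digests):
    """Linear march: fold each digest into its group's running centroid list."""
    merged = {}
    for group, centroids in digests:
        merged[group] = list(merge(merged.get(group, []), centroids))
    return merged


def merge_by_segmented_sort(digests):
    """Flatten all centroids, then sort once by (group, mean)."""
    flat = [(g, m, w) for g, cs in digests for (m, w) in cs]
    flat.sort(key=lambda t: (t[0], t[1]))
    return {
        g: [(m, w) for _, m, w in rows]
        for g, rows in groupby(flat, key=lambda t: t[0])
    }


old_result = merge_one_by_one(digests)
new_result = merge_by_segmented_sort(digests)
```

Both functions produce the same per-group centroid order, but the second does it with a single sort over all centroids, which is the kind of operation that parallelizes well across many small groups.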

Authors:
  - https://github.com/nvdbaranec
  - Muhammad Haseeb (https://github.com/mhaseeb123)
  - GALI PREM SAGAR (https://github.com/galipremsagar)

Approvers:
  - Muhammad Haseeb (https://github.com/mhaseeb123)
  - Nghia Truong (https://github.com/ttnghia)
  - Mike Wilson (https://github.com/hyperbolic2346)

URL: rapidsai#16780
This PR displays deltas for CPU and GPU usage metrics that are extracted from `cudf.pandas` pytests.

Authors:
  - GALI PREM SAGAR (https://github.com/galipremsagar)

Approvers:
  - Jake Awe (https://github.com/AyodeAwe)

URL: rapidsai#16864
…apidsai#15979)

Part of rapidsai#15903.
1. Introduces the Compressed Sparse Row (CSR) format to store the adjacency information of the column tree.
2. Analogous to `reduce_to_column_tree`, `reduce_to_column_tree_csr` reduces the node tree representation to a column tree stored in CSR format.
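
As a rough illustration of what storing a tree's adjacency in CSR form means, the sketch below converts a parent-index array into row offsets and column indices. It is a host-side toy and not the code in this PR: the function names, the parent-array input, and the choice to store only child edges are assumptions.

```python
def tree_to_csr(parents):
    """Build CSR (row_offsets, col_indices) of child lists from a parent array.

    parents[i] is the parent of node i, or -1 for the root.
    """
    n = len(parents)

    # Count the children of each node.
    counts = [0] * n
    for p in parents:
        if p >= 0:
            counts[p] += 1

    # An exclusive prefix sum of the counts gives the start of each row.
    row_offsets = [0] * (n + 1)
    for i in range(n):
        row_offsets[i + 1] = row_offsets[i] + counts[i]

    # Scatter each child into the next free slot of its parent's row.
    col_indices = [0] * row_offsets[n]
    cursor = row_offsets[:n]  # slicing copies, so row_offsets is not modified
    for child, p in enumerate(parents):
        if p >= 0:
            col_indices[cursor[p]] = child
            cursor[p] += 1
    return row_offsets, col_indices


def children(row_offsets, col_indices, node):
    """Return the children of `node` as a slice of the CSR column indices."""
    return col_indices[row_offsets[node]:row_offsets[node + 1]]


# Node 0 is the root; nodes 1 and 2 are its children; 3 and 4 hang off 1; 5 off 2.
parents = [-1, 0, 0, 1, 1, 2]
row_offsets, col_indices = tree_to_csr(parents)
```

The two flat arrays replace per-node child lists, so looking up a node's children is a contiguous slice rather than a pointer chase.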

TODO:
- [x] Correctness test

Authors:
  - Shruti Shivakumar (https://github.com/shrshi)
  - Vukasin Milovanovic (https://github.com/vuule)
  - GALI PREM SAGAR (https://github.com/galipremsagar)

Approvers:
  - Robert (Bobby) Evans (https://github.com/revans2)
  - Vukasin Milovanovic (https://github.com/vuule)
  - Nghia Truong (https://github.com/ttnghia)
  - Karthikeyan (https://github.com/karthikeyann)
  - Kyle Edwards (https://github.com/KyleFromNVIDIA)

URL: rapidsai#15979