Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement Groupby pct_change #10444

Closed

Conversation

skirui-source
Copy link
Contributor

Subsequent to #9805, this PR adds support for Groupby.pct_change()
Fixes #9606

@skirui-source skirui-source added feature request New feature or request Python Affects Python cuDF API. non-breaking Non-breaking change labels Mar 16, 2022
@skirui-source skirui-source self-assigned this Mar 16, 2022
@codecov
Copy link

codecov bot commented Mar 16, 2022

Codecov Report

Merging #10444 (81332d4) into branch-22.06 (c6260eb) will decrease coverage by 0.03%.
The diff coverage is 11.11%.

❗ Current head 81332d4 differs from pull request most recent head 68fa3c7. Consider uploading reports for the commit 68fa3c7 to get more accurate results

@@               Coverage Diff                @@
##           branch-22.06   #10444      +/-   ##
================================================
- Coverage         86.33%   86.30%   -0.04%     
================================================
  Files               144      140       -4     
  Lines             22681    22309     -372     
================================================
- Hits              19581    19253     -328     
+ Misses             3100     3056      -44     
Impacted Files Coverage Δ
python/cudf/cudf/core/groupby/groupby.py 90.09% <11.11%> (-1.71%) ⬇️
python/cudf/cudf/utils/cudautils.py 59.83% <0.00%> (-5.91%) ⬇️
python/dask_cudf/dask_cudf/accessors.py 92.30% <0.00%> (-3.85%) ⬇️
python/cudf/cudf/utils/dtypes.py 82.30% <0.00%> (-1.99%) ⬇️
python/cudf/cudf/core/column/lists.py 90.49% <0.00%> (-1.21%) ⬇️
python/dask_cudf/dask_cudf/tests/test_accessor.py 98.41% <0.00%> (-0.79%) ⬇️
python/dask_cudf/dask_cudf/groupby.py 96.66% <0.00%> (-0.77%) ⬇️
python/cudf/cudf/io/dlpack.py 82.60% <0.00%> (-0.73%) ⬇️
python/cudf/cudf/core/column/column.py 89.30% <0.00%> (-0.47%) ⬇️
python/cudf/cudf/core/column/datetime.py 89.31% <0.00%> (-0.40%) ⬇️
... and 56 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 9742260...68fa3c7. Read the comment docs.

@skirui-source skirui-source changed the base branch from branch-22.04 to branch-22.06 March 30, 2022 00:42
@github-actions
Copy link

github-actions bot commented May 1, 2022

This PR has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this PR if it is no longer required. Otherwise, please respond with a comment indicating any updates. This PR will be labeled inactive-90d if there is no activity in the next 60 days.

skirui-source and others added 13 commits May 1, 2022 15:24
[gpuCI] Forward-merge branch-22.06 to branch-22.08 [skip gpuci]
[gpuCI] Forward-merge branch-22.06 to branch-22.08 [skip gpuci]
[gpuCI] Forward-merge branch-22.06 to branch-22.08 [skip gpuci]
[gpuCI] Forward-merge branch-22.06 to branch-22.08 [skip gpuci]
[gpuCI] Forward-merge branch-22.06 to branch-22.08 [skip gpuci]
[gpuCI] Forward-merge branch-22.06 to branch-22.08 [skip gpuci]
[gpuCI] Forward-merge branch-22.06 to branch-22.08 [skip gpuci]
[gpuCI] Forward-merge branch-22.06 to branch-22.08 [skip gpuci]
nvdbaranec and others added 18 commits June 21, 2022 21:36
…apidsai#11108)

String data that contained \r and \n characters would print badly mangled.  

Example column containing:
```
.,e,Infinity,+Infinity,-Infinity,+nAn,-naN,Nan,5f,1.2f,\riNf,NULL
```

Doing a `cudf::test::print()` on this would yield mostly garbage, but importantly, plausible-seeming but puzzling garbage:

```
iNf,NULLnity,+Infinity,-Infinity,+nAn,-naN,Nan,5f,1.2f,
```

The fix is to do a little postprocessing and replace the individual chars `'\r'` and `'\n'` with the escaped strings `"\r"` and `"\n"`.  Note that this only applies to the output ultimately sent to the print - not to the raw data retrieved from the device itself using `::to_host`

Authors:
  - https://github.com/nvdbaranec

Approvers:
  - David Wendt (https://github.com/davidwendt)
  - Mike Wilson (https://github.com/hyperbolic2346)
  - Bradley Dice (https://github.com/bdice)
  - Karthikeyan (https://github.com/karthikeyann)

URL: rapidsai#11108
Spent some time investigating performance on column to row and row to column conversion code. Lots of wins here. Split PR up, so this is column to row work only. Row to column to follow. Also, planning to add tests and benchmarks into `spark-rapids-jni` once this is all updated.

```
Comparing 22.08/ROW_CONVERSION_BENCH to mwilson/column_to_row_optimization/ROW_CONVERSION_BENCH
Benchmark                                                                               Time             CPU      Time Old      Time New       CPU Old       CPU New
--------------------------------------------------------------------------------------------------------------------------------------------------------------------
RowConversion/old_to_row_conversion/1048576/manual_time                              +0.0012         +0.0026             5             5             5             5
RowConversion/new_to_row_conversion/1048576/manual_time                              -0.5661         -0.5652            16             7            16             7
RowConversion/new_to_row_extended_conversion/1048576/manual_time                     -0.5680         -0.5655            16             7            16             7
RowConversion/string_to_row_extended_conversion/1048576/manual_time                  -0.5768         -0.5764            27            12            27            12
RowConversion/old_from_row_conversion/1048576/manual_time                            -0.0017         -0.0018             4             4             4             4
RowConversion/new_from_row_conversion/1048576/manual_time                            +1.1629         +1.1619            31            68            32            68
RowConversion/new_from_row_extended_conversion/1048576/manual_time                   +1.1540         +1.1549            31            68            32            68
RowConversion/string_from_row_extended_conversion/1048576/manual_time                +0.7185         +0.6831            39            67            40            68

Authors:
  - Mike Wilson (https://github.com/hyperbolic2346)

Approvers:
  - Nghia Truong (https://github.com/ttnghia)
  - https://github.com/nvdbaranec
  - MithunR (https://github.com/mythrocks)

URL: rapidsai#11063
This PR removes the outdated files `.dockerignore` and `MANIFEST.in`. The Dockerfile was removed in rapidsai#10069 so we don't need to keep the ignore file around anymore. The `MANIFEST.in` is very old (last updated in 2018, cad899b) and no longer correct for the current package structure.

Finally, I moved `.clang-tidy` into the `cpp/` directory. @codereport said that would be fine to clean up from the root and have no negative effects since clang-tidy isn't being run yet.

Authors:
  - Bradley Dice (https://github.com/bdice)

Approvers:
  - Karthikeyan (https://github.com/karthikeyann)
  - AJ Schmidt (https://github.com/ajschmidt8)

URL: rapidsai#11124
…fault_stream_value (rapidsai#11082)

Closes rapidsai#10864.
Also closes rapidsai#9614.

This PR is a follow-up to rapidsai#10877.  It replaces all of the remaining instances of `rmm::cuda_stream_default` with `cudf::default_stream_value`.

There are a lot of replacements and addition of includes, along with some reformatting due to clang-format, but like rapidsai#10877, there should be no noticeable functional change here.

Authors:
  - Jim Brennan (https://github.com/jbrennan333)

Approvers:
  - Robert (Bobby) Evans (https://github.com/revans2)
  - https://github.com/nvdbaranec
  - Nghia Truong (https://github.com/ttnghia)
  - Karthikeyan (https://github.com/karthikeyann)

URL: rapidsai#11082
…sai#11132)

The developer guide references a `scalar_view` class that does not exist. This PR removes that reference.

See rapidsai#6558 for the rationale of why no such class exists.

Authors:
  - Bradley Dice (https://github.com/bdice)

Approvers:
  - Nghia Truong (https://github.com/ttnghia)
  - https://github.com/nvdbaranec

URL: rapidsai#11132
Needed for Sphinx 5 compatibility. Should fix the following warning occurring in doc builds.

```
Invalid configuration value found: 'language = None'. Update your configuration to a valid langauge code. Falling back to 'en' (English).
```

xref: sphinx-doc/sphinx#10481

Authors:
  - GALI PREM SAGAR (https://github.com/galipremsagar)

Approvers:
  - Bradley Dice (https://github.com/bdice)
  - https://github.com/jakirkham
  - Ray Douglass (https://github.com/raydouglass)

URL: rapidsai#11133
This adds `duplicate_keep_option` to `cudf::distinct`, allowing to specify a `keep` option for selecting which of the duplicate elements to keep. It paves the way for many drop duplicate applications to achieve `O(n)` performance.

A `KEEP_ANY` option is also added to `duplicate_keep_option`, which was an attempt in rapidsai#9417 but didn't get in eventually.

Partially addresses rapidsai#11050 and rapidsai#11053.

----
Main implementation: https://github.com/rapidsai/cudf/pull/11052/files#diff-4c2d4268b3c50000ae845ba15a890bb743709c30e5cab4847af7ad633c5a2823R47

Follow up work:
 * rapidsai#11053
 * rapidsai#11089
 * rapidsai#11092

Authors:
  - Nghia Truong (https://github.com/ttnghia)

Approvers:
  - Vyas Ramasubramani (https://github.com/vyasr)
  - Yunsong Wang (https://github.com/PointKernel)
  - Bradley Dice (https://github.com/bdice)

URL: rapidsai#11052
This PR resolves failures related to `__cuda_array_interface__` which were not being run for a very long time.

Authors:
  - GALI PREM SAGAR (https://github.com/galipremsagar)

Approvers:
  - Ashwin Srinath (https://github.com/shwina)

URL: rapidsai#11113
This adds `nan_equality` parameter to `cudf::distinct`, allowing to specify the desired behavior when dealing with floating-point data: `NaN` should be compared equally to other `NaN` or not.

Depends on rapidsai#11052 (built on top of it).
Closes rapidsai#11092.
This is a blocker for set-like operations (rapidsai#11043) and also the last blocker for rapidsai#11053.

Authors:
  - Nghia Truong (https://github.com/ttnghia)

Approvers:
  - Bradley Dice (https://github.com/bdice)
  - Yunsong Wang (https://github.com/PointKernel)

URL: rapidsai#11118
This is just a not-very-simple refactor to `lists::contains`:
 * Remove some wrong lines in doxygen of `lists/contains.hpp`, and rewrite doxygen there a little bit.
 * Add more comments to the code.
 * Reduce the number of code paths of the template struct functors.
 * Rename some/many variables, and reorganize code to make it cleaner.

No new feature is added in this PR, just modifying the existing functions and moving things around.

This PR is extracted from the bigger PR for easier review. The original PR is rapidsai#10548 for supporting nested type in `lists::contains`. As such, this blocks it.

Authors:
  - Nghia Truong (https://github.com/ttnghia)
  - Bradley Dice (https://github.com/bdice)

Approvers:
  - Bradley Dice (https://github.com/bdice)
  - Karthikeyan (https://github.com/karthikeyann)

URL: rapidsai#11019
Closes rapidsai#10862.

The `PER_THREAD_DEFAULT_STREAM` build option was deprecated in branch-22.06 via rapidsai#10877, and replaced with the new build option `CUDF_USE_PER_THREAD_DEFAULT_STREAM`.

This PR removes `PER_THREAD_DEFAULT_STREAM`.

I am putting this up as a draft because I am not certain if we want to do this in 22.08 or wait for a later release?

Authors:
  - Jim Brennan (https://github.com/jbrennan333)

Approvers:
  - Jason Lowe (https://github.com/jlowe)
  - Vyas Ramasubramani (https://github.com/vyasr)

URL: rapidsai#11134
This ensures that a consistent cuco version is used across all of RAPIDS. See rapidsai/rapids-cmake#201.

Authors:
  - Vyas Ramasubramani (https://github.com/vyasr)

Approvers:
  - Robert Maynard (https://github.com/robertmaynard)

URL: rapidsai#11139
…11138)

We can use `rapids_cpm_libcudacxx` directly without this extra function. The only value it adds is setting the `LIBCUDACXX_INCLUDE_DIR` variable, which we don't use. See also rapidsai/cuspatial#538 (comment)

Authors:
  - Vyas Ramasubramani (https://github.com/vyasr)

Approvers:
  - Mark Harris (https://github.com/harrism)
  - Robert Maynard (https://github.com/robertmaynard)

URL: rapidsai#11138
…rapidsai#11128)

This PR adds a Java method to find out if a Map column contains a list of keys

Authors:
  - Raza Jafri (https://github.com/razajafri)

Approvers:
  - Nghia Truong (https://github.com/ttnghia)
  - Jason Lowe (https://github.com/jlowe)

URL: rapidsai#11128
This PR introduces a small new utility function `_from_data_like_self` to provide an additional hook for `Frame` and `IndexedFrame` to customize `_from_data`. Specifically in instances where the outputs need to share the index of the input for `IndexedFrame` methods, this hook provides the necessary entrypoint. This method is analogous to the already existing `_from_columns_like_self` method.

In the past I made efforts to consolidate our factory methods, and I'm not super happy with proliferating more of them again. However, I think in this case it is worth the tradeoff since this gets us very close to our goal of removing the `_index` attribute from the `Frame` class. Moreover, in the intermediate term I expect both this method and `_from_data` to be subsumed by `_from_columns` when we eventually transition all of our internals (especially the `ColumnAccessor`) to store a list rather than a dict of columns.

Authors:
  - Vyas Ramasubramani (https://github.com/vyasr)

Approvers:
  - Bradley Dice (https://github.com/bdice)

URL: rapidsai#11140
This PR is a follow-up to rapidsai#11073. Its sole purpose is to remove the deprecated public `replace` API for Index objects. A stripped down, minimal version of this logic has been inlined in `DataFrame.rename`, which is the only place that needs that functionality. The scope of this PR is intentionally limited so that further changes can happen in parallel with changes associated with the removal of `Frame._index`, which should no longer be necessary after this PR is merged (aside from its potential usage in internal Cython code that can be freely worked around).

Authors:
  - Vyas Ramasubramani (https://github.com/vyasr)

Approvers:
  - Ashwin Srinath (https://github.com/shwina)

URL: rapidsai#11131
@github-actions github-actions bot added CMake CMake build issue conda Java Affects Java cuDF API. libcudf Affects libcudf (C++/CUDA) code. labels Jun 24, 2022
@skirui-source
Copy link
Contributor Author

skirui-source commented Jun 24, 2022

Closing PR due to a bad merge. Replaced by #11144

@skirui-source skirui-source deleted the groupby.pct_change branch June 24, 2022 02:45
rapids-bot bot pushed a commit that referenced this pull request Jul 18, 2022
Subsequent to #9805, this PR adds support for Groupby.pct_change()

Fixes #9606
Replaces #10444

Authors:
  - Sheilah Kirui (https://github.com/skirui-source)

Approvers:
  - Ashwin Srinath (https://github.com/shwina)

URL: #11144
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CMake CMake build issue feature request New feature or request Java Affects Java cuDF API. libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change Python Affects Python cuDF API.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[FEA] Groupby pct_change