Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[QST] Is Series.value_counts supposed to behave differently between Pandas and CuDF ? #16044

Open
37khalil opened this issue Jun 15, 2024 · 3 comments
Labels
0 - Backlog In queue waiting for assignment bug Something isn't working cudf.pandas Issues specific to cudf.pandas doc Documentation Python Affects Python cuDF API.

Comments

@37khalil
Copy link

From reading the value_counts() docs, i gathered that the function would be working exactly the same as the pandas implementation, but upon using it, it became clear that the output is different in some cases, specifically when using the bins parameter:

  • Input data is exactly the same between the 2 runs (Pandas and CuDF), which is a Series of 2734 numbers
cudf_data = cudf.from_pandas(pandas_data)
  • When running the function in pandas the result is the correct and desired output (counting through 5 intervals between the min and max)
pandas_data.value_counts(bins=5, sort=False)
(-2038.538, 407707.479]       1493
(407707.479, 815414.958]      1240
(815414.958, 1223122.436]        0
(1223122.436, 1630829.915]       0
(1630829.915, 2038537.394]       1
Name: count, dtype: int64
  • But when running the same data in CuDF it seems that the function doesn't guarantee the desired number of bins to be in the output
cudf_data.value_counts(bins=5, sort=False)
(1630829.915, 2038537.394]       1
(407707.479, 815414.958]      1240
(-2038.538, 407707.479]       1493
Name: count, dtype: int64
  1. From comparing the outputs, it looks like the CuDF implementation ignores the intervals where the count is 0.
  2. The order of the output doesn't seem to be the same even though i've set sort=False and the input data is the same in order in both cases.

Is this by design, or should the 2 implementations produce the same output ?

@37khalil 37khalil added the question Further information is requested label Jun 15, 2024
@lithomas1 lithomas1 added 0 - Backlog In queue waiting for assignment bug Something isn't working Python Affects Python cuDF API. doc Documentation cudf.pandas Issues specific to cudf.pandas and removed question Further information is requested labels Jun 17, 2024
@lithomas1
Copy link
Contributor

I think issue 1 is a bug in cudf (or cudf.pandas at least).

I think the second bullet is not a bug in cudf, since when sort=False, the order is not guaranteed to be the same as in pandas.
This is something that can be better documented though.

@vyasr
Copy link
Contributor

vyasr commented Jun 25, 2024

Under the hood value_counts is basically a groupby-count. The reason for issue 1 is that cudf does not support the observed parameter (tracked in #15173). The underlying implementation can be studied with the below:

In [25]: arr = cp.random.rand(10000) * 100
In [26]: arr[arr < 50] = 0
In [27]: s = cudf.Series(arr)
In [28]: series_bins = cudf.cut(s, 11)
In [29]: s.groupby(series_bins).count()
Out[29]: 
(-0.1, 9.091]       5021
(90.906, 99.997]     922
(45.453, 54.544]     416
(72.725, 81.816]     881
(81.816, 90.906]     928
(54.544, 63.634]     917
(63.634, 72.725]     915
dtype: int64
In [30]: s.to_pandas().groupby(series_bins.to_pandas()).count()
<ipython-input-30-2f3e95bd02d9>:1: FutureWarning: The default of observed=False is deprecated and will be changed to True in a future version of pandas. Pass observed=False to retain current behavior or observed=True to adopt the future default and silence this warning.
  s.to_pandas().groupby(series_bins.to_pandas()).count()
Out[30]: 
(-0.1, 9.091]       5021
(9.091, 18.181]        0
(18.181, 27.272]       0
(27.272, 36.362]       0
(36.362, 45.453]       0
(45.453, 54.544]     416
(54.544, 63.634]     917
(63.634, 72.725]     915
(72.725, 81.816]     881
(81.816, 90.906]     928
(90.906, 99.997]     922
dtype: int64

The pandas groupby default behavior is going to change to observed=True in a future version, which may also result in a change in the behavior of their value_counts implementation, but that remains to be seen. In the interim, if #15173 is resolved we could pass the parameter down to groupby from value_counts to match the pandas behavior.

@vyasr
Copy link
Contributor

vyasr commented Jun 25, 2024

Also yes Thomas is correct that we shouldn't expect pandas and cudf to produce identical ordering when sort=False. With cudf.pandas we may eventually try to change that using a compatibility mode, but that will require addressing #12236.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
0 - Backlog In queue waiting for assignment bug Something isn't working cudf.pandas Issues specific to cudf.pandas doc Documentation Python Affects Python cuDF API.
Projects
Status: Todo
Development

No branches or pull requests

3 participants