[QST] Is Series.value_counts supposed to behave differently between Pandas and CuDF ? #16044

37khalil · 2024-06-15T16:01:45Z

From reading the value_counts() docs, i gathered that the function would be working exactly the same as the pandas implementation, but upon using it, it became clear that the output is different in some cases, specifically when using the bins parameter:

Input data is exactly the same between the 2 runs (Pandas and CuDF), which is a Series of 2734 numbers

cudf_data = cudf.from_pandas(pandas_data)

When running the function in pandas the result is the correct and desired output (counting through 5 intervals between the min and max)

pandas_data.value_counts(bins=5, sort=False)

(-2038.538, 407707.479]       1493
(407707.479, 815414.958]      1240
(815414.958, 1223122.436]        0
(1223122.436, 1630829.915]       0
(1630829.915, 2038537.394]       1
Name: count, dtype: int64

But when running the same data in CuDF it seems that the function doesn't guarantee the desired number of bins to be in the output

cudf_data.value_counts(bins=5, sort=False)

(1630829.915, 2038537.394]       1
(407707.479, 815414.958]      1240
(-2038.538, 407707.479]       1493
Name: count, dtype: int64

From comparing the outputs, it looks like the CuDF implementation ignores the intervals where the count is 0.
The order of the output doesn't seem to be the same even though i've set sort=False and the input data is the same in order in both cases.

Is this by design, or should the 2 implementations produce the same output ?

The text was updated successfully, but these errors were encountered:

lithomas1 · 2024-06-17T17:42:58Z

I think issue 1 is a bug in cudf (or cudf.pandas at least).

I think the second bullet is not a bug in cudf, since when sort=False, the order is not guaranteed to be the same as in pandas.
This is something that can be better documented though.

vyasr · 2024-06-25T20:42:21Z

Under the hood value_counts is basically a groupby-count. The reason for issue 1 is that cudf does not support the observed parameter (tracked in #15173). The underlying implementation can be studied with the below:

In [25]: arr = cp.random.rand(10000) * 100
In [26]: arr[arr < 50] = 0
In [27]: s = cudf.Series(arr)
In [28]: series_bins = cudf.cut(s, 11)
In [29]: s.groupby(series_bins).count()
Out[29]: 
(-0.1, 9.091]       5021
(90.906, 99.997]     922
(45.453, 54.544]     416
(72.725, 81.816]     881
(81.816, 90.906]     928
(54.544, 63.634]     917
(63.634, 72.725]     915
dtype: int64
In [30]: s.to_pandas().groupby(series_bins.to_pandas()).count()
<ipython-input-30-2f3e95bd02d9>:1: FutureWarning: The default of observed=False is deprecated and will be changed to True in a future version of pandas. Pass observed=False to retain current behavior or observed=True to adopt the future default and silence this warning.
  s.to_pandas().groupby(series_bins.to_pandas()).count()
Out[30]: 
(-0.1, 9.091]       5021
(9.091, 18.181]        0
(18.181, 27.272]       0
(27.272, 36.362]       0
(36.362, 45.453]       0
(45.453, 54.544]     416
(54.544, 63.634]     917
(63.634, 72.725]     915
(72.725, 81.816]     881
(81.816, 90.906]     928
(90.906, 99.997]     922
dtype: int64

The pandas groupby default behavior is going to change to observed=True in a future version, which may also result in a change in the behavior of their value_counts implementation, but that remains to be seen. In the interim, if #15173 is resolved we could pass the parameter down to groupby from value_counts to match the pandas behavior.

vyasr · 2024-06-25T20:45:02Z

Also yes Thomas is correct that we shouldn't expect pandas and cudf to produce identical ordering when sort=False. With cudf.pandas we may eventually try to change that using a compatibility mode, but that will require addressing #12236.

37khalil added the question Further information is requested label Jun 15, 2024

lithomas1 added 0 - Backlog In queue waiting for assignment bug Something isn't working Python Affects Python cuDF API. doc Documentation cudf.pandas Issues specific to cudf.pandas and removed question Further information is requested labels Jun 17, 2024

github-project-automation bot added this to cuDF Python Jun 17, 2024

github-project-automation bot moved this to Todo in cuDF Python Jun 17, 2024

vyasr mentioned this issue Jun 25, 2024

[FEA] Implement observed parameter for groupby #15173

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[QST] Is Series.value_counts supposed to behave differently between Pandas and CuDF ? #16044

[QST] Is Series.value_counts supposed to behave differently between Pandas and CuDF ? #16044

37khalil commented Jun 15, 2024

lithomas1 commented Jun 17, 2024

vyasr commented Jun 25, 2024

vyasr commented Jun 25, 2024

[QST] Is Series.value_counts supposed to behave differently between Pandas and CuDF ? #16044

[QST] Is Series.value_counts supposed to behave differently between Pandas and CuDF ? #16044

Comments

37khalil commented Jun 15, 2024

lithomas1 commented Jun 17, 2024

vyasr commented Jun 25, 2024

vyasr commented Jun 25, 2024