[QST] Chunksize Performance Test

There have been a number of discussions in other issues around chunk size and its performance implications; I am raising this issue as an example to help track and fix it.

Issues/Discussion links:

- repartition failing on multiple-workers #2321
- repartition failing on multiple-workers #2321 (comment)

The example workflow I am showing here has 3 steps:

1. Get the top 10k values across the categorical columns in the dataset (value_counts)
2. Fill NA (fillna)
3. Categorization (nvcategory)

Nbviewer Link
Gist Link
Function details:

Get the top 10k values across the categorical columns in the dataset:
```python
import cudf

def get_value_counts_1_col_at_a_time(df, col_name_ls, threshold, client):
    """Return the top `threshold` values for the columns in col_name_ls."""
    count_d = {}
    # the per-column value counts could run in parallel; kept serial here to keep the example cleaner
    for col in col_name_ls:
        top_value_series = df[col].value_counts().head(threshold, npartitions=-1)
        count_d[col + '_counts'] = cudf.Series(top_value_series.index)
    return count_d
```
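For context, here is a minimal sketch of how this helper might be driven from a dask-cuda cluster. The cluster setup, file path, column names, partition count, and threshold below are assumptions for illustration, not taken from the linked notebook.

```python
from dask.distributed import Client
from dask_cuda import LocalCUDACluster
import dask_cudf

# Hypothetical setup: one worker per GPU on a single node.
client = Client(LocalCUDACluster())

# Hypothetical dataset; the effective chunk size is controlled here by the
# number of partitions (fewer partitions -> larger chunks per task).
df = dask_cudf.read_parquet("dataset.parquet")
df = df.repartition(npartitions=16).persist()

cat_cols = ["cat_0", "cat_1", "cat_2"]  # placeholder column names
count_d = get_value_counts_1_col_at_a_time(df, cat_cols, 10_000, client)
```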
Fill NA:

```python
df = df.fillna(-1)
```
Categorization:

```python
import cudf
import numpy as np
import nvcategory

def cat_col_nvcat(num_s, encoding_key_sr):
    """Cast a numerical column to Categorical.
    Uses the indexes of encoding_key_sr; anything not in the index is encoded to -1.
    """
    from librmm_cffi import librmm
    cat = nvcategory.from_numbers(num_s.data.mem).set_keys(encoding_key_sr.data.mem)
    device_array = librmm.device_array(num_s.data.size, dtype=np.int32)
    cat.values(devptr=device_array.device_ctypes_pointer.value)
    return cudf.Series(device_array)

def cat_nvcat(df, cat_col_names, count_d):
    """Encode the categorical columns from int -> int with nvcategory,
    using the top values up to value_threshold.
    """
    for col in cat_col_names:
        # use the nvcategory encoding helper for this column
        cat_top_values = count_d[col + '_counts']
        df[col] = cat_col_nvcat(df[col], cat_top_values)
    return df
```
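To tie the three steps together, a hedged sketch of how the remaining steps might be chained on the dask_cudf DataFrame; `df`, `cat_cols`, and `count_d` are the placeholder names from the sketch above, and in practice `map_partitions` may need an explicit `meta` argument.

```python
# Step 2: fill missing values lazily across all partitions.
df = df.fillna(-1)

# Step 3: apply the nvcategory encoding partition by partition.
# map_partitions passes each cudf partition as the first argument to cat_nvcat.
df = df.map_partitions(cat_nvcat, cat_cols, count_d)

df = df.persist()
```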
Scale Down Experiments on cudf:
| Length ratio | Number of rows | Value count time | Fill NA time | Categorization time | Total time |
| --- | --- | --- | --- | --- | --- |
| 1 | 32,737,500 | 1.766069174 | 0.8439986706 | 3.982206821 | 6.592274666 |
| 2 | 16,368,750 | 1.363888025 | 0.5763037205 | 2.134787321 | 4.074979067 |
| 4 | 8,184,375 | 0.7610530853 | 0.377799511 | 1.265794754 | 2.40464735 |
| 8 | 4,092,187 | 0.8306720257 | 0.1273105145 | 0.570663929 | 1.528646469 |
| 16 | 2,046,093 | 0.6588602066 | 0.252651453 | 0.3722097874 | 1.283721447 |
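For reference, a rough sketch of the kind of single-GPU loop that could produce a table like the one above; `gdf` (a cudf DataFrame already in GPU memory), `cat_cols`, and the 10k threshold are assumptions, and the exact methodology of the original experiment may differ.

```python
import time

import cudf

# `gdf` is a hypothetical cudf DataFrame already loaded on the GPU and
# `cat_cols` a hypothetical list of its categorical column names.
for ratio in [1, 2, 4, 8, 16]:
    sub = gdf.head(len(gdf) // ratio)  # scale the row count down by `ratio`

    t0 = time.perf_counter()
    count_d = {c + '_counts': cudf.Series(sub[c].value_counts().head(10_000).index)
               for c in cat_cols}
    t1 = time.perf_counter()
    sub = sub.fillna(-1)
    t2 = time.perf_counter()
    sub = cat_nvcat(sub, cat_cols, count_d)
    t3 = time.perf_counter()

    print(ratio, len(sub), t1 - t0, t2 - t1, t3 - t2, t3 - t0)
```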
Current Performance Guess:
On preliminary exploration, cudf functions are not scaling down as they should, but we are also not getting enough parallelism with dask. My guess is that when doing aggregate functions like value_counts we have to incur a communication + aggregation cost, which appears to be non-negligible.

I am hoping communication will become much better with UCX, at least on single-node setups, but for the time being chunk size does appear to have an impact on performance.
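To make the communication + aggregation guess a bit more concrete, here is a small, hedged way to look at how much extra work a distributed value_counts adds on top of the per-partition counts; `df` and the column name are placeholders from the sketches above.

```python
# Each partition computes a local value_counts, and the partial results are
# then combined across partitions; those combine steps are where the
# communication and aggregation overhead shows up.
counts = df["cat_0"].value_counts()  # lazy dask graph, nothing computed yet

print("tasks in graph:", len(dict(counts.__dask_graph__())))

# Optional: render the graph (requires graphviz) to see the combine layers
# sitting on top of the per-partition counts.
counts.visualize(filename="value_counts_graph.png")
```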
Experiments still to run:

- [FEA] Better CUDF/Nvstrings Spill over to Disk/Memory dask-cuda#99 (comment)
- Out of Memory Sort Fails even with Spill over dask-cuda#57 (comment)

CC: @mrocklin @pentschev @randerzander

@VibhuJawa I think something else here is that if the chunksize is 555MB, with 20 columns that means there's ~28MB per column, which sounds very small for a GPU. If there are 40 columns, then it's ~14MB per column, which is even worse.
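As a quick sanity check of the per-column numbers above (simple arithmetic, not a measurement):

```python
chunk_size_mb = 555
for n_cols in (20, 40):
    print(n_cols, "columns ->", round(chunk_size_mb / n_cols, 2), "MB per column")
# 20 columns -> 27.75 MB per column
# 40 columns -> 13.88 MB per column
```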