
[QST] Chunksize Performance Test #2498

Closed

VibhuJawa opened this issue Aug 7, 2019 · 2 comments
Labels
dask (Dask issue) · Python (Affects Python cuDF API.) · question (Further information is requested)

Comments

VibhuJawa (Member) commented Aug 7, 2019

[QST] Chunksize Performance Test

There have been a number of discussions across issues about chunk size and its performance implications. I am raising this issue as a concrete example to help track and fix it.

Issues/Discussion link:

The example workflow I am showing here has 3 steps:

  • Get the top 10k values across the categorical columns (20/40 columns) in the dataset.
  • Fill NA
  • Categorization of each column against its top 10k values using nvcategory

Nbviewer Link
Gist Link

Chunksize Performance Table:

| Total partitions | Partitions per worker | Chunksize (MiB) | Value count time (s) | Fill NA time (s) | Categorization time (s) | Total time (s) |
| ---: | ---: | ---: | ---: | ---: | ---: | ---: |
| 8 | 1 | 4440 | 5.77 | 1.45 | 5.11 | 12.33 |
| 16 | 2 | 2220 | 8.07 | 2.72 | 6.41 | 17.19 |
| 32 | 4 | 1110 | 9.27 | 3.27 | 9.88 | 22.43 |
| 64 | 8 | 555 | 14.79 | 6.19 | 19.00 | 39.98 |
| 136 | 17 | 277 | 33.37 | 13.56 | 39.72 | 86.65 |
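
For context, a hedged sketch of how the partition counts in this table can be produced; the input path is a placeholder, and the notebook may instead size partitions at read time rather than via repartition:

```python
import dask_cudf

# Placeholder path; the real data comes from the linked gist/notebook.
ddf = dask_cudf.read_csv("data/*.csv")

# Fewer, larger partitions mean fewer tasks but more GPU memory per task.
# Assuming the read produced at least 136 partitions, repartitioning down
# to each target count reproduces one row of the table per iteration.
for npartitions in (8, 16, 32, 64, 136):
    trial = ddf.repartition(npartitions=npartitions).persist()
    # ... time the three steps below on `trial` and record the results ...
```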

Function details (an end-to-end usage sketch follows the three snippets):

  • Get the top 10k values across categorical columns in the dataset

```python
import cudf


def get_value_counts_1_col_at_a_time(df, col_name_ls, threshold, client):
    """
    Return the top `threshold` values for each column in col_name_ls.
    """
    count_d = {}
    # The value count for each column could be computed in parallel,
    # but it is done serially here to keep the example cleaner.
    for col in col_name_ls:
        # head(..., npartitions=-1) gathers the top rows across all partitions
        top_value_series = df[col].value_counts().head(threshold, npartitions=-1)
        count_d[col + '_counts'] = cudf.Series(top_value_series.index)

    return count_d
```

  • Fill NA

```python
# Replace missing values with the sentinel -1 across the whole frame
df = df.fillna(-1)
```

  • Categorization

```python
import numpy as np

import cudf
import nvcategory


def cat_col_nvcat(num_s, encoding_key_sr):
    """
    Cast a numerical column to categorical codes.
    Uses the values of encoding_key_sr as the key set;
    anything not in the key set is encoded to -1.
    """
    from librmm_cffi import librmm

    # Build a category from the column, keyed on the top values
    cat = nvcategory.from_numbers(num_s.data.mem).set_keys(encoding_key_sr.data.mem)
    # Write the integer codes into an RMM-managed device array
    device_array = librmm.device_array(num_s.data.size, dtype=np.int32)
    cat.values(devptr=device_array.device_ctypes_pointer.value)

    return cudf.Series(device_array)


def cat_nvcat(df, cat_col_names, count_d):
    """
    This function uses nvcategory for categorization:
    it re-encodes the categorical columns from int -> int
    using the top values up to the value threshold.
    """
    for col in cat_col_names:
        # nvcategory does the encoding for this column
        cat_top_values = count_d[col + '_counts']
        df[col] = cat_col_nvcat(df[col], cat_top_values)

    return df
```
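
Putting the three pieces together: the driver below is a hypothetical end-to-end sketch rather than the notebook's exact code. The column names, threshold, and dask.distributed client are placeholders, and map_partitions is an assumed (but standard) way to run cat_nvcat once per partition:

```python
# Placeholder names standing in for the 20/40 categorical columns.
cat_cols = [f"cat_{i}" for i in range(20)]

# Step 1: top-10k values per column; count_d maps '<col>_counts' to a
# cudf.Series holding that column's most frequent values.
count_d = get_value_counts_1_col_at_a_time(ddf, cat_cols, 10_000, client)

# Step 2: fill missing values with the -1 sentinel (lazy on the dask frame).
ddf = ddf.fillna(-1)

# Step 3: run the nvcategory encoding once per partition.
ddf = ddf.map_partitions(cat_nvcat, cat_cols, count_d)
ddf = ddf.persist()  # materialize so each step's wall time can be measured
```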

Scale Down Experiments on cudf:

| Length ratio | Number of rows | Value count time (s) | Fill NA time (s) | Categorization time (s) | Total time (s) |
| ---: | ---: | ---: | ---: | ---: | ---: |
| 1 | 32,737,500 | 1.77 | 0.84 | 3.98 | 6.59 |
| 2 | 16,368,750 | 1.36 | 0.58 | 2.13 | 4.07 |
| 4 | 8,184,375 | 0.76 | 0.38 | 1.27 | 2.40 |
| 8 | 4,092,187 | 0.83 | 0.13 | 0.57 | 1.53 |
| 16 | 2,046,093 | 0.66 | 0.25 | 0.37 | 1.28 |
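
A minimal sketch of how such a scale-down measurement can be reproduced on a single GPU; the file path, slicing, and timing details are assumptions rather than the notebook's exact code (cat_cols as in the driver sketch above):

```python
import time

import cudf

gdf = cudf.read_csv("data/full.csv")  # placeholder path; one in-memory frame

for ratio in (1, 2, 4, 8, 16):
    part = gdf.head(len(gdf) // ratio)  # take 1/ratio of the rows

    t0 = time.perf_counter()
    counts = {c: part[c].value_counts().head(10_000) for c in cat_cols}
    t1 = time.perf_counter()
    part = part.fillna(-1)
    t2 = time.perf_counter()

    print(ratio, len(part), t1 - t0, t2 - t1)
```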

Current Performance Guess:

From preliminary exploration, the cuDF functions are not scaling down as they should, even though we do have enough parallelism with dask. My guess is that aggregate functions like value_counts incur a communication + aggregation cost which appears to be non-negligible.

I am hoping communication will become much better, at least on single-node setups, with UCX, but for the time being chunk size does appear to have an impact on performance.
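
One way to make the aggregation-cost guess concrete (the column name is a placeholder): the task graph behind value_counts grows with the partition count, and every extra task adds scheduling and, potentially, communication work:

```python
# Rough illustration: more partitions -> a larger tree reduction behind
# value_counts -> more tasks to schedule and more intermediate transfers.
for npartitions in (8, 32, 136):
    graph = ddf.repartition(npartitions=npartitions)["cat_0"].value_counts().__dask_graph__()
    print(npartitions, len(graph))
```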

Experiments still to run:

CC: @mrocklin @pentschev @randerzander

@VibhuJawa added the Needs Triage (Need team to review and classify) and bug (Something isn't working) labels on Aug 7, 2019
@kkraus14 added the cuIO (cuIO issue) and question (Further information is requested) labels and removed the Needs Triage and bug labels on Aug 15, 2019
@kkraus14 added the Python (Affects Python cuDF API.) and dask (Dask issue) labels and removed the cuIO label on Aug 15, 2019
kkraus14 (Collaborator) commented:

@VibhuJawa I think something else here is that if the chunksize is 555 MB, with 20 columns that means there's ~28 MB per column, which sounds very small for a GPU. If there are 40 columns, then it's ~14 MB per column, which is even worse.
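
Spelling that arithmetic out (a trivial check, not from the original comment):

```python
chunksize_mib = 555
for ncols in (20, 40):
    # ~27.8 MiB per column with 20 columns, ~13.9 MiB with 40
    print(ncols, chunksize_mib / ncols)
```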

kkraus14 (Collaborator) commented:

Closing as this is stale and not an issue.
