
[QST] Chunksize Performance Test #2498

Closed

VibhuJawa opened this issue Aug 7, 2019 · 2 comments
Labels
dask (Dask issue) · Python (Affects Python cuDF API.) · question (Further information is requested)

Comments

VibhuJawa (Member) commented Aug 7, 2019

[QST] Chunksize Performance Test

There have been a number of discussions across issues about chunk size and its performance implications. I am raising this issue as a concrete example to help track and fix it.

Issues/Discussion link:

The example workflow I am showing here has 3 steps:

  • Get the top 10k values across the categorical columns (20/40 columns) in the dataset.
  • Fill NA
  • Categorization of each column against its top 10k values using nvcategory

Nbviewer Link
Gist Link

Chunksize Performance Table:

| Total partitions | Partitions per worker | Chunksize (MiB) | Value count time (s) | Fill NA time (s) | Categorization time (s) | Total time (s) |
| ---: | ---: | ---: | ---: | ---: | ---: | ---: |
| 8 | 1 | 4440 | 5.77 | 1.45 | 5.11 | 12.33 |
| 16 | 2 | 2220 | 8.07 | 2.72 | 6.41 | 17.19 |
| 32 | 4 | 1110 | 9.27 | 3.27 | 9.88 | 22.43 |
| 64 | 8 | 555 | 14.79 | 6.19 | 19.00 | 39.98 |
| 136 | 17 | 277 | 33.37 | 13.56 | 39.72 | 86.65 |
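
For context, a hedged sketch of how the partition counts in this table can be produced; the input path is a placeholder, and the notebook may instead size partitions at read time rather than via repartition:

```python
import dask_cudf

# Placeholder path; the real data comes from the linked gist/notebook.
ddf = dask_cudf.read_csv("data/*.csv")

# Fewer, larger partitions mean fewer tasks but more GPU memory per task.
# Assuming the read produced at least 136 partitions, repartitioning down
# to each target count reproduces one row of the table per iteration.
for npartitions in (8, 16, 32, 64, 136):
    trial = ddf.repartition(npartitions=npartitions).persist()
    # ... time the three steps below on `trial` and record the results ...
```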

Function details (an end-to-end usage sketch follows the three snippets):

  • Get the top 10k values across categorical columns in the dataset

```python
import cudf


def get_value_counts_1_col_at_a_time(df, col_name_ls, threshold, client):
    """
    Return the top `threshold` values for each column in col_name_ls.
    """
    count_d = {}
    # The value count for each column could be computed in parallel,
    # but it is done serially here to keep the example cleaner.
    for col in col_name_ls:
        # head(..., npartitions=-1) gathers the top rows across all partitions
        top_value_series = df[col].value_counts().head(threshold, npartitions=-1)
        count_d[col + '_counts'] = cudf.Series(top_value_series.index)

    return count_d
```

  • Fill NA

```python
# Replace missing values with the sentinel -1 across the whole frame
df = df.fillna(-1)
```

  • Categorization

```python
import numpy as np

import cudf
import nvcategory


def cat_col_nvcat(num_s, encoding_key_sr):
    """
    Cast a numerical column to categorical codes.
    Uses the values of encoding_key_sr as the key set;
    anything not in the key set is encoded to -1.
    """
    from librmm_cffi import librmm

    # Build a category from the column, keyed on the top values
    cat = nvcategory.from_numbers(num_s.data.mem).set_keys(encoding_key_sr.data.mem)
    # Write the integer codes into an RMM-managed device array
    device_array = librmm.device_array(num_s.data.size, dtype=np.int32)
    cat.values(devptr=device_array.device_ctypes_pointer.value)

    return cudf.Series(device_array)


def cat_nvcat(df, cat_col_names, count_d):
    """
    This function uses nvcategory for categorization:
    it re-encodes the categorical columns from int -> int
    using the top values up to the value threshold.
    """
    for col in cat_col_names:
        # nvcategory does the encoding for this column
        cat_top_values = count_d[col + '_counts']
        df[col] = cat_col_nvcat(df[col], cat_top_values)

    return df
```
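
Putting the three pieces together: the driver below is a hypothetical end-to-end sketch rather than the notebook's exact code. The column names, threshold, and dask.distributed client are placeholders, and map_partitions is an assumed (but standard) way to run cat_nvcat once per partition:

```python
# Placeholder names standing in for the 20/40 categorical columns.
cat_cols = [f"cat_{i}" for i in range(20)]

# Step 1: top-10k values per column; count_d maps '<col>_counts' to a
# cudf.Series holding that column's most frequent values.
count_d = get_value_counts_1_col_at_a_time(ddf, cat_cols, 10_000, client)

# Step 2: fill missing values with the -1 sentinel (lazy on the dask frame).
ddf = ddf.fillna(-1)

# Step 3: run the nvcategory encoding once per partition.
ddf = ddf.map_partitions(cat_nvcat, cat_cols, count_d)
ddf = ddf.persist()  # materialize so each step's wall time can be measured
```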

Scale Down Experiments on cudf:

| Length ratio | Number of rows | Value count time (s) | Fill NA time (s) | Categorization time (s) | Total time (s) |
| ---: | ---: | ---: | ---: | ---: | ---: |
| 1 | 32,737,500 | 1.77 | 0.84 | 3.98 | 6.59 |
| 2 | 16,368,750 | 1.36 | 0.58 | 2.13 | 4.07 |
| 4 | 8,184,375 | 0.76 | 0.38 | 1.27 | 2.40 |
| 8 | 4,092,187 | 0.83 | 0.13 | 0.57 | 1.53 |
| 16 | 2,046,093 | 0.66 | 0.25 | 0.37 | 1.28 |
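
A minimal sketch of how such a scale-down measurement can be reproduced on a single GPU; the file path, slicing, and timing details are assumptions rather than the notebook's exact code (cat_cols as in the driver sketch above):

```python
import time

import cudf

gdf = cudf.read_csv("data/full.csv")  # placeholder path; one in-memory frame

for ratio in (1, 2, 4, 8, 16):
    part = gdf.head(len(gdf) // ratio)  # take 1/ratio of the rows

    t0 = time.perf_counter()
    counts = {c: part[c].value_counts().head(10_000) for c in cat_cols}
    t1 = time.perf_counter()
    part = part.fillna(-1)
    t2 = time.perf_counter()

    print(ratio, len(part), t1 - t0, t2 - t1)
```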

Current Performance Guess:

From preliminary exploration, the cuDF functions are not scaling down as they should, even though we do have enough parallelism with dask. My guess is that aggregate functions like value_counts incur a communication + aggregation cost which appears to be non-negligible.

I am hoping communication will become much better, at least on single-node setups, with UCX, but for the time being chunk size does appear to have an impact on performance.
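
One way to make the aggregation-cost guess concrete (the column name is a placeholder): the task graph behind value_counts grows with the partition count, and every extra task adds scheduling and, potentially, communication work:

```python
# Rough illustration: more partitions -> a larger tree reduction behind
# value_counts -> more tasks to schedule and more intermediate transfers.
for npartitions in (8, 32, 136):
    graph = ddf.repartition(npartitions=npartitions)["cat_0"].value_counts().__dask_graph__()
    print(npartitions, len(graph))
```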

Experiments still to run:

CC: @mrocklin @pentschev @randerzander

@VibhuJawa added the Needs Triage (Need team to review and classify) and bug (Something isn't working) labels on Aug 7, 2019
@kkraus14 added the cuIO (cuIO issue) and question (Further information is requested) labels and removed the Needs Triage and bug labels on Aug 15, 2019
@kkraus14 added the Python (Affects Python cuDF API.) and dask (Dask issue) labels and removed the cuIO label on Aug 15, 2019
kkraus14 (Collaborator) commented:

@VibhuJawa I think something else here is that if the chunksize is 555 MB, with 20 columns that means there's ~28 MB per column, which sounds very small for a GPU. If there are 40 columns, then it's ~14 MB per column, which is even worse.
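
Spelling that arithmetic out (a trivial check, not from the original comment):

```python
chunksize_mib = 555
for ncols in (20, 40):
    # ~27.8 MiB per column with 20 columns, ~13.9 MiB with 40
    print(ncols, chunksize_mib / ncols)
```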

kkraus14 (Collaborator) commented:

Closing as this is stale and not an issue.
