Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Preprocessor creates too many small chunks #2184

Closed
bouweandela opened this issue Aug 29, 2023 · 4 comments
Closed

Preprocessor creates too many small chunks #2184

bouweandela opened this issue Aug 29, 2023 · 4 comments
Labels
dask related to improvements using Dask

Comments

@bouweandela
Copy link
Member

bouweandela commented Aug 29, 2023

Many preprocessor functions change the shape of the data. E.g.

  • area_statistics
  • annual_statistics
  • regrid
  • extract_levels

After such an operation, chunk sizes are not optimal and in cases where the data is reduced in size and/or dimensionality, we end up with many tiny chunks. This leads to computational performance issues that show up in the log as:

2023-08-24 17:29:41,646 - distributed.utils_perf - WARNING - full garbage collections took 30% CPU time recently (threshold: 10%)

and Dask documentation on the topic is available here.

To solve this issue, the following two solutions come to mind:

  1. always rechunk based on some condition (e.g. changed data shape, too small/large chunk sizes)
  2. implement a specific rechunk as part of each preprocessor function that needs it

We could also consider adding control over rechunking and peristence from the recipe, though I'm not sure how useful that will be.

@bouweandela bouweandela added the dask related to improvements using Dask label Aug 29, 2023
@bouweandela
Copy link
Member Author

Example recipe that suffers from this issue:

preprocessors:
  preproc:
    custom_order: true
    area_statistics:
      operator: mean
    annual_statistics:
      operator: mean
    convert_units:
      units: 'degrees_C'
    ensemble_statistics:
      statistics:
        - mean
    multi_model_statistics:
      statistics:
        - mean
        - p17
        - p83
      span: full
      keep_input_datasets: false
      ignore_scalar_coords: true

diagnostics:
  diagnostic:
    variables:
      tos_ssp585:
        short_name: tos
        exp: ['historical', 'ssp585']
        project: CMIP6
        mip: Omon
        preprocessor: preproc
        timerange: '1850/2100'
    scripts: null

datasets:
  - {dataset: ACCESS-CM2, ensemble: 'r(1:5)i1p1f1', grid: gn}
  - {dataset: ACCESS-ESM1-5, ensemble: 'r(1:40)i1p1f1', grid: gn}
  - {dataset: AWI-CM-1-1-MR, ensemble: r1i1p1f1, grid: gn}
  - {dataset: BCC-CSM2-MR, ensemble: r1i1p1f1, grid: gn}
  - {dataset: CAS-ESM2-0, ensemble: r1i1p1f1, grid: gn}
  - {dataset: CAS-ESM2-0, ensemble: r3i1p1f1, grid: gn}
  - {dataset: CESM2, ensemble: r4i1p1f1, grid: gn}
  - {dataset: CESM2, ensemble: 'r(10:11)i1p1f1', grid: gn}
  - {dataset: CESM2-WACCM, ensemble: r1i1p1f1, grid: gn}
  - {dataset: CIESM, ensemble: r1i1p1f1, grid: gn}
  - {dataset: CMCC-CM2-SR5, ensemble: r1i1p1f1, grid: gn}
  - {dataset: CMCC-ESM2, ensemble: r1i1p1f1, grid: gn}
  - {dataset: CNRM-CM6-1, ensemble: 'r(1:6)i1p1f2', grid: gn}
  - {dataset: CNRM-CM6-1-HR, ensemble: r1i1p1f2, grid: gn}
  - {dataset: CNRM-ESM2-1, ensemble: 'r(1:5)i1p1f2', grid: gn}
  - {dataset: CanESM5, ensemble: 'r(1:25)i1p(1:2)f1', grid: gn}
  - {dataset: CanESM5-1, ensemble: 'r1i1p(1:2)f1', grid: gn, institute: CCCma}
  - {dataset: CanESM5-CanOE, ensemble: 'r(1:3)i1p2f1', grid: gn}
  - {dataset: EC-Earth3, ensemble: r1i1p1f1, grid: gn}
  - {dataset: EC-Earth3, ensemble: r4i1p1f1, grid: gn}
  - {dataset: EC-Earth3, ensemble: r6i1p1f1, grid: gn}
  - {dataset: EC-Earth3, ensemble: r11i1p1f1, grid: gn}
  - {dataset: EC-Earth3, ensemble: r15i1p1f1, grid: gn}
  - {dataset: EC-Earth3-Veg, ensemble: 'r(1:4)i1p1f1', grid: gn}
  - {dataset: EC-Earth3-Veg, ensemble: r6i1p1f1, grid: gn}
  - {dataset: FGOALS-f3-L, ensemble: 'r(1:3)i1p1f1', grid: gn}
  - {dataset: FGOALS-g3, ensemble: 'r(1:4)i1p1f1', grid: gn}
  - {dataset: FIO-ESM-2-0, ensemble: 'r(1:3)i1p1f1', grid: gn}
  - {dataset: GFDL-ESM4, ensemble: r1i1p1f1, grid: gn}
  - {dataset: GISS-E2-1-G, ensemble: 'r(1:4)i1p5f1', grid: gn, supplementary_variables: [{short_name: areacella, mip: fx, exp: piControl, ensemble: r1i1p5f1}, {short_name: areacello, skip: true}]}
  - {dataset: GISS-E2-1-G, ensemble: 'r(1:5)i1p1f2', grid: gn, supplementary_variables: [{short_name: areacella, mip: fx, exp: piControl, ensemble: r1i1p1f1}, {short_name: areacello, skip: true}]}
  - {dataset: GISS-E2-1-G, ensemble: 'r(1:5)i1p3f1', grid: gn, supplementary_variables: [{short_name: areacella, mip: fx, exp: piControl, ensemble: r1i1p3f1}, {short_name: areacello, skip: true}]}
  - {dataset: GISS-E2-1-H, ensemble: 'r(1:5)i1p1f2', grid: gn, supplementary_variables: [{short_name: areacella, mip: fx, exp: piControl, ensemble: r1i1p1f1}, {short_name: areacello, skip: true}]}
  - {dataset: GISS-E2-1-H, ensemble: 'r(1:5)i1p3f1', grid: gn, supplementary_variables: [{short_name: areacella, mip: fx, exp: piControl, ensemble: r1i1p3f1}, {short_name: areacello, skip: true}]}
  - {dataset: GISS-E2-2-G, ensemble: 'r(1:5)i1p3f1', grid: gn, supplementary_variables: [{short_name: areacella, mip: fx, exp: piControl, ensemble: r1i1p1f1}, {short_name: areacello, skip: true}]}
  - {dataset: HadGEM3-GC31-LL, ensemble: r1i1p1f3, grid: gn}
  - {dataset: HadGEM3-GC31-MM, ensemble: r1i1p1f3, grid: gn}
  - {dataset: INM-CM4-8, ensemble: r1i1p1f1, grid: gr1}
  - {dataset: INM-CM5-0, ensemble: r1i1p1f1, grid: gr1}
  - {dataset: IPSL-CM6A-LR, ensemble: 'r(1:4)i1p1f1', grid: gn}
  - {dataset: IPSL-CM6A-LR, ensemble: r6i1p1f1, grid: gn}
  - {dataset: IPSL-CM6A-LR, ensemble: r14i1p1f1, grid: gn}
  - {dataset: MCM-UA-1-0, ensemble: r1i1p1f2, grid: gn}
  - {dataset: MIROC-ES2H, ensemble: r1i1p4f2, grid: gn}
  - {dataset: MIROC-ES2L, ensemble: 'r(1:10)i1p1f2', grid: gn}
  - {dataset: MIROC6, ensemble: 'r(1:50)i1p1f1', grid: gn}
  - {dataset: MPI-ESM1-2-HR, ensemble: 'r1i1p1f1', grid: gn}
  - {dataset: MPI-ESM1-2-LR, ensemble: 'r(1:30)i1p1f1', grid: gn}
  - {dataset: MRI-ESM2-0, ensemble: 'r(1:5)i1p1f1', grid: gn}
  - {dataset: NorESM2-MM, ensemble: r1i1p1f1, grid: gn}
  - {dataset: TaiESM1, ensemble: r1i1p1f1, grid: gn}
  - {dataset: UKESM1-0-LL, ensemble: 'r(1:4)i1p1f2', grid: gn}
  - {dataset: UKESM1-0-LL, ensemble: r8i1p1f2, grid: gn}

@bouweandela
Copy link
Member Author

Note that this problem is made worse by certain preprocessor functions that create many tiny chunks because their implementation is not optimal, e.g. #2179.

@bouweandela
Copy link
Member Author

Related issue in iris SciTools/iris#5455

@bouweandela
Copy link
Member Author

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
dask related to improvements using Dask
Projects
Development

No branches or pull requests

1 participant