Preprocessor creates too many small chunks #2184

bouweandela · 2023-08-29T07:25:25Z

Many preprocessor functions change the shape of the data. E.g.

area_statistics
annual_statistics
regrid
extract_levels

After such an operation, chunk sizes are not optimal and in cases where the data is reduced in size and/or dimensionality, we end up with many tiny chunks. This leads to computational performance issues that show up in the log as:

2023-08-24 17:29:41,646 - distributed.utils_perf - WARNING - full garbage collections took 30% CPU time recently (threshold: 10%)

and Dask documentation on the topic is available here.

To solve this issue, the following two solutions come to mind:

always rechunk based on some condition (e.g. changed data shape, too small/large chunk sizes)
implement a specific rechunk as part of each preprocessor function that needs it

We could also consider adding control over rechunking and peristence from the recipe, though I'm not sure how useful that will be.

The text was updated successfully, but these errors were encountered:

bouweandela · 2023-08-29T08:44:44Z

Example recipe that suffers from this issue:

preprocessors:
  preproc:
    custom_order: true
    area_statistics:
      operator: mean
    annual_statistics:
      operator: mean
    convert_units:
      units: 'degrees_C'
    ensemble_statistics:
      statistics:
        - mean
    multi_model_statistics:
      statistics:
        - mean
        - p17
        - p83
      span: full
      keep_input_datasets: false
      ignore_scalar_coords: true

diagnostics:
  diagnostic:
    variables:
      tos_ssp585:
        short_name: tos
        exp: ['historical', 'ssp585']
        project: CMIP6
        mip: Omon
        preprocessor: preproc
        timerange: '1850/2100'
    scripts: null

datasets:
  - {dataset: ACCESS-CM2, ensemble: 'r(1:5)i1p1f1', grid: gn}
  - {dataset: ACCESS-ESM1-5, ensemble: 'r(1:40)i1p1f1', grid: gn}
  - {dataset: AWI-CM-1-1-MR, ensemble: r1i1p1f1, grid: gn}
  - {dataset: BCC-CSM2-MR, ensemble: r1i1p1f1, grid: gn}
  - {dataset: CAS-ESM2-0, ensemble: r1i1p1f1, grid: gn}
  - {dataset: CAS-ESM2-0, ensemble: r3i1p1f1, grid: gn}
  - {dataset: CESM2, ensemble: r4i1p1f1, grid: gn}
  - {dataset: CESM2, ensemble: 'r(10:11)i1p1f1', grid: gn}
  - {dataset: CESM2-WACCM, ensemble: r1i1p1f1, grid: gn}
  - {dataset: CIESM, ensemble: r1i1p1f1, grid: gn}
  - {dataset: CMCC-CM2-SR5, ensemble: r1i1p1f1, grid: gn}
  - {dataset: CMCC-ESM2, ensemble: r1i1p1f1, grid: gn}
  - {dataset: CNRM-CM6-1, ensemble: 'r(1:6)i1p1f2', grid: gn}
  - {dataset: CNRM-CM6-1-HR, ensemble: r1i1p1f2, grid: gn}
  - {dataset: CNRM-ESM2-1, ensemble: 'r(1:5)i1p1f2', grid: gn}
  - {dataset: CanESM5, ensemble: 'r(1:25)i1p(1:2)f1', grid: gn}
  - {dataset: CanESM5-1, ensemble: 'r1i1p(1:2)f1', grid: gn, institute: CCCma}
  - {dataset: CanESM5-CanOE, ensemble: 'r(1:3)i1p2f1', grid: gn}
  - {dataset: EC-Earth3, ensemble: r1i1p1f1, grid: gn}
  - {dataset: EC-Earth3, ensemble: r4i1p1f1, grid: gn}
  - {dataset: EC-Earth3, ensemble: r6i1p1f1, grid: gn}
  - {dataset: EC-Earth3, ensemble: r11i1p1f1, grid: gn}
  - {dataset: EC-Earth3, ensemble: r15i1p1f1, grid: gn}
  - {dataset: EC-Earth3-Veg, ensemble: 'r(1:4)i1p1f1', grid: gn}
  - {dataset: EC-Earth3-Veg, ensemble: r6i1p1f1, grid: gn}
  - {dataset: FGOALS-f3-L, ensemble: 'r(1:3)i1p1f1', grid: gn}
  - {dataset: FGOALS-g3, ensemble: 'r(1:4)i1p1f1', grid: gn}
  - {dataset: FIO-ESM-2-0, ensemble: 'r(1:3)i1p1f1', grid: gn}
  - {dataset: GFDL-ESM4, ensemble: r1i1p1f1, grid: gn}
  - {dataset: GISS-E2-1-G, ensemble: 'r(1:4)i1p5f1', grid: gn, supplementary_variables: [{short_name: areacella, mip: fx, exp: piControl, ensemble: r1i1p5f1}, {short_name: areacello, skip: true}]}
  - {dataset: GISS-E2-1-G, ensemble: 'r(1:5)i1p1f2', grid: gn, supplementary_variables: [{short_name: areacella, mip: fx, exp: piControl, ensemble: r1i1p1f1}, {short_name: areacello, skip: true}]}
  - {dataset: GISS-E2-1-G, ensemble: 'r(1:5)i1p3f1', grid: gn, supplementary_variables: [{short_name: areacella, mip: fx, exp: piControl, ensemble: r1i1p3f1}, {short_name: areacello, skip: true}]}
  - {dataset: GISS-E2-1-H, ensemble: 'r(1:5)i1p1f2', grid: gn, supplementary_variables: [{short_name: areacella, mip: fx, exp: piControl, ensemble: r1i1p1f1}, {short_name: areacello, skip: true}]}
  - {dataset: GISS-E2-1-H, ensemble: 'r(1:5)i1p3f1', grid: gn, supplementary_variables: [{short_name: areacella, mip: fx, exp: piControl, ensemble: r1i1p3f1}, {short_name: areacello, skip: true}]}
  - {dataset: GISS-E2-2-G, ensemble: 'r(1:5)i1p3f1', grid: gn, supplementary_variables: [{short_name: areacella, mip: fx, exp: piControl, ensemble: r1i1p1f1}, {short_name: areacello, skip: true}]}
  - {dataset: HadGEM3-GC31-LL, ensemble: r1i1p1f3, grid: gn}
  - {dataset: HadGEM3-GC31-MM, ensemble: r1i1p1f3, grid: gn}
  - {dataset: INM-CM4-8, ensemble: r1i1p1f1, grid: gr1}
  - {dataset: INM-CM5-0, ensemble: r1i1p1f1, grid: gr1}
  - {dataset: IPSL-CM6A-LR, ensemble: 'r(1:4)i1p1f1', grid: gn}
  - {dataset: IPSL-CM6A-LR, ensemble: r6i1p1f1, grid: gn}
  - {dataset: IPSL-CM6A-LR, ensemble: r14i1p1f1, grid: gn}
  - {dataset: MCM-UA-1-0, ensemble: r1i1p1f2, grid: gn}
  - {dataset: MIROC-ES2H, ensemble: r1i1p4f2, grid: gn}
  - {dataset: MIROC-ES2L, ensemble: 'r(1:10)i1p1f2', grid: gn}
  - {dataset: MIROC6, ensemble: 'r(1:50)i1p1f1', grid: gn}
  - {dataset: MPI-ESM1-2-HR, ensemble: 'r1i1p1f1', grid: gn}
  - {dataset: MPI-ESM1-2-LR, ensemble: 'r(1:30)i1p1f1', grid: gn}
  - {dataset: MRI-ESM2-0, ensemble: 'r(1:5)i1p1f1', grid: gn}
  - {dataset: NorESM2-MM, ensemble: r1i1p1f1, grid: gn}
  - {dataset: TaiESM1, ensemble: r1i1p1f1, grid: gn}
  - {dataset: UKESM1-0-LL, ensemble: 'r(1:4)i1p1f2', grid: gn}
  - {dataset: UKESM1-0-LL, ensemble: r8i1p1f2, grid: gn}

bouweandela · 2023-08-29T08:49:00Z

Note that this problem is made worse by certain preprocessor functions that create many tiny chunks because their implementation is not optimal, e.g. #2179.

bouweandela · 2023-09-01T15:07:30Z

Related issue in iris SciTools/iris#5455

bouweandela · 2023-09-25T06:54:00Z

Related interesting new feature: https://discourse.pangeo.io/t/rechunking-large-data-at-constant-memory-in-dask-experimental/3266

bouweandela added the dask related to improvements using Dask label Aug 29, 2023

bouweandela mentioned this issue Sep 12, 2023

Add a realistic example recipe ESMValGroup/ESMValTool#3356

Merged

11 tasks

bouweandela mentioned this issue Sep 22, 2023

Add Dask graph debug info #2199

Draft

10 tasks

bouweandela mentioned this issue Sep 26, 2023

Rechunk between preprocessor steps #2205

Merged

8 tasks

bouweandela added this to ESiWACE3 ESMValTool service Feb 12, 2024

bouweandela moved this to Todo in ESiWACE3 ESMValTool service Feb 12, 2024

bouweandela closed this as completed Apr 3, 2024

github-project-automation bot moved this from Todo to Done in ESiWACE3 ESMValTool service Apr 3, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Preprocessor creates too many small chunks #2184

Preprocessor creates too many small chunks #2184

bouweandela commented Aug 29, 2023 •

edited

Loading

bouweandela commented Aug 29, 2023

bouweandela commented Aug 29, 2023

bouweandela commented Sep 1, 2023

bouweandela commented Sep 25, 2023

Preprocessor creates too many small chunks #2184

Preprocessor creates too many small chunks #2184

Comments

bouweandela commented Aug 29, 2023 • edited Loading

bouweandela commented Aug 29, 2023

bouweandela commented Aug 29, 2023

bouweandela commented Sep 1, 2023

bouweandela commented Sep 25, 2023

bouweandela commented Aug 29, 2023 •

edited

Loading