-
Notifications
You must be signed in to change notification settings - Fork 39
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Preprocessor creates too many small chunks #2184
Labels
dask
related to improvements using Dask
Comments
Example recipe that suffers from this issue: preprocessors:
preproc:
custom_order: true
area_statistics:
operator: mean
annual_statistics:
operator: mean
convert_units:
units: 'degrees_C'
ensemble_statistics:
statistics:
- mean
multi_model_statistics:
statistics:
- mean
- p17
- p83
span: full
keep_input_datasets: false
ignore_scalar_coords: true
diagnostics:
diagnostic:
variables:
tos_ssp585:
short_name: tos
exp: ['historical', 'ssp585']
project: CMIP6
mip: Omon
preprocessor: preproc
timerange: '1850/2100'
scripts: null
datasets:
- {dataset: ACCESS-CM2, ensemble: 'r(1:5)i1p1f1', grid: gn}
- {dataset: ACCESS-ESM1-5, ensemble: 'r(1:40)i1p1f1', grid: gn}
- {dataset: AWI-CM-1-1-MR, ensemble: r1i1p1f1, grid: gn}
- {dataset: BCC-CSM2-MR, ensemble: r1i1p1f1, grid: gn}
- {dataset: CAS-ESM2-0, ensemble: r1i1p1f1, grid: gn}
- {dataset: CAS-ESM2-0, ensemble: r3i1p1f1, grid: gn}
- {dataset: CESM2, ensemble: r4i1p1f1, grid: gn}
- {dataset: CESM2, ensemble: 'r(10:11)i1p1f1', grid: gn}
- {dataset: CESM2-WACCM, ensemble: r1i1p1f1, grid: gn}
- {dataset: CIESM, ensemble: r1i1p1f1, grid: gn}
- {dataset: CMCC-CM2-SR5, ensemble: r1i1p1f1, grid: gn}
- {dataset: CMCC-ESM2, ensemble: r1i1p1f1, grid: gn}
- {dataset: CNRM-CM6-1, ensemble: 'r(1:6)i1p1f2', grid: gn}
- {dataset: CNRM-CM6-1-HR, ensemble: r1i1p1f2, grid: gn}
- {dataset: CNRM-ESM2-1, ensemble: 'r(1:5)i1p1f2', grid: gn}
- {dataset: CanESM5, ensemble: 'r(1:25)i1p(1:2)f1', grid: gn}
- {dataset: CanESM5-1, ensemble: 'r1i1p(1:2)f1', grid: gn, institute: CCCma}
- {dataset: CanESM5-CanOE, ensemble: 'r(1:3)i1p2f1', grid: gn}
- {dataset: EC-Earth3, ensemble: r1i1p1f1, grid: gn}
- {dataset: EC-Earth3, ensemble: r4i1p1f1, grid: gn}
- {dataset: EC-Earth3, ensemble: r6i1p1f1, grid: gn}
- {dataset: EC-Earth3, ensemble: r11i1p1f1, grid: gn}
- {dataset: EC-Earth3, ensemble: r15i1p1f1, grid: gn}
- {dataset: EC-Earth3-Veg, ensemble: 'r(1:4)i1p1f1', grid: gn}
- {dataset: EC-Earth3-Veg, ensemble: r6i1p1f1, grid: gn}
- {dataset: FGOALS-f3-L, ensemble: 'r(1:3)i1p1f1', grid: gn}
- {dataset: FGOALS-g3, ensemble: 'r(1:4)i1p1f1', grid: gn}
- {dataset: FIO-ESM-2-0, ensemble: 'r(1:3)i1p1f1', grid: gn}
- {dataset: GFDL-ESM4, ensemble: r1i1p1f1, grid: gn}
- {dataset: GISS-E2-1-G, ensemble: 'r(1:4)i1p5f1', grid: gn, supplementary_variables: [{short_name: areacella, mip: fx, exp: piControl, ensemble: r1i1p5f1}, {short_name: areacello, skip: true}]}
- {dataset: GISS-E2-1-G, ensemble: 'r(1:5)i1p1f2', grid: gn, supplementary_variables: [{short_name: areacella, mip: fx, exp: piControl, ensemble: r1i1p1f1}, {short_name: areacello, skip: true}]}
- {dataset: GISS-E2-1-G, ensemble: 'r(1:5)i1p3f1', grid: gn, supplementary_variables: [{short_name: areacella, mip: fx, exp: piControl, ensemble: r1i1p3f1}, {short_name: areacello, skip: true}]}
- {dataset: GISS-E2-1-H, ensemble: 'r(1:5)i1p1f2', grid: gn, supplementary_variables: [{short_name: areacella, mip: fx, exp: piControl, ensemble: r1i1p1f1}, {short_name: areacello, skip: true}]}
- {dataset: GISS-E2-1-H, ensemble: 'r(1:5)i1p3f1', grid: gn, supplementary_variables: [{short_name: areacella, mip: fx, exp: piControl, ensemble: r1i1p3f1}, {short_name: areacello, skip: true}]}
- {dataset: GISS-E2-2-G, ensemble: 'r(1:5)i1p3f1', grid: gn, supplementary_variables: [{short_name: areacella, mip: fx, exp: piControl, ensemble: r1i1p1f1}, {short_name: areacello, skip: true}]}
- {dataset: HadGEM3-GC31-LL, ensemble: r1i1p1f3, grid: gn}
- {dataset: HadGEM3-GC31-MM, ensemble: r1i1p1f3, grid: gn}
- {dataset: INM-CM4-8, ensemble: r1i1p1f1, grid: gr1}
- {dataset: INM-CM5-0, ensemble: r1i1p1f1, grid: gr1}
- {dataset: IPSL-CM6A-LR, ensemble: 'r(1:4)i1p1f1', grid: gn}
- {dataset: IPSL-CM6A-LR, ensemble: r6i1p1f1, grid: gn}
- {dataset: IPSL-CM6A-LR, ensemble: r14i1p1f1, grid: gn}
- {dataset: MCM-UA-1-0, ensemble: r1i1p1f2, grid: gn}
- {dataset: MIROC-ES2H, ensemble: r1i1p4f2, grid: gn}
- {dataset: MIROC-ES2L, ensemble: 'r(1:10)i1p1f2', grid: gn}
- {dataset: MIROC6, ensemble: 'r(1:50)i1p1f1', grid: gn}
- {dataset: MPI-ESM1-2-HR, ensemble: 'r1i1p1f1', grid: gn}
- {dataset: MPI-ESM1-2-LR, ensemble: 'r(1:30)i1p1f1', grid: gn}
- {dataset: MRI-ESM2-0, ensemble: 'r(1:5)i1p1f1', grid: gn}
- {dataset: NorESM2-MM, ensemble: r1i1p1f1, grid: gn}
- {dataset: TaiESM1, ensemble: r1i1p1f1, grid: gn}
- {dataset: UKESM1-0-LL, ensemble: 'r(1:4)i1p1f2', grid: gn}
- {dataset: UKESM1-0-LL, ensemble: r8i1p1f2, grid: gn} |
Note that this problem is made worse by certain preprocessor functions that create many tiny chunks because their implementation is not optimal, e.g. #2179. |
Related issue in iris SciTools/iris#5455 |
11 tasks
Related interesting new feature: https://discourse.pangeo.io/t/rechunking-large-data-at-constant-memory-in-dask-experimental/3266 |
8 tasks
github-project-automation
bot
moved this from Todo
to Done
in ESiWACE3 ESMValTool service
Apr 3, 2024
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Many preprocessor functions change the shape of the data. E.g.
After such an operation, chunk sizes are not optimal and in cases where the data is reduced in size and/or dimensionality, we end up with many tiny chunks. This leads to computational performance issues that show up in the log as:
and Dask documentation on the topic is available here.
To solve this issue, the following two solutions come to mind:
We could also consider adding control over rechunking and peristence from the recipe, though I'm not sure how useful that will be.
The text was updated successfully, but these errors were encountered: