Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update CoCiP-grid data access patterns to reduce duplicate chunk downloads #171

Merged
merged 9 commits into from
Apr 8, 2024

Conversation

thabbott
Copy link
Contributor

@thabbott thabbott commented Apr 2, 2024

Changes

Updates data access patterns in CocipGrid to avoid discarding and reloading met data after it has already been loaded into memory.

Benchmarks for the API preprocessor (https://github.com/contrailcirrus/api-preprocessor/tree/triage/coarse-fine-zarr-chunking):

main feature/cocip-grid-access-patterns
Remote zarr store Wall time: 386 s
Peak RAM: 3369 MiB
Ingress: 70.4 GiB
Model-level variables: 2.15 downloads/chunk
Single-level variables: 2.60 downloads/chunk
Wall time: 202 s
Peak RAM: 4037 MiB
Ingress: 38.3 GiB
Model-level variables: 1.17 downloads/chunk
Single-level variables: 1.80 downloads/chunk
Remote zarr store,
no dask
Wall time: 235 s
Peak RAM: 25298 MiB
Ingress: 58.8 GiB
Model-level variables: 1.55 downloads/chunk
Single-level variables: 1.97 downloads/chunk
Wall time: 202 s
Peak RAM: 28050 MiB
Ingress: 37.9 GiB
Model-level variables: 1.16 downloads/chunk
Single-level variables: 1.97 downloads/chunk
Local zarr store Wall time: 245 s
Peak RAM: 3726 MiB
Wall time: 210 s
Peak RAM: 4530 MiB
Local zarr store,
no dask
Wall time: 303 s
Peak RAM: 22952 MiB
Wall time: 297 s
Peak RAM: 22966 MiB

Numbers are all running on a c3d-highmem-4 VM (4 vCPUs, 32 GB RAM). Downloads/chunks are averages over all accessed chunks.

When using dask, the updated data access pattern cuts ingress and wall time by something like a factor of two, but comes with some memory overhead (a bit less than 1 GiB).

I've included benchmarks without dask (disabled by calling xr.open_zarr with chunks=None) to check that I'm not causing any major performance regressions for that case, but disabling dask increases memory consumption significantly, so I don't think those benchmarks are particularly relevant for the API preprocessor. (I will admit that I don't understand why dask-less benchmarks are slower with a local zarr store than a remote zarr store.... Not sure that's worth digging into, though.)

@nickmasson, @mlshapiro, open to feedback on whether this seems like a useful improvement. If the consensus is yes, I'll update unit tests (failing since the latest xarray release) and add a test or two to check for expected behavior in the updated _maybe_downselect_met_rad.

Internals

  • Modifies _maybe_downselect_met_rad method of CocipGrid to reuse in-memory data when new forecast steps are selected.

Tests

  • QC test passes locally (make test)
  • CI tests pass

Reviewer

Select @ reviewer

@nickmasson
Copy link
Contributor

@thabbott , this is solid 🦾

I'm of the opinion that it is worth adding these changes and cutting a release.

The additional RAM is a bit surprising, and it isn't a clear cut win for the case where we localize the zarr store. But, given that our nominal case is using a remote zarr store, this seems like a good improvement. Furthermore, if we are operating in a network-contrained environment, the ~50% decrease in data over the wire could lead to an even more substantive gain that what you've profiled here.

Unless there are other concerns, I think full steam ahead 🚀

Once there is a new release of pycontrails, I'll pull it into the api-preprocessor (along with other WIP), and get some numbers based on it running in k8s vs. a VM.

@thabbott thabbott force-pushed the feature/cocip-grid-access-patterns branch from 75b9af6 to 825eb39 Compare April 5, 2024 17:59
@thabbott
Copy link
Contributor Author

thabbott commented Apr 5, 2024

A few other notes after spending some time cleaning up this PR:

  1. Some unit tests broke because changes to netCDF decoding introduced in https://github.com/pydata/xarray/releases/tag/v2024.03.0 (correctly encode/decode _FillValues/missing_values/dtypes for packed data pydata/xarray#8713) lead to slight changes in some fixtures that rely on netCDF files with variables serialized as int16. This isn't something that you can fix by coercing from float64 to float32 after decoding (the value you get still differs from the old decoding straight from int16 to float32). I've updated unit tests to pass for v2024.03.0 (currently used in CI), but this means that unit tests will fail with older versions of xarray. Ideally we would update fixtures to read from netCDF files with values serialized as floats (no decoding necessary), but I don't think we have all the scripts used to generate existing fixtures.
  2. scipy 1.13.0 (https://github.com/scipy/scipy/releases/tag/v1.13.0) introduced some changes to spline interpolation (ENH: use NdBSpline in RegularGridInterpolator to speed up evaluations scipy/scipy#19633) that are incompatible with the current pycontrails RegularGridInterpolator. I'm dealing with this by explicitly falling back to legacy interpolation methods when using scipy 1.13.0 and newer, but we should consider whether we want to support the new spline interpolation methods.

@thabbott thabbott requested review from mlshapiro and nickmasson April 5, 2024 19:48
@thabbott thabbott marked this pull request as ready for review April 5, 2024 19:49
@thabbott thabbott changed the title WIP: update CoCiP-grid data access patterns to reduce duplicate chunk downloads Update CoCiP-grid data access patterns to reduce duplicate chunk downloads Apr 5, 2024
Copy link
Contributor

@nickmasson nickmasson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't speak to the implementation adjustments following the package version updates. But the implementation logic vis a vis what we discussed looks appropriate, and the summary numbers in the table position this change as a valuable enhancement.

noting:
I haven't been following the recent releases for pycontrails, but it appears that the recent resample_and_fill updates will also go out with this release. I'll make adjustments, if warranted, to the flights-pipeline/spire-ingest-resample-worker, as that was the instigator for those changes.

Copy link
Contributor

@mlshapiro mlshapiro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This all looks good, thank you for fixing the scipy problem.

If you want me to check into the nc file fixtures for the tests, i'm happy to go back through that

pycontrails/core/interpolation.py Show resolved Hide resolved
tests/unit/test_cocip_grid.py Show resolved Hide resolved
tests/unit/conftest.py Show resolved Hide resolved
@thabbott thabbott merged commit ff3425d into main Apr 8, 2024
11 checks passed
@thabbott thabbott deleted the feature/cocip-grid-access-patterns branch April 8, 2024 12:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants