
Refactor patterns #101

Merged
merged 22 commits into pangeo-forge:master from refactor-patterns on Apr 27, 2021
Conversation

@rabernat (Contributor) commented Apr 14, 2021

This is a big refactor that closes #99.

What I am changing

I have attempted to separate concerns more clearly by expanding the scope of the patterns module and reducing the scope of the recipes module. The patterns.FilePattern class now contains a more flexible and comprehensive representation of file patterns, allowing an arbitrary number of "combine dimensions" to create a hypercube of files. (Internally it is stored as an xarray dataset.) A function passed to this class tells it how to generate a filename from the keys. A concrete example from the tests:

from pangeo_forge.patterns import ConcatDim, FilePattern, MergeDim  # import path assumed from the refactor described above

concat = ConcatDim(name="time", keys=list(range(3)))
merge = MergeDim(name="variable", keys=["foo", "bar"])

def format_function(time, variable):
    return f"T_{time}_V_{variable}"

fp = FilePattern(format_function, merge, concat)
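To make the hypercube idea concrete, the same keys can be expanded by hand with plain Python (no pangeo-forge API involved) to show the 3 × 2 set of filenames the format function produces:

```python
# Expanding the pattern above by hand: 3 times x 2 variables = 6 files.
for time in range(3):
    for variable in ["foo", "bar"]:
        print(format_function(time, variable))
# T_0_V_foo, T_0_V_bar, T_1_V_foo, T_1_V_bar, T_2_V_foo, T_2_V_bar
```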

This is a bit Java-esque in its use of classes, but I find it useful to have these things be very explicit.

Because of this, the Recipe class has a lot less to worry about. It now takes a FilePattern as an argument and figures out what to do with it through introspection. Now there is only one recipe class: XarrayZarrRecipe.
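For illustration, a minimal sketch of how a recipe might be built from a pattern under this design; the import paths and the exact XarrayZarrRecipe signature are assumptions based on the description above, not necessarily this PR's final API.

```python
# Hypothetical sketch (import paths and call signatures are assumptions):
# build a one-dimensional FilePattern and hand it straight to the recipe,
# which introspects the pattern to plan its writes.
from pangeo_forge.patterns import ConcatDim, FilePattern
from pangeo_forge.recipes.xarray_zarr import XarrayZarrRecipe

def make_url(time):
    # placeholder URL scheme, purely for the example
    return f"https://example.com/data_{time}.nc"

pattern = FilePattern(make_url, ConcatDim(name="time", keys=list(range(3))))
recipe = XarrayZarrRecipe(pattern)
```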

How I did it

Refactored the code while leaving the test suite mostly unchanged. Good tests FTW! 🎉

How you can test it

  • Try to create a new recipe using this new syntax.
  • Try to implement a new type of recipe.

TODO

  • Reorganize code by moving XarrayZarrRecipe into its own module inside a recipes subdirectory.
  • Update docstrings
  • Update tutorials
  • Update narrative documentation

@rabernat (Contributor, Author)

cfgrib-related error, looks like an environment problem

```
import xarray as xr
/usr/share/miniconda/envs/pangeo-forge/lib/python3.8/site-packages/xarray/__init__.py:3: in <module>
    from . import testing, tutorial, ufuncs
/usr/share/miniconda/envs/pangeo-forge/lib/python3.8/site-packages/xarray/tutorial.py:14: in <module>
    from .backends.api import open_dataset as _open_dataset
/usr/share/miniconda/envs/pangeo-forge/lib/python3.8/site-packages/xarray/backends/__init__.py:6: in <module>
    from .cfgrib_ import CfGribDataStore
/usr/share/miniconda/envs/pangeo-forge/lib/python3.8/site-packages/xarray/backends/cfgrib_.py:18: in <module>
    import cfgrib
/usr/share/miniconda/envs/pangeo-forge/lib/python3.8/site-packages/cfgrib/__init__.py:19: in <module>
    from .cfmessage import CfMessage
/usr/share/miniconda/envs/pangeo-forge/lib/python3.8/site-packages/cfgrib/cfmessage.py:29: in <module>
    from . import messages
/usr/share/miniconda/envs/pangeo-forge/lib/python3.8/site-packages/cfgrib/messages.py:28: in <module>
    import eccodes  # type: ignore
/usr/share/miniconda/envs/pangeo-forge/lib/python3.8/site-packages/eccodes/__init__.py:15: in <module>
    from .eccodes import *
/usr/share/miniconda/envs/pangeo-forge/lib/python3.8/site-packages/eccodes/eccodes.py:12: in <module>
    from gribapi import __version__
/usr/share/miniconda/envs/pangeo-forge/lib/python3.8/site-packages/gribapi/__init__.py:13: in <module>
    from .gribapi import *  # noqa
/usr/share/miniconda/envs/pangeo-forge/lib/python3.8/site-packages/gribapi/gribapi.py:32: in <module>
    from .bindings import ENC, ffi, lib
/usr/share/miniconda/envs/pangeo-forge/lib/python3.8/site-packages/gribapi/bindings.py:29: in <module>
    from ._bindings import ffi, lib
E   ImportError: /usr/share/miniconda/envs/pangeo-forge/lib/python3.8/site-packages/gribapi/_bindings.cpython-38-x86_64-linux-gnu.so: undefined symbol: codes_bufr_key_is_header
Error: Process completed with exit code 4.
```

Looks like there was a cfgrib release a few days ago: https://github.com/ecmwf/cfgrib/releases/tag/0.9.9.0

@rabernat mentioned this pull request Apr 14, 2021
@rabernat force-pushed the refactor-patterns branch from 4c90462 to 3231c4a on April 26, 2021 02:07
@martindurant (Contributor)

Sorry, some disparate thoughts coming up.

Your design here is rather similar to the "gen" feature of fsspec's reference-maker - which should be a good thing :) Except that here there is no requirement to be multi-language, so you get the full force of Python string formatting.

What I want to know is: having figured out the paths and which concat and merge coords each one belongs to, how do we extract the appropriate zarr key? Do I understand that this is exactly what FilePattern does?

Now consider the case where each of the files itself contains multiple chunks, e.g., a smaller number of very large HDF files or a view over several zarrs. You may have a chunk "var/0.0" within a filepath "root/concat1_merge1", which would perhaps be equivalent to output key "var/1.1.0.0". Is this possible with the code here?

It is nice when the data can be described by the file path like this, but do you maintain the ability to scan all the files in the way that xr.open_mfdataset would do when necessary? Since it only needs the metadata and perhaps (small) coords blocks, this is not unreasonable. reference-maker would not need to access the actual data blocks, so no need for caching whole files.

@martindurant (Contributor)

Finally (and this is distinct enough to be a separate comment), the "idea" I mentioned in the meeting. This is a feature that gets requested in dask sometimes.

Supposing you have some parquet datasets (each perhaps many files) in various places, but you want to view the whole as a single dataset. Parquet has a convention that directories are datasets, and that you can encode categorical partition values into the folder names. Thus, a filesystem view of the original data could be used to construct the ensemble dataset without moving the originals, and we could interpret where in the ensemble the member files would be with either a file pattern as described here or some attribute found from the dataset metadata of each input. We could even make _metadata files for these global views.
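For concreteness, a minimal sketch of that Parquet directory convention using dask.dataframe; the paths and column names are made up for the example.

```python
# Hive-style partitioning: partition values are encoded in folder names,
# so the "ensemble" view is just a directory listing over the members.
import pandas as pd
import dask.dataframe as dd

df = pd.DataFrame({"region": ["us", "us", "eu"], "value": [1, 2, 3]})
dd.from_pandas(df, npartitions=1).to_parquet("ensemble/", partition_on=["region"])
# writes folders like ensemble/region=us/... and ensemble/region=eu/...

combined = dd.read_parquet("ensemble/")  # "region" is reconstructed from the folder names
```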

@rabernat (Contributor, Author)

> having figured out the paths and which concat and merge coords each one belongs to, how do we extract the appropriate zarr key? Do I understand that this is exactly what FilePattern does?

That is not part of the FilePattern object, because the FilePattern doesn't know or care about the target dataset. Instead, that logic lives in the Recipe object. The logic is relatively simple if there is a fixed number of "items" in each file (i.e. if nitems_per_file is set explicitly in the FilePattern). But we can't assume this in general for Pangeo Forge (see #50). So instead we have some rather nasty logic to figure out the mapping between pangeo_forge chunks (which should maybe be renamed to something less ambiguous, like "write blocks"), zarr chunks on disk, and input files.

https://github.com/pangeo-forge/pangeo-forge/blob/85291a7853f05ea26e1032eb819d6d54de056d59/pangeo_forge/recipes/xarray_zarr.py#L493-L523
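As a rough illustration of the simple fixed-size case only (not the actual code at the link above, which also handles variable-length inputs), the mapping can be thought of as plain index arithmetic; the function names below are hypothetical.

```python
# Hypothetical sketch of the chunk-to-input mapping for the fixed-size case,
# where every input file holds nitems_per_file items along the concat dimension.
def inputs_for_chunk(chunk_index, inputs_per_chunk=1):
    """Return the input-file indices that make up one write chunk."""
    start = chunk_index * inputs_per_chunk
    return list(range(start, start + inputs_per_chunk))

def chunk_bounds(chunk_index, inputs_per_chunk, nitems_per_file):
    """Start/stop indices of a write chunk along the concat dimension of the target."""
    items_per_chunk = inputs_per_chunk * nitems_per_file
    start = chunk_index * items_per_chunk
    return start, start + items_per_chunk

# e.g. chunk 2, with 2 files per chunk and 10 items per file,
# covers inputs [4, 5] and target indices 40..60
print(inputs_for_chunk(2, inputs_per_chunk=2))  # [4, 5]
print(chunk_bounds(2, 2, 10))                   # (40, 60)
```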

@rabernat (Contributor, Author)

Merging so we can keep moving forward.

@rabernat merged commit 904d91b into pangeo-forge:master Apr 27, 2021
Development

Successfully merging this pull request may close these issues:

Possible refactoring of interface between recipe and file patterns (#99)