
Refactor patterns #101

Merged
merged 22 commits into pangeo-forge:master from refactor-patterns on Apr 27, 2021
Conversation

@rabernat (Contributor) commented Apr 14, 2021

This is a big refactor that closes #99.

What I am changing

I have attempted to separate concerns more clearly by expanding the scope of the patterns module and reducing the scope of the recipes module. The patterns.FilePattern class now contains a more flexible and comprehensive representation of file patterns, allowing an arbitrary number of "combine dimensions" to create a hypercube of files. (Internally it is stored as an xarray dataset.) A function passed to this class tells it how to generate a filename from the keys. A concrete example from the tests:

from pangeo_forge.patterns import ConcatDim, FilePattern, MergeDim  # import path assumed from the refactor described above

concat = ConcatDim(name="time", keys=list(range(3)))
merge = MergeDim(name="variable", keys=["foo", "bar"])

def format_function(time, variable):
    return f"T_{time}_V_{variable}"

fp = FilePattern(format_function, merge, concat)
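To make the hypercube idea concrete, the same keys can be expanded by hand with plain Python (no pangeo-forge API involved) to show the 3 × 2 set of filenames the format function produces:

```python
# Expanding the pattern above by hand: 3 times x 2 variables = 6 files.
for time in range(3):
    for variable in ["foo", "bar"]:
        print(format_function(time, variable))
# T_0_V_foo, T_0_V_bar, T_1_V_foo, T_1_V_bar, T_2_V_foo, T_2_V_bar
```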

This is a bit Java-esque in its use of classes, but I find it useful to have these things be very explicit.

Because of this, the Recipe class has a lot less to worry about. It now takes a FilePattern as an argument and figures out what to do with it through introspection. Now there is only one recipe class: XarrayZarrRecipe.
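For illustration, a minimal sketch of how a recipe might be built from a pattern under this design; the import paths and the exact XarrayZarrRecipe signature are assumptions based on the description above, not necessarily this PR's final API.

```python
# Hypothetical sketch (import paths and call signatures are assumptions):
# build a one-dimensional FilePattern and hand it straight to the recipe,
# which introspects the pattern to plan its writes.
from pangeo_forge.patterns import ConcatDim, FilePattern
from pangeo_forge.recipes.xarray_zarr import XarrayZarrRecipe

def make_url(time):
    # placeholder URL scheme, purely for the example
    return f"https://example.com/data_{time}.nc"

pattern = FilePattern(make_url, ConcatDim(name="time", keys=list(range(3))))
recipe = XarrayZarrRecipe(pattern)
```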

How I did it

Refactored the code while leaving the test suite mostly unchanged. Good tests FTW! 🎉

How you can test it

  • Try to create a new recipe using this new syntax.
  • Try to implement a new type of recipe.

TODO

  • Reorganize code by moving XarrayZarrRecipe into its own module inside a recipes subdirectory.
  • Update docstrings
  • Update tutorials
  • Update narrative documentation

@rabernat (Contributor, Author)

cfgrib-related error, looks like an environment problem

```
import xarray as xr
/usr/share/miniconda/envs/pangeo-forge/lib/python3.8/site-packages/xarray/__init__.py:3: in <module>
    from . import testing, tutorial, ufuncs
/usr/share/miniconda/envs/pangeo-forge/lib/python3.8/site-packages/xarray/tutorial.py:14: in <module>
    from .backends.api import open_dataset as _open_dataset
/usr/share/miniconda/envs/pangeo-forge/lib/python3.8/site-packages/xarray/backends/__init__.py:6: in <module>
    from .cfgrib_ import CfGribDataStore
/usr/share/miniconda/envs/pangeo-forge/lib/python3.8/site-packages/xarray/backends/cfgrib_.py:18: in <module>
    import cfgrib
/usr/share/miniconda/envs/pangeo-forge/lib/python3.8/site-packages/cfgrib/__init__.py:19: in <module>
    from .cfmessage import CfMessage
/usr/share/miniconda/envs/pangeo-forge/lib/python3.8/site-packages/cfgrib/cfmessage.py:29: in <module>
    from . import messages
/usr/share/miniconda/envs/pangeo-forge/lib/python3.8/site-packages/cfgrib/messages.py:28: in <module>
    import eccodes  # type: ignore
/usr/share/miniconda/envs/pangeo-forge/lib/python3.8/site-packages/eccodes/__init__.py:15: in <module>
    from .eccodes import *
/usr/share/miniconda/envs/pangeo-forge/lib/python3.8/site-packages/eccodes/eccodes.py:12: in <module>
    from gribapi import __version__
/usr/share/miniconda/envs/pangeo-forge/lib/python3.8/site-packages/gribapi/__init__.py:13: in <module>
    from .gribapi import *  # noqa
/usr/share/miniconda/envs/pangeo-forge/lib/python3.8/site-packages/gribapi/gribapi.py:32: in <module>
    from .bindings import ENC, ffi, lib
/usr/share/miniconda/envs/pangeo-forge/lib/python3.8/site-packages/gribapi/bindings.py:29: in <module>
    from ._bindings import ffi, lib
E   ImportError: /usr/share/miniconda/envs/pangeo-forge/lib/python3.8/site-packages/gribapi/_bindings.cpython-38-x86_64-linux-gnu.so: undefined symbol: codes_bufr_key_is_header
Error: Process completed with exit code 4.
```

Looks like there was a cfgrib release a few days ago: https://github.com/ecmwf/cfgrib/releases/tag/0.9.9.0

@rabernat mentioned this pull request Apr 14, 2021
@rabernat force-pushed the refactor-patterns branch from 4c90462 to 3231c4a on April 26, 2021 02:07
@martindurant (Contributor)

Sorry, some disparate thoughts coming up.

Your design here is rather similar to the "gen" feature of fsspec's reference-maker - which should be a good thing :) Except that here there is no requirement to be multi-language, so you get the full force of Python string formatting.

What I want to know is: having figured out the paths and which concat and merge coords each one belongs to, how do we extract the appropriate zarr key? Do I understand that this is exactly what FilePattern does?

Now consider the case where each of the files itself contains multiple chunks, e.g., a smaller number of very large HDF files or a view over several zarrs. You may have a chunk "var/0.0" within a filepath "root/concat1_merge1", which would perhaps be equivalent to output key "var/1.1.0.0". Is this possible with the code here?

It is nice when the data can be described by the file path like this, but do you maintain the ability to scan all the files in the way that xr.open_mfdataset would do when necessary? Since it only needs the metadata and perhaps (small) coords blocks, this is not unreasonable. reference-maker would not need to access the actual data blocks, so no need for caching whole files.

@martindurant (Contributor)

Finally (and this is distinct enough to be a separate comment), the "idea" I mentioned in the meeting. This is a feature that gets requested in dask sometimes.

Supposing you have some parquet datasets (each perhaps many files) in various places, but you want to view the whole as a single dataset. Parquet has a convention that directories are datasets, and that you can encode categorical partition values into the folder names. Thus, a filesystem view of the original data could be used to construct the ensemble dataset without moving the originals, and we could interpret where in the ensemble the member files would be with either a file pattern as described here or some attribute found from the dataset metadata of each input. We could even make _metadata files for these global views.
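For concreteness, a minimal sketch of that Parquet directory convention using dask.dataframe; the paths and column names are made up for the example.

```python
# Hive-style partitioning: partition values are encoded in folder names,
# so the "ensemble" view is just a directory listing over the members.
import pandas as pd
import dask.dataframe as dd

df = pd.DataFrame({"region": ["us", "us", "eu"], "value": [1, 2, 3]})
dd.from_pandas(df, npartitions=1).to_parquet("ensemble/", partition_on=["region"])
# writes folders like ensemble/region=us/... and ensemble/region=eu/...

combined = dd.read_parquet("ensemble/")  # "region" is reconstructed from the folder names
```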

@rabernat (Contributor, Author)

> having figured out the paths and which concat and merge coords each one belongs to, how do we extract the appropriate zarr key? Do I understand that this is exactly what FilePattern does?

That is not part of the FilePattern object, because the FilePattern doesn't know or care about the target dataset. Instead, that logic lives in the Recipe object. The logic is relatively simple if there is a fixed number of "items" in each file (i.e. if nitems_per_file is set explicitly in the FilePattern). But we can't assume this in general for Pangeo Forge (see #50). So instead we have some rather nasty logic to figure out the mapping between pangeo_forge chunks (which should maybe be renamed to something less ambiguous, like "write blocks"), zarr chunks on disk, and input files.

https://github.com/pangeo-forge/pangeo-forge/blob/85291a7853f05ea26e1032eb819d6d54de056d59/pangeo_forge/recipes/xarray_zarr.py#L493-L523
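As a rough illustration of the simple fixed-size case only (not the actual code at the link above, which also handles variable-length inputs), the mapping can be thought of as plain index arithmetic; the function names below are hypothetical.

```python
# Hypothetical sketch of the chunk-to-input mapping for the fixed-size case,
# where every input file holds nitems_per_file items along the concat dimension.
def inputs_for_chunk(chunk_index, inputs_per_chunk=1):
    """Return the input-file indices that make up one write chunk."""
    start = chunk_index * inputs_per_chunk
    return list(range(start, start + inputs_per_chunk))

def chunk_bounds(chunk_index, inputs_per_chunk, nitems_per_file):
    """Start/stop indices of a write chunk along the concat dimension of the target."""
    items_per_chunk = inputs_per_chunk * nitems_per_file
    start = chunk_index * items_per_chunk
    return start, start + items_per_chunk

# e.g. chunk 2, with 2 files per chunk and 10 items per file,
# covers inputs [4, 5] and target indices 40..60
print(inputs_for_chunk(2, inputs_per_chunk=2))  # [4, 5]
print(chunk_bounds(2, 2, 10))                   # (40, 60)
```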

@rabernat (Contributor, Author)

Merging so we can keep moving forward.

@rabernat merged commit 904d91b into pangeo-forge:master Apr 27, 2021
Development

Successfully merging this pull request may close these issues:

Possible refactoring of interface between recipe and file patterns (#99)