Refactor patterns #101
Conversation
cfgrib-related error, looks like an environment problem
```
import xarray as xr
/usr/share/miniconda/envs/pangeo-forge/lib/python3.8/site-packages/xarray/__init__.py:3: in
from . import testing, tutorial, ufuncs
/usr/share/miniconda/envs/pangeo-forge/lib/python3.8/site-packages/xarray/tutorial.py:14: in
from .backends.api import open_dataset as _open_dataset
/usr/share/miniconda/envs/pangeo-forge/lib/python3.8/site-packages/xarray/backends/__init__.py:6: in
from .cfgrib_ import CfGribDataStore
/usr/share/miniconda/envs/pangeo-forge/lib/python3.8/site-packages/xarray/backends/cfgrib_.py:18: in
import cfgrib
/usr/share/miniconda/envs/pangeo-forge/lib/python3.8/site-packages/cfgrib/__init__.py:19: in
from .cfmessage import CfMessage
/usr/share/miniconda/envs/pangeo-forge/lib/python3.8/site-packages/cfgrib/cfmessage.py:29: in
from . import messages
/usr/share/miniconda/envs/pangeo-forge/lib/python3.8/site-packages/cfgrib/messages.py:28: in
import eccodes # type: ignore
/usr/share/miniconda/envs/pangeo-forge/lib/python3.8/site-packages/eccodes/__init__.py:15: in
from .eccodes import *
/usr/share/miniconda/envs/pangeo-forge/lib/python3.8/site-packages/eccodes/eccodes.py:12: in
from gribapi import __version__
/usr/share/miniconda/envs/pangeo-forge/lib/python3.8/site-packages/gribapi/__init__.py:13: in
from .gribapi import * # noqa
/usr/share/miniconda/envs/pangeo-forge/lib/python3.8/site-packages/gribapi/gribapi.py:32: in
from .bindings import ENC, ffi, lib
/usr/share/miniconda/envs/pangeo-forge/lib/python3.8/site-packages/gribapi/bindings.py:29: in
from ._bindings import ffi, lib
E ImportError: /usr/share/miniconda/envs/pangeo-forge/lib/python3.8/site-packages/gribapi/_bindings.cpython-38-x86_64-linux-gnu.so: undefined symbol: codes_bufr_key_is_header
Error: Process completed with exit code 4.
```
Looks like there was a cfgrib release a few days ago: https://github.com/ecmwf/cfgrib/releases/tag/0.9.9.0. (The undefined `codes_bufr_key_is_header` symbol suggests the eccodes Python bindings in the environment are newer than the installed libeccodes binary.)
Sorry, some disparate thoughts coming up. Your design here is rather similar to the "gen" feature of fsspec-reference-maker, which should be a good thing :) Except that there is no restriction to be multi-language, so you get the full force of python string-format.

What I want to know is: having figured out the paths, and which concat and merge coords each one belongs to, how do we extract the appropriate zarr key? Do I understand that this is exactly what … does?

Now for the case where each of the files itself contains multiple chunks, e.g., a smaller number of very large HDFs or a view over several zarrs. You may have a chunk "var/0.0" within a filepath "root/concat1_merge1", so perhaps it would be equivalent to output key "var/1.1.0.0". Is this possible with the code here?

It is nice when the data can be described by the file-path like this, but do you maintain the ability to scan all the files in the way that xr.open_mfdataset would when necessary? Since it only needs the metadata and perhaps (small) coords blocks, this is not unreasonable. reference-maker would not need to access the actual data blocks, so no need for caching whole files.
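One way to make the chunk-key question concrete, under the assumption that each file contributes a fixed number of chunks along every concat dimension, is plain offset arithmetic. The helper below is hypothetical, not code from this PR:

```python
# Hypothetical helper, not code from this PR: translate a chunk key that is
# local to one file into a key in the combined zarr store, assuming every file
# holds a fixed number of chunks along each concat dimension.
def global_chunk_key(var, file_position, local_chunk, chunks_per_file):
    # file_position: the file's index along each concat dim, e.g. (1, 1)
    # local_chunk: chunk indices within that file, e.g. (0, 0) for "var/0.0"
    # chunks_per_file: chunks each file holds along each dim, e.g. (2, 3)
    global_index = [
        f * n + c for f, n, c in zip(file_position, chunks_per_file, local_chunk)
    ]
    return var + "/" + ".".join(str(i) for i in global_index)

# The file at position (1, 1), local chunk (0, 0), with 2 x 3 chunks per file,
# lands at global chunk "var/2.3":
print(global_chunk_key("var", (1, 1), (0, 0), (2, 3)))
```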
Finally (and this is distinct enough to be a separate comment), the "idea" I mentioned in the meeting. This is a feature that gets requested in dask sometimes. Suppose you have some parquet datasets (each perhaps many files) in various places, but you want to view the whole as a single dataset. Parquet has a convention that directories are datasets, and that you can encode categorical partition values into the folder names. Thus, a filesystem view of the original data could be used to construct the ensemble dataset without moving the originals, and we could interpret where in the ensemble the member files belong with either a file-pattern as described here, or some attribute found in the dataset metadata of each input. We could even make …
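As a rough illustration of that directory convention (hive-style partitioning; all paths here are made up):

```python
# Made-up layout: two existing parquet datasets exposed as one ensemble by
# encoding the member name as a partition directory, without moving any files:
#
#   ensemble/member=a/part.0.parquet   (view of /data/original_a/part.0.parquet)
#   ensemble/member=b/part.0.parquet   (view of /data/original_b/part.0.parquet)

import dask.dataframe as dd

# Readers that understand the convention recover "member" as a column.
df = dd.read_parquet("ensemble/")
```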
That is not part of the FilePattern object, because the FilePattern doesn't know or care about the target dataset. Instead, that logic lives in the Recipe object. The logic is relatively simple if there is a fixed number of "items" in each file (i.e. if …).
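For example, with a fixed item count the write region for input i is pure arithmetic; `nitems_per_input` below is an assumed name, not necessarily the PR's:

```python
# Assumed-name sketch: where input number `input_index` lands in the target,
# given a fixed number of items per input file.
def target_region(input_index, nitems_per_input):
    start = input_index * nitems_per_input
    return slice(start, start + nitems_per_input)

print(target_region(3, 24))  # slice(72, 96), e.g. hours 72-95 for daily files
```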
Merging so we can keep moving forward.
This is a big refactor that closes #99.
What I am changing
I have attempted to separate concerns more clearly by expanding the scope of the `patterns` module and reducing the scope of the `recipes` module. The `patterns.FilePattern` class now contains a more flexible and comprehensive representation of file patterns, allowing an arbitrary number of "combine dimensions" to create a hypercube of files. (Internally it is stored as an xarray dataset.) A function passed to this class tells how to generate the filename from the keys; a concrete example in the spirit of the tests is sketched below. This is a bit java-esque in its use of classes, but I find it useful to have these things be very explicit.
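A hedged sketch of what such a pattern can look like, assuming combine-dimension classes along the lines of `ConcatDim`/`MergeDim` and a user-supplied `make_url` function (names modeled on the later pangeo-forge-recipes API, not necessarily this PR's exact spelling):

```python
# Sketch only: import path and class names are assumptions modeled on the
# later pangeo-forge-recipes API, not necessarily this PR's exact spelling.
from pangeo_forge_recipes.patterns import ConcatDim, FilePattern, MergeDim

def make_url(variable, time):
    # User-supplied function: maps combine-dimension keys to a filename.
    return f"http://data-provider.example.org/{variable}_{time:03d}.nc"

pattern = FilePattern(
    make_url,
    ConcatDim(name="time", keys=list(range(10))),  # concatenate along time
    MergeDim(name="variable", keys=["temperature", "humidity"]),  # merge variables
)
# `pattern` now describes a 10 x 2 hypercube of files; each cell is one URL.
```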
Because of this, the `Recipe` class has a lot less to worry about. It now takes a `FilePattern` as an argument and figures out what to do with it through introspection. Now there is only one recipe class: `XarrayZarrRecipe`.
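A hypothetical usage sketch, continuing the pattern example above; only the fact that the recipe takes a `FilePattern` is stated here, the rest is illustrative:

```python
# Hypothetical usage; the recipe introspects the FilePattern passed to it.
from pangeo_forge_recipes.recipes import XarrayZarrRecipe

recipe = XarrayZarrRecipe(pattern)  # `pattern` is the FilePattern built above
```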
How I did it
Refactored the code while leaving the test suite mostly unchanged. Good tests FTW! 🎉
How you can test it
TODO
- Move `XarrayZarrRecipe` into its own module inside a `recipes` subdirectory.