Support crawling of existing data catalogs and automatic generation of FilePatterns #410
Comments
I know a package which implements a plugin system designed to make various catalogue providers appear under a consistent API. It even has a plugin system for catalogue types and individual entries. https://intake.readthedocs.io/en/latest/ By the way, as previously trailed (and unofficial), Anaconda is finally getting behind Intake and will be using its spec as the basis for dataset/catalogue exchange on anaconda.cloud. This work is scheduled for Q4. |
Yes of course intake, thanks for the reminder. |
@cisaacstern encouraged me to chime in for this issue. I have been working for a while now on migrating the pangeo CMIP6 cloud holdings to a less manual labor intensive workflow using pangeo forge. The basic idea is to generate a large dictionary of recipes, one for each dataset (itself combined out of possibly several files). The challenges for these particular datasets are twofold:
I have some initial solutions for both of these issues implemented as 'a ton of extra logic' in this feedstock, but as mentioned above this is somewhat cumbersome to maintain. Given the scale of the CMIP6 archive, it seems likely that we will eventually have to split it into several feedstocks, and having custom code duplicated across many feedstocks/recipes is not ideal. I have started to refactor some of the logic out into a stand-alone package, pangeo-forge-esgf, but this external dependency currently blocks execution on pangeo-forge cloud. I think that 1. above could be a very nice test case for a plug-in architecture. But even beyond that, case 2. might be another, slightly different and interesting use case. I am currently deriving all of the keyword arguments based on many range requests and imprecise size estimates before creating the recipe. As discussed here, this could actually be done much more precisely and quickly once the data has been cached. So I guess my question ultimately is whether the proposed plug-in structure could be general enough to 'attach' during different stages of the recipe. I am very keen to help anywhere I can to drive this effort forward, since it seems it might unblock my CMIP6 efforts along the way. |
A few notes re: plugins from @jbusecke and my chat this morning. For generating patterns based on ESGF queries, we thought it would be nice to be able to call FilePatterns something like this:

from pangeo_forge_recipes.patterns import FilePattern

esgf_instance_id_with_wildcards = "CMIP6.PMIP.*.*.lgm.*.*.uo.*.*"
pattern = FilePattern(esgf_instance_id_with_wildcards, plugin="esgf")

...so the ESGF plugin overloads FilePattern with its own plugin-specific signature, to allow construction of a pattern as is currently implemented in https://github.com/jbusecke/pangeo-forge-esgf. Then, following on Julius's mention of plugin-specific recipe kwargs, it would be great to be able to do something like:

# recipe `plugin` could be passed explicitly, or inferred from `pattern.plugin`
recipe = XarrayZarrRecipe(pattern, plugin="esgf")

At the XarrayZarrRecipe level, we imagined the plugin could potentially overwrite stages of the default recipe pipeline with plugin-specific stages. With default transforms referenced from #376, in pseudocode:

from dataclasses import dataclass
from typing import Optional

from pangeo_forge_recipes.plugins import registered_plugins

default_transforms = {
    "open_with_fsspec": OpenWithFSSpec,
    "open_with_xarray": OpenWithXarray,
    "infer_xarray_schema": InferXarraySchema,
    "prepare_zarr_target": PrepareZarrTarget,
    # ...
}

@dataclass
class XarrayZarrRecipe:
    file_pattern_source: FilePatternSource
    plugin: Optional[str] = None

    def __post_init__(self):
        if self.plugin and self.plugin not in registered_plugins:
            raise ValueError(f"Plugin '{self.plugin}' specified but not installed")

    def to_beam(self):
        # shallow copy is enough here; note that dict.copy() takes no arguments
        transforms = default_transforms.copy()
        if self.plugin:
            # `registered_plugins[self.plugin]` would be a dict in which the plugin optionally
            # defines overrides for any of the default transforms. here, we apply any overrides
            # the plugin has defined.
            overrides = registered_plugins[self.plugin]
            transforms = {k: overrides.get(k, v) for k, v in transforms.items()}
        chained_transform = (
            self.file_pattern_source
            | transforms["open_with_fsspec"]
            | transforms["open_with_xarray"]
            | transforms["infer_xarray_schema"]
            | transforms["prepare_zarr_target"]
            # ...
        )
        return chained_transform
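To make the override step in the pseudocode above concrete, here is a minimal, runnable sketch of the same idea. Everything in it is an assumption for illustration: the stages are plain functions rather than Beam transforms, and `REGISTERED_PLUGINS`, `resolve_transforms`, and `run_pipeline` are hypothetical names, not part of pangeo_forge_recipes.

```python
from typing import Callable, Dict, Optional

# Stand-in stages (hypothetical): plain functions instead of Beam transforms.
def open_with_fsspec(item: str) -> str:
    return f"fsspec({item})"

def open_with_xarray(item: str) -> str:
    return f"xarray({item})"

def esgf_open_with_fsspec(item: str) -> str:
    return f"esgf_fsspec({item})"

DEFAULT_TRANSFORMS: Dict[str, Callable[[str], str]] = {
    "open_with_fsspec": open_with_fsspec,
    "open_with_xarray": open_with_xarray,
}

# Hypothetical registry: each plugin maps stage names to replacement stages.
REGISTERED_PLUGINS: Dict[str, Dict[str, Callable[[str], str]]] = {
    "esgf": {"open_with_fsspec": esgf_open_with_fsspec},
}

def resolve_transforms(plugin: Optional[str] = None) -> Dict[str, Callable[[str], str]]:
    """Return the default stages with any plugin overrides applied."""
    if plugin is None:
        return dict(DEFAULT_TRANSFORMS)
    if plugin not in REGISTERED_PLUGINS:
        raise ValueError(f"Plugin '{plugin}' specified but not installed")
    overrides = REGISTERED_PLUGINS[plugin]
    unknown = set(overrides) - set(DEFAULT_TRANSFORMS)
    if unknown:
        # Reject overrides for stages the default pipeline does not have.
        raise KeyError(f"Plugin '{plugin}' overrides unknown stages: {sorted(unknown)}")
    return {**DEFAULT_TRANSFORMS, **overrides}

def run_pipeline(item: str, plugin: Optional[str] = None) -> str:
    """Apply the stages in a fixed order, mimicking the chained transform."""
    transforms = resolve_transforms(plugin)
    for stage in ("open_with_fsspec", "open_with_xarray"):
        item = transforms[stage](item)
    return item
```

One design choice worth noting: rejecting overrides for unknown stage names catches typos in a plugin early, whereas the dict comprehension in the pseudocode would silently ignore them.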
Also cc'ing @yuvipanda & @sharkinsspatial who have interest + expertise here and looks like haven't been tagged yet. |
IMO this class method approach has a nicer UI than overloading FilePattern (as I suggested above). I agree that it's impractical to maintain these methods in |
@cisaacstern, @briannapagan, @jbusecke and I had a quick call today to discuss this. We decided on a very specific solution to a specific problem here. I'm going to use CMR as the example, but this should apply to other catalogs too. Someone writing a recipe for a dataset that is coming out of CMR should be able to use their existing mental model of how CMR works and use just that to write the recipe. The easiest way to do that is to make a package like:

from pangeo_forge_recipes_cmr import CMRRecipe

recipe = CMRRecipe(short_name="GPM_3IMERGHHL")  # pass additional params here if needed

And it's the responsibility of the CMRRecipe object to translate this and make sure it actually provides a pangeo_forge_recipes Recipe object. This has several advantages:
I think there was general agreement that we needed some sort of plugin API as well, but this would already cover a lot of use cases with minimal fuss in a long-term sustainable way. The only feature really missing here is the ability to install arbitrary packages for use by With the end-of-September demo in mind, the next action items we decided on are:
@briannapagan and I have a meeting scheduled for Monday at 2pm Pacific to move forward here. @jbusecke @cisaacstern what can we do re: CMIP6 here? I'm also sure I missed some points of the discussion; others, feel free to chime in. |
I just want to reiterate that we haven't discounted any plugin systems, just that the one feature we need for plugins (arbitrary extra packages at parse time) already unlocks something that will solve many use cases (regular wrapper libraries), so pursuing that first. |
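A rough sketch of what such a wrapper library could look like, under loud assumptions: `CMRRecipeSketch`, `SketchRecipe`, and `fake_cmr_search` are hypothetical names, the catalog query is replaced by an in-memory lookup with made-up URLs, and the returned object is only a placeholder for a real pangeo_forge_recipes Recipe.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

# Made-up stand-in for a catalog: short name -> granule URLs.
FAKE_CMR_INDEX: Dict[str, List[str]] = {
    "GPM_3IMERGHHL": [
        "https://example.invalid/gpm/granule-002.nc",
        "https://example.invalid/gpm/granule-001.nc",
    ],
}

def fake_cmr_search(short_name: str) -> List[str]:
    """Hypothetical search function; a real wrapper would query the catalog."""
    return sorted(FAKE_CMR_INDEX.get(short_name, []))

@dataclass
class SketchRecipe:
    """Placeholder for the Recipe object the wrapper must ultimately provide."""
    urls: List[str]
    concat_dim: str

class CMRRecipeSketch:
    """Translate a catalog-level query into a recipe-like object."""

    def __init__(
        self,
        short_name: str,
        concat_dim: str = "time",
        search: Optional[Callable[[str], List[str]]] = None,
    ) -> None:
        self.short_name = short_name
        self.concat_dim = concat_dim
        # The search function is injectable so it can be swapped out in tests.
        self._search = search if search is not None else fake_cmr_search

    def to_recipe(self) -> SketchRecipe:
        urls = self._search(self.short_name)
        if not urls:
            raise ValueError(f"No files found for short_name={self.short_name!r}")
        return SketchRecipe(urls=urls, concat_dim=self.concat_dim)
```

The recipe author only supplies catalog vocabulary (`short_name`); everything about file discovery stays inside the wrapper package.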
Agree 💯 with this path @yuvipanda, thanks for proposing it, and summarizing it so clearly. Re: cmip6 use cases, this wrapper approach will be plug-and-play with https://github.com/jbusecke/pangeo-forge-esgf. 👍 Once the beam refactor is merged, this would even allow us to start experimenting with the sort of custom pipeline definitions I was brainstorming about in #410 (comment): the wrapper package could simply compose those custom pipelines itself. Looking forward to seeing this in action! Please let me know if/how/when I can help. |
There are many existing data catalogs out there. We currently require users to create a FilePattern from either a list of URLs or a formatting function and a set of keys. However, if the data are already in a catalog, these steps should be unnecessary. Instead, we should be able to generate a file pattern directly from a simple query (e.g. dataset_id="NOAA_GPCP", version=3.0, etc.).
Examples of catalog formats we might want to crawl are:
Here are some different ways we could achieve this:
Bespoke code in each recipe
This is possible today. You can write code to crawl anything you want, build a list of files, and then call pattern_from_file_sequence. This is what I do in the GPCP recipe.
Pros: simple and flexible
Cons: hard to scale, lots of redundant code, only supports 1D FilePatterns
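As an illustration of the bespoke approach, the following sketch crawls a tiny in-memory stand-in for a catalog and builds an ordered file list. The catalog entries, URLs, and the `crawl` helper are all made up for this example; the final call to `pattern_from_file_sequence` is shown only as a comment, on the assumption that it accepts a file list and a `concat_dim`.

```python
from typing import Any, Dict, List

# Made-up catalog entries; a real recipe would fetch these from the catalog.
CATALOG: List[Dict[str, Any]] = [
    {"dataset_id": "NOAA_GPCP", "version": 3.0, "date": "2020-01-02",
     "url": "https://example.invalid/gpcp/v3.0/20200102.nc"},
    {"dataset_id": "NOAA_GPCP", "version": 3.0, "date": "2020-01-01",
     "url": "https://example.invalid/gpcp/v3.0/20200101.nc"},
    {"dataset_id": "NOAA_GPCP", "version": 2.3, "date": "2020-01-01",
     "url": "https://example.invalid/gpcp/v2.3/20200101.nc"},
]

def crawl(catalog: List[Dict[str, Any]], **query: Any) -> List[str]:
    """Return URLs of entries matching every query key, ordered by date."""
    matches = [
        entry for entry in catalog
        if all(entry.get(key) == value for key, value in query.items())
    ]
    return [entry["url"] for entry in sorted(matches, key=lambda e: e["date"])]

urls = crawl(CATALOG, dataset_id="NOAA_GPCP", version=3.0)

# In an actual recipe, the list would then be handed to pangeo forge, e.g.:
#   from pangeo_forge_recipes.patterns import pattern_from_file_sequence
#   pattern = pattern_from_file_sequence(urls, concat_dim="time")
```

Because the result is a flat, ordered list, this style naturally yields a pattern with a single concatenation dimension, which matches the "only supports 1D FilePatterns" limitation listed above.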
Functions within pangeo forge recipes package
We could imagine creating some class methods on FilePattern that enable code like this
Pros: tightly integrated with pangeo forge
Cons: potentially grows the scope of pangeo forge recipes a lot with lots of messy, format-specific code
Plugin architecture
Or instead, we could use some sort of plugin architecture that allows third party packages to provide file-pattern constructors. Then the logic for each weird catalog format could live in a standalone repo, to be maintained by people who understand that format, while integrating tightly with pangeo forge.
Some different plugin approaches we could use
Pros: Clean separation of custom logic into separate repos, support the creation of private, org-specific plugins
Cons: More complex software engineering, potential challenges with testing
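One possible shape for such a plugin mechanism is a simple registry that third-party packages populate with a decorator. This is only a sketch: `register_pattern_plugin`, `file_list_from_catalog`, and the fake "esgf" plugin with its made-up instance ids are hypothetical, and a real design might discover plugins through Python entry points instead of an in-process dict.

```python
from fnmatch import fnmatchcase
from typing import Callable, Dict, List

# Hypothetical registry: plugin name -> constructor returning a list of matches.
_PATTERN_PLUGINS: Dict[str, Callable[..., List[str]]] = {}

def register_pattern_plugin(name: str) -> Callable:
    """Decorator that a third-party package would use to register itself."""
    def decorator(func: Callable[..., List[str]]) -> Callable[..., List[str]]:
        if name in _PATTERN_PLUGINS:
            raise ValueError(f"Plugin '{name}' is already registered")
        _PATTERN_PLUGINS[name] = func
        return func
    return decorator

def file_list_from_catalog(plugin: str, **query) -> List[str]:
    """Dispatch a catalog query to the named plugin."""
    if plugin not in _PATTERN_PLUGINS:
        raise ValueError(f"Plugin '{plugin}' specified but not installed")
    return _PATTERN_PLUGINS[plugin](**query)

# --- what would live in a separate, format-specific package ---------------

# Made-up instance ids standing in for the result of a real catalog search.
_FAKE_ESGF_IDS = [
    "CMIP6.PMIP.INST.MODEL-A.lgm.r1i1p1f1.Omon.uo.gn.v1",
    "CMIP6.PMIP.INST.MODEL-A.lgm.r1i1p1f1.Omon.thetao.gn.v1",
    "CMIP6.CMIP.INST.MODEL-A.historical.r1i1p1f1.Omon.uo.gn.v1",
]

@register_pattern_plugin("esgf")
def _esgf_instance_ids(instance_id: str) -> List[str]:
    """Return the fake ids that match a wildcard instance id."""
    return [i for i in _FAKE_ESGF_IDS if fnmatchcase(i, instance_id)]
```

With this in place, the core package only needs to know the plugin name, while all format-specific logic stays in the plugin's own repository.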
cc @briannapagan, who inspired this idea from her work with NASA CMR