Example pipeline for IMERG #5
Comments
@davidbrochart - I'm interested in picking up this ticket but curious if you know of use cases or users I should be aware of in generating the Zarr store. This could inform how we select variables and chunk configuration for the output. |
Thanks @abarciauskas-bgse, I don't have much time to work on it, so I would be happy if you could pick it up. |
@abarciauskas-bgse I'm curious about your progress on this recipe, and I would definitely like to contribute, if you need my help. |
Hey David, thanks for reaching out. @sharkinsspatial and I are working on this together. So far I have been working on adapting the example-pipeline for IMERG, see https://github.com/developmentseed/example-pipeline/tree/abarciauskas-bgse_imerg Steps working are fetching and storing the HDF files, what's not working is @sharkinsspatial is working on a cloud deployment of prefect using Fargate + Dask which I know less about. |
Awesome, I just remembered this issue, but that was a long time ago and maybe it's not relevant anymore. |
Ah yes, thanks for pointing this out @davidbrochart - I started finagling with a URL pattern, but once I realized there are things about the URL pattern that don't easily translate from a datetime, I switched to using Beautiful Soup to parse all the HDF5 file links from each of the Julian-day-level parent directory pages. |
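The link-scraping approach described above can be sketched roughly as follows. This is a minimal illustration, not the pipeline's code: it uses the standard library's html.parser in place of Beautiful Soup so that it runs without extra dependencies, and the listing page, the base URL, and the names HDF5LinkParser and hdf5_links are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class HDF5LinkParser(HTMLParser):
    """Collect absolute URLs of links ending in .HDF5 from a listing page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        # Keep only data files; skip sidecar files such as .HDF5.xml.
        if href and href.upper().endswith(".HDF5"):
            url = urljoin(self.base_url, href)
            if url not in self.links:
                self.links.append(url)


def hdf5_links(page_html, base_url):
    parser = HDF5LinkParser(base_url)
    parser.feed(page_html)
    return parser.links


# Hypothetical listing page for one day-of-year directory.
sample_page = """
<html><body>
<a href="../">Parent Directory</a>
<a href="3B-HHR.MS.MRG.3IMERG.20200930-S000000-E002959.0000.V06B.HDF5">data</a>
<a href="3B-HHR.MS.MRG.3IMERG.20200930-S000000-E002959.0000.V06B.HDF5.xml">xml</a>
<a href="3B-HHR.MS.MRG.3IMERG.20200930-S003000-E005959.0030.V06B.HDF5">data</a>
</body></html>
"""
base = "https://example.invalid/imerg/2020/274/"
links = hdf5_links(sample_page, base)
```

Parsing the listing sidesteps the need to reconstruct every file name component from a datetime, at the cost of one extra request per day directory.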
BTW, I see you're using this URL:
instead of this one in my original script:
Are they equivalent? |
Oh thanks for reminding me about the difference in sources; I had difficulty signing up for a PPS account (the registration process seems to send me in circles). Could you share an example file from https://jsimpsonhttps.pps.eosdis.nasa.gov with [email protected] so I can compare it with the same datetime from gpm1.gesdisc? From a while back I vaguely recall that you can get data in "real time" from the jsimpsonhttps PPS source, but since I was having trouble registering I moved to the gpm1.gesdisc source, which just requires URS Earthdata credentials. Follow-up questions about the specific product we might want to use. There are a couple of options for the precipitation product:
|
Actually looking at https://github.com/davidbrochart/pangeo_upload/blob/master/py/gpm2pangeo.py#L118-L121 might get me what I need to generate the zarr store using the half-hour product. 🙌 |
I just sent you an email.
That's weird, I have not tried recently, but for me the registration was easy.
I would say it's always better to have the data with the best resolution, because we can generate the other resolutions by using e.g. xarray's coarsen method.
Yeah that's a good question. I guess I chose the late run because it was a trade-off between the final run and the early run 😄 |
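As a rough illustration of deriving a coarser resolution from the half-hourly data with xarray's coarsen method, the sketch below averages pairs of half-hour steps into hourly steps. The data is synthetic, and the variable name and grid size are assumptions made for the example.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Synthetic half-hourly precipitation rate on a tiny 2 x 2 grid.
time = pd.date_range("2020-09-30", periods=4, freq="30min")
rate = xr.DataArray(
    np.arange(16, dtype="float64").reshape(4, 2, 2),
    dims=("time", "lat", "lon"),
    coords={"time": time},
    name="precipitation_rate",
)

# Average every 2 consecutive half-hour steps into one hourly step.
hourly = rate.coarsen(time=2).mean()
```

The same call with lat and lon window sizes would coarsen spatially, which is why storing the finest resolution keeps the other options open.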
Yes, there is everything in there, but I agree it's not that easy to read 😄 |
Thanks @davidbrochart, I am working on a script to generate the zarr store using your code, but of course I'm running into the limits of my zarr chunking experience with how to handle the time dimension. There is the added complexity that right now the download code doesn't download to the original filename but to a hash of the source URL (so files are not necessarily going to be listed "in order" according to their datetime). Perhaps we can find a time to discuss how to handle this - it looks like your code generates the time chunks more "from scratch", so I'm wondering if that is the only way. Will coordinate a time over email if that works for you! |
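One way to recover time order when local files are named by a hash of the source URL is to sort on the start time embedded in the original IMERG file name. The sketch below is only an illustration: the use of MD5 is an assumption (the pipeline's real hashing scheme may differ), the URLs are made up, and the helper names are hypothetical.

```python
import hashlib
import re
from datetime import datetime

# Start time as encoded in IMERG file names, e.g. ".20200930-S003000-E005959."
START_RE = re.compile(r"\.(\d{8})-S(\d{6})-E\d{6}\.")


def hashed_name(url):
    # Assumption: MD5 of the URL; the real download code may hash differently.
    return hashlib.md5(url.encode("utf-8")).hexdigest()


def start_time(url):
    match = START_RE.search(url)
    if match is None:
        raise ValueError(f"no IMERG timestamp in {url!r}")
    return datetime.strptime(match.group(1) + match.group(2), "%Y%m%d%H%M%S")


def ordered_local_files(urls):
    """Return (start_time, hashed_name) pairs sorted by start time."""
    return sorted((start_time(u), hashed_name(u)) for u in urls)


# Hypothetical source URLs, deliberately listed out of order.
base = "https://example.invalid/imerg/2020/274/"
urls = [
    base + "3B-HHR.MS.MRG.3IMERG.20200930-S003000-E005959.0030.V06B.HDF5",
    base + "3B-HHR.MS.MRG.3IMERG.20200930-S000000-E002959.0000.V06B.HDF5",
]
ordered = ordered_local_files(urls)
```

Keeping the URL-to-hash mapping around means the files on disk never need to be renamed.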
Update @davidbrochart: I still have to work out the Zarr chunking, but I think we can simplify the Zarr store generation using the code in https://github.com/developmentseed/example-pipeline/blob/abarciauskas-bgse_imerg/create_zarr.py This seems to work for creating a zarr store that looks like this (for 4 files):
... but I wanted you to take a look at this in case this method doesn't handle certain needs of this dataset that I'm not aware of. |
Indeed that is much simpler! |
Time comes in order, although I think this required adding the argument
I am having trouble finding a data dictionary that explains these variables |
Can these datasets be opened by xarray? If so, this might be ready to go with the latest version of pangeo forge. If not, please open an issue in pangeo-forge/pangeo-forge to describe what extra functionality is needed. |
@rabernat the
Complete example:
from xarray import open_dataset
path_to_file = "3B-HHR.MS.MRG.3IMERG.20200930-S000000-E002959.0000.V06B.HDF5"
dataset = open_dataset(path_to_file, group="Grid")
The same is needed for the |
@CiaranEvans @abarciauskas-bgse I'm also working on this recipe. I opened an issue in pangeo-forge to report the problems I'm having. In addition to passing
recipe = NetCDFtoZarrSequentialRecipe(
    input_urls=input_urls,
    sequence_dim="time",
    inputs_per_chunk=4,
    xarray_open_kwargs={'group': 'Grid'},
    fsspec_open_kwargs={'client_kwargs': {'auth': aiohttp.BasicAuth('username', 'password')}}
)
I have opened pangeo-forge/pangeo-forge-recipes#59 for that. |
Hey @davidbrochart - Definitely, I don't want to re-implement something you've already done well! We were actually hoping to convert it to COG (Cloud Optimized GeoTIFF), though I'm now wondering if that is worth another ticket. |
Good question. Is it possible for a recipe to have multiple outputs @rabernat (Zarr and COG in this case)? |
From everything I've read so far, it seems the majority of recipes are Zarr-heavy. We'd quite like COG to become a first-class output too, even if that means something like having both |
YES to COG! We just need an issue to track that. We would like to support many different input and output formats. |
@rabernat cool, I can post an issue on here? Or should it go to |
The idea is that here we post ideas for specific recipes and on pangeo-forge we post specific atomic feature enhancements needed to support those recipes. Since we already have a motivating recipe (this one), I think we just need the issue for COG output. |
Okay, so COG specific features to go to |
I think this should be able to work with the latest master of pangeo-forge (now that pangeo-forge/pangeo-forge-recipes#59 is in). We still don't have a working system for formally submitting new recipes. I would recommend making a new repo and putting a single |
p.s. I assume you don't want to put the credentials directly in git. I imagined using github secrets for this. So the recipe could be configured to pull the secrets from a github workflow environment variable. I believe this would be secure and convenient. |
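A minimal sketch of pulling the credentials from environment variables rather than committing them is shown below. The variable names EARTHDATA_USERNAME and EARTHDATA_PASSWORD and the helper earthdata_credentials are hypothetical; in a GitHub workflow they would be populated from repository secrets.

```python
import os


def earthdata_credentials(env=None):
    """Read a username/password pair from environment variables."""
    env = os.environ if env is None else env
    try:
        # Hypothetical variable names; use whatever the workflow exports.
        return env["EARTHDATA_USERNAME"], env["EARTHDATA_PASSWORD"]
    except KeyError as missing:
        raise RuntimeError(f"missing environment variable {missing}") from None


# A plain dict stands in for os.environ so the example runs anywhere.
username, password = earthdata_credentials(
    {"EARTHDATA_USERNAME": "user", "EARTHDATA_PASSWORD": "not-a-real-password"}
)
```

The resulting pair could then be passed to aiohttp.BasicAuth in fsspec_open_kwargs, as in the recipe snippet earlier in the thread, so that no secret appears in the repository.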
It would be nice to have a release of rechunker, although we can also use the latest master. |
I set up a repository for the GPM IMERG recipe and use GitHub secrets for the credentials. I am able to reproduce the error about cftime coordinates, see https://github.com/davidbrochart/pangeo-forge-recipes/runs/1779828142 |
🎉
Well if you have any insight on what is going on there, PR welcome! |
Diff PR against fork's master branch.
Source Dataset
IMERG is a dataset of 0.1° half-hourly precipitation estimates over the majority of the Earth's surface from 2000-present.
Transformation / Alignment / Merging
Files should be concatenated along the time dimension.
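Concatenation along the time dimension might look like the following sketch, which uses small synthetic datasets in place of real IMERG granules. The variable name, grid size, and the one_granule helper are assumptions for illustration only.

```python
import numpy as np
import xarray as xr


def one_granule(start):
    """Synthetic stand-in for one half-hourly file: a single time step."""
    return xr.Dataset(
        {"precipitation": (("time", "lat", "lon"), np.zeros((1, 2, 2)))},
        coords={"time": [np.datetime64(start, "ns")]},
    )


# Deliberately out of order, as downloaded files may be.
granules = [
    one_granule("2020-09-30T00:30"),
    one_granule("2020-09-30T00:00"),
    one_granule("2020-09-30T01:00"),
]

# Concatenate along time, then sort so the time axis is monotonic.
combined = xr.concat(granules, dim="time").sortby("time")
```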
Output Dataset
1 Zarr store - chunks oriented for both time series and spatial analysis.
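To make the chunking trade-off concrete, the sketch below estimates uncompressed chunk sizes for a few candidate chunk shapes. It assumes a global 0.1° grid of 3600 x 1800 cells and 4-byte float values; the candidate shapes are illustrative, not a recommendation from the thread.

```python
from math import prod

# Assumption: float32 values, so 4 bytes per grid cell per time step.
BYTES_PER_VALUE = 4


def chunk_megabytes(chunks):
    """Uncompressed size of one chunk in megabytes (10**6 bytes)."""
    return prod(chunks.values()) * BYTES_PER_VALUE / 1e6


# Illustrative chunk shapes for an assumed 3600 x 1800 global grid.
candidates = {
    "spatial": {"time": 1, "lon": 3600, "lat": 1800},     # one full map
    "time_series": {"time": 48, "lon": 100, "lat": 100},  # one day, small tile
    "compromise": {"time": 16, "lon": 450, "lat": 450},
}
sizes = {name: chunk_megabytes(shape) for name, shape in candidates.items()}
```

A full-map chunk favors spatial analysis, a long-in-time small tile favors time series, and an intermediate shape trades some efficiency in each direction.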