
Example pipeline for IMERG #5

Open · davidbrochart opened this issue Jul 30, 2020 · 30 comments
@davidbrochart

Source Dataset

IMERG is a dataset of 0.1° half-hourly precipitation estimates over the majority of the Earth's surface from 2000-present.

Transformation / Alignment / Merging

Files should be concatenated along the time dimension.

Output Dataset

1 Zarr store - chunks oriented for both time series and spatial analysis.
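
For illustration, a minimal xarray sketch of writing such a store (toy stand-in data; the variable name, chunk sizes, and output path are placeholders, not a recommendation):

import numpy as np
import pandas as pd
import xarray as xr

# Stand-in for a few concatenated half-hourly granules on the 0.1° grid
# (the values are random placeholders).
ds = xr.Dataset(
    {"precipitationCal": (("time", "lat", "lon"),
                          np.random.rand(4, 1800, 3600).astype("float32"))},
    coords={
        "time": pd.date_range("2000-06-01", periods=4, freq="30min"),
        "lat": np.linspace(-89.95, 89.95, 1800),
        "lon": np.linspace(-179.95, 179.95, 3600),
    },
)

# Moderately sized chunks along every dimension are the usual compromise
# between pure time-series access and pure spatial access.
ds.chunk({"time": 4, "lat": 900, "lon": 900}).to_zarr("imerg.zarr", mode="w")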

@abarciauskas-bgse

@davidbrochart - I'm interested in picking up this ticket but curious if you know of use cases or users I should be aware of in generating the Zarr store. This could inform how we select variables and chunk configuration for the output.

@davidbrochart
Author

Thanks @abarciauskas-bgse, I don't have much time to work on it, so I would be happy if you could pick it up.
I have already uploaded some GPM IMERG data at gs://pangeo-data/gpm_imerg/late/chunk_time, and I used this script to do so:
https://github.com/davidbrochart/pangeo_upload/blob/master/py/gpm2pangeo.py
You should find information in the Zarr store and in the script, but otherwise don't hesitate to ask me.

@davidbrochart
Author

@abarciauskas-bgse I'm curious about your progress on this recipe, and I would definitely like to contribute, if you need my help.

@abarciauskas-bgse

Hey David, thanks for reaching out. @sharkinsspatial and I are working on this together. So far I have been working on adapting the example-pipeline for IMERG, see https://github.com/developmentseed/example-pipeline/tree/abarciauskas-bgse_imerg

Fetching and storing the HDF files is working; what's not working is combine_and_write, which makes sense, as the source is now the IMERG HDF5 files and not the SST NetCDF files. So my next step (🤞 tomorrow) is to review your code for translating the HDF5 files to Zarr.

@sharkinsspatial is working on a cloud deployment of Prefect using Fargate + Dask, which I know less about.

@davidbrochart
Author

Awesome, I just remembered this issue, but that was a long time ago and maybe it's not relevant anymore.

@abarciauskas-bgse

Ah yes, thanks for pointing this out @davidbrochart - I started finagling with a URL pattern, but once I realized there are things about the URL pattern that don't easily translate from a datetime, I started just using Beautiful Soup to parse all the HDF5 file links from each of the Julian day-level parent directory pages.
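
A rough sketch of that scraping step, assuming a day-level directory URL of the form below (the exact path layout and the authentication handling are assumptions):

import requests
from bs4 import BeautifulSoup

# Hypothetical day-level directory page (year 2000, Julian day 153); in
# practice the request would also need Earthdata authentication.
day_url = "https://gpm1.gesdisc.eosdis.nasa.gov/data/GPM_L3/GPM_3IMERGHH.06/2000/153/"

soup = BeautifulSoup(requests.get(day_url).text, "html.parser")

# Grab every link that points at an HDF5 granule, de-duplicated and sorted
# so the half-hourly files come back in time order.
hdf5_links = sorted(
    {a["href"] for a in soup.find_all("a", href=True) if a["href"].endswith(".HDF5")}
)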

@davidbrochart
Author

BTW, I see you're using this URL:

https://gpm1.gesdisc.eosdis.nasa.gov/...

instead of this one in my original script:

https://jsimpsonhttps.pps.eosdis.nasa.gov/...

Are they equivalent?

@abarciauskas-bgse

Oh thanks for reminding me about the difference in sources; I had difficulty signing up for a PPS account (the registration process seems to send me in circles). Could you share an example file from https://jsimpsonhttps.pps.eosdis.nasa.gov with [email protected] so I can compare it with the file for the same datetime from gpm1.gesdisc?

From a while back I vaguely recall that you are able to get data in "real time" from the jsimpsonhttps PPS source but given I was having trouble registering I moved to the gpm1.gesdisc source since it just requires URS Earthdata credentials.

Follow-up questions about the specific product we might want to use. There are a couple of options for the precipitation product:

  • Time span: the half-hourly product is of course more granular but only comes in HDF5. As an initial go I might use the daily product since it is in netCDF but do you know of an important use case that necessitates the half-hourly product?
  • "Late Run" vs "Final Run" - do you know if there is a preference from the science community on which product is used more and why?

    The half-hourly Final Run product uses a month-to-month adjustment to the monthly Final Run product, which combines the multi-satellite data for the month with GPCC gauge analysis. The adjustment within the month in each half hour is a ratio multiplier that's fixed for the month, but spatially varying.
    The Late Run is computed about 14 hours after observation time, so sometimes a microwave overpass is not delivered in time for the Late Run, but subsequently comes in and can be used in the Final. This would affect both the half hour in which the overpass occurs, and (potentially) morphed values in nearby half hours.

@abarciauskas-bgse

Actually, looking at https://github.com/davidbrochart/pangeo_upload/blob/master/py/gpm2pangeo.py#L118-L121 might get me what I need to generate the Zarr store using the half-hour product. 🙌

@davidbrochart
Author

Oh thanks for reminding me about the difference in sources; I had difficulty signing up for a PPS account (the registration process seems to send me in circles). Could you share an example file from https://jsimpsonhttps.pps.eosdis.nasa.gov with [email protected] so I can compare it with the file for the same datetime as from gpm1.gesdisc?

I just sent you an email.

From a while back I vaguely recall that you are able to get data in "real time" from the jsimpsonhttps PPS source but given I was having trouble registering I moved to the gpm1.gesdisc source since it just requires URS Earthdata credentials.

That's weird, I have not tried recently, but for me the registration was easy.

Follow-up questions about the specific product we might want to use. There are a couple of options for the precipitation product:

  • Time span: the half-hourly product is of course more granular but only comes in HDF5. As an initial go I might use the daily product since it is in netCDF but do you know of an important use case that necessitates the half-hourly product?

I would say it's always better to have the data at the best resolution, because we can generate the other resolutions using e.g. xarray's coarsen method.
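
For example, a daily mean could be derived from the half-hourly store along these lines (a sketch, assuming gcsfs is installed so the gs:// store above opens directly):

import xarray as xr

# Open the half-hourly Zarr store mentioned earlier in this thread.
ds = xr.open_zarr("gs://pangeo-data/gpm_imerg/late/chunk_time")

# 48 half-hourly steps per day; boundary="trim" drops a trailing partial
# day before taking the daily mean precipitation rate. The variable name
# is as listed in the IMERG half-hourly product.
daily = ds["precipitationCal"].coarsen(time=48, boundary="trim").mean()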

  • "Late Run" vs "Final Run" - do you know if there is a preference from the science community on which product is used more and why?

Yeah, that's a good question. I guess I chose the Late Run because it was a trade-off between the Final Run and the Early Run 😄
It depends on the usage: some people might want to wait in order to get the best product, while others will want the data as quickly as possible even if it's not perfect. I guess the latter case (the Early Run) will have more value when we are able to continuously update the dataset, which I'm not sure pangeo-forge supports at the moment.

@davidbrochart
Author

Actually, looking at https://github.com/davidbrochart/pangeo_upload/blob/master/py/gpm2pangeo.py#L118-L121 might get me what I need to generate the Zarr store using the half-hour product.

Yes, there is everything in there, but I agree it's not that easy to read 😄

@abarciauskas-bgse

Thanks @davidbrochart, I am working on a script to generate the Zarr store using your code, but of course running into the limits of my Zarr chunking experience with how to handle the time dimension. There is the added complexity that right now the download code doesn't save files under their original names but under a hash of the source URL (so files are not necessarily going to be listed "in order" according to their datetime); one possible workaround is sketched at the end of this comment.

Perhaps we can find a time to discuss how to handle this - it looks like your code generates the time chunks more or less "from scratch", so I'm wondering if that is the only way. I will coordinate a time over email if that works for you!
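
For what it's worth, a sketch of the workaround mentioned above: sort by the timestamp embedded in each source URL rather than by the local filename. Here url_to_path is a hypothetical mapping kept by the download step:

import re
from datetime import datetime

# Hypothetical mapping kept by the download step: source URL -> hashed local path.
url_to_path = {
    "https://.../3B-HHR.MS.MRG.3IMERG.20200930-S003000-E005959.0030.V06B.HDF5": "cache/ab12",
    "https://.../3B-HHR.MS.MRG.3IMERG.20200930-S000000-E002959.0000.V06B.HDF5": "cache/ff09",
}

def granule_start(url):
    # IMERG half-hourly filenames embed the granule start time, e.g.
    # 3B-HHR.MS.MRG.3IMERG.20200930-S000000-E002959.0000.V06B.HDF5
    m = re.search(r"\.(\d{8})-S(\d{6})-E", url)
    return datetime.strptime(m.group(1) + m.group(2), "%Y%m%d%H%M%S")

# Sorting by the embedded timestamp restores time order regardless of the
# hashed local filenames.
paths_in_order = [url_to_path[u] for u in sorted(url_to_path, key=granule_start)]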

@abarciauskas-bgse

Update @davidbrochart: I still have to work out the Zarr chunking, but I think we can simplify the Zarr store generation using the code in https://github.com/developmentseed/example-pipeline/blob/abarciauskas-bgse_imerg/create_zarr.py

This seems to work for creating a Zarr store that looks like this (for 4 files):

>>> dsz = xr.open_zarr(target_zarr)
>>> dsz
<xarray.Dataset>
Dimensions:                         (lat: 1800, latv: 2, lon: 3600, lonv: 2, nv: 2, time: 4)
Coordinates:
  * lat                             (lat) float32 -89.95 -89.85 ... 89.85 89.95
  * lon                             (lon) float32 -179.95 -179.85 ... 179.95
  * time                            (time) object 2000-06-01 01:00:00 ... 200...
Dimensions without coordinates: latv, lonv, nv
Data variables:
    HQobservationTime               (time, lon, lat) timedelta64[ns] dask.array<chunksize=(1, 3600, 1800), meta=np.ndarray>
    HQprecipSource                  (time, lon, lat) float32 dask.array<chunksize=(1, 3600, 1800), meta=np.ndarray>
    HQprecipitation                 (time, lon, lat) float32 dask.array<chunksize=(1, 3600, 1800), meta=np.ndarray>
    IRkalmanFilterWeight            (time, lon, lat) float32 dask.array<chunksize=(1, 3600, 1800), meta=np.ndarray>
    IRprecipitation                 (time, lon, lat) float32 dask.array<chunksize=(1, 3600, 1800), meta=np.ndarray>
    lat_bnds                        (time, lat, latv) float32 dask.array<chunksize=(1, 1800, 2), meta=np.ndarray>
    lon_bnds                        (time, lon, lonv) float32 dask.array<chunksize=(1, 3600, 2), meta=np.ndarray>
    precipitationCal                (time, lon, lat) float32 dask.array<chunksize=(1, 3600, 1800), meta=np.ndarray>
    precipitationQualityIndex       (time, lon, lat) float32 dask.array<chunksize=(1, 3600, 1800), meta=np.ndarray>
    precipitationUncal              (time, lon, lat) float32 dask.array<chunksize=(1, 3600, 1800), meta=np.ndarray>
    probabilityLiquidPrecipitation  (time, lon, lat) float32 dask.array<chunksize=(1, 3600, 1800), meta=np.ndarray>
    randomError                     (time, lon, lat) float32 dask.array<chunksize=(1, 3600, 1800), meta=np.ndarray>
    time_bnds                       (time, nv) object dask.array<chunksize=(4, 2), meta=np.ndarray>
Attributes:
    GridHeader:  BinMethod=ARITHMETIC_MEAN;\nRegistration=CENTER;\nLatitudeRe...

... but I wanted you to take a look at this in case this method doesn't handle certain needs of this dataset that I'm not aware of.

@davidbrochart
Author

Indeed that is much simpler!
I'm wondering what's in the time coordinate though, is it out of order?
Also, what are latv, lonv and nv?

@abarciauskas-bgse

Time comes in order, although I think this required adding the argument combine="by_coords"

<xarray.DataArray 'time' (time: 95)>
array([cftime.DatetimeJulian(2000, 6, 1, 0, 0, 0, 0, 2, 153),
       cftime.DatetimeJulian(2000, 6, 1, 0, 30, 0, 13, 2, 153),
       cftime.DatetimeJulian(2000, 6, 1, 1, 0, 0, 0, 2, 153),
       cftime.DatetimeJulian(2000, 6, 1, 1, 30, 0, 0, 2, 153),
       cftime.DatetimeJulian(2000, 6, 1, 2, 0, 0, 13, 2, 153),
       cftime.DatetimeJulian(2000, 6, 1, 2, 30, 0, 0, 2, 153),
       cftime.DatetimeJulian(2000, 6, 1, 3, 0, 0, 0, 2, 153),
       cftime.DatetimeJulian(2000, 6, 1, 3, 30, 0, 13, 2, 153),

I am having trouble finding a data dictionary that explains the dimensions latv, lonv and nv, but I think they are the "bounds" dimensions for the lat, lon and time variables. If you have any success tracking down a data dictionary, let me know 😅
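
For reference, the open call I mean is roughly the following sketch (local_paths stands in for the downloaded granules; group="Grid" follows the IMERG HDF5 layout):

import xarray as xr

local_paths = ["..."]  # the downloaded granules, in whatever order they sit on disk

# combine="by_coords" orders the inputs by their coordinate values rather
# than by filename, which is why the hashed local filenames don't matter here.
ds = xr.open_mfdataset(local_paths, group="Grid", combine="by_coords")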

@rabernat
Contributor

Can these datasets be opened by xarray? If so, this might be ready to go with the latest version of pangeo forge. If not, please open an issue in pangeo-forge/pangeo-forge to describe what extra functionality is needed.

@ciaransweet

ciaransweet commented Jan 26, 2021

@rabernat the HHR dataset (example filename: 3B-HHR.MS.MRG.3IMERG.20200930-S000000-E002959.0000.V06B.HDF5) can be opened with xarray's open_dataset function, but I had to pass in the parameter group="Grid" to get it working when I was playing with the GPM IMERG data.

Complete example:

from xarray import open_dataset

path_to_file = "3B-HHR.MS.MRG.3IMERG.20200930-S000000-E002959.0000.V06B.HDF5"
dataset = open_dataset(path_to_file, group="Grid")

The same is needed for the MO (3B-MO.MS.MRG.3IMERG.20200901-S000000-E235959.09.V06B.HDF5) data, but the DAY data (3B-DAY.MS.MRG.3IMERG.20200930-S000000-E235959.V06.nc4) can be opened with just open_dataset(path_to_file).

@davidbrochart
Author

davidbrochart commented Jan 26, 2021

@CiaranEvans @abarciauskas-bgse I'm also working on this recipe. I opened an issue in pangeo-forge to report the problems I'm having. In addition to passing group="Grid" to xarray, we also need to pass credentials:

import aiohttp
# assuming the import path used by pangeo_forge at the time:
from pangeo_forge.recipes import NetCDFtoZarrSequentialRecipe

recipe = NetCDFtoZarrSequentialRecipe(
    input_urls=input_urls,                 # list of IMERG granule URLs, in time order
    sequence_dim="time",                   # concatenate inputs along the time dimension
    inputs_per_chunk=4,                    # four half-hourly files per Zarr chunk
    xarray_open_kwargs={'group': 'Grid'},  # IMERG variables live in the "Grid" group
    fsspec_open_kwargs={'client_kwargs': {'auth': aiohttp.BasicAuth('username', 'password')}}
)

I have opened pangeo-forge/pangeo-forge-recipes#59 for that.
I also have an issue with the time coordinate.
I don't think I can submit a PR for this recipe in https://github.com/pangeo-forge/staged-recipes before pangeo-forge/pangeo-forge-recipes#59 and pangeo-data/rechunker#77 are merged, but it looks like we should coordinate if we all work on the same recipe.

@ciaransweet

Hey @davidbrochart - Definitely, I don't want to re-implement something you've already done well!

We were actually hoping to convert it to COG (Cloud Optimised GeoTIFF), though I'm now wondering if that is worth another ticket.

@davidbrochart
Author

We were actually hoping to convert it to COG (Cloud Optimised GeoTIFF), though I'm now wondering if that is worth another ticket.

Good question. Is it possible for a recipe to have multiple outputs @rabernat (Zarr and COG in this case)?

@ciaransweet

From everything I've read so far, it seems the majority of Recipes are Zarr-heavy. We'd quite like COG to become a first-class output too, even if that means something like having both NetCDFToZarr and NetCDFToCog.

@rabernat
Contributor

YES to COG! We just need an issue to track that. We would like to support many different input and output formats.

@ciaransweet

@rabernat cool - should I post an issue on here, or should it go to pangeo-forge instead?

@rabernat
Contributor

The idea is that here we post ideas for specific recipes and on pangeo-forge we post specific atomic feature enhancements needed to support those recipes. Since we already have a motivating recipe (this one), I think we just need the issue for COG output.

@ciaransweet

Okay, so COG-specific features go to pangeo-forge, correct?

@rabernat
Contributor

I think this should be able to work with the latest master of pangeo-forge (now that pangeo-forge/pangeo-forge-recipes#59 is in).

We still don't have a working system for formally submitting new recipes. I would recommend making a new repo and putting a single recipe.py file in it for now.

@rabernat
Contributor

p.s. I assume you don't want to put the credentials directly in git. I imagined using github secrets for this. So the recipe could be configured to pull the secrets from a github workflow environment variable. I believe this would be secure and convenient.
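
Concretely, the recipe could read them like this (the secret and variable names here are made up):

import os
import aiohttp

# GPM_USERNAME / GPM_PASSWORD would be exposed to the workflow step from
# GitHub secrets, e.g. in the workflow yaml:
#   env:
#     GPM_USERNAME: ${{ secrets.GPM_USERNAME }}
#     GPM_PASSWORD: ${{ secrets.GPM_PASSWORD }}
auth = aiohttp.BasicAuth(os.environ["GPM_USERNAME"], os.environ["GPM_PASSWORD"])
fsspec_open_kwargs = {"client_kwargs": {"auth": auth}}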

@davidbrochart
Author

I think this should be able to work with the latest master of pangeo-forge (now that pangeo-forge/pangeo-forge-recipes#59 is in).

It would be nice to have a release of rechunker, although we can also use the latest master.

@davidbrochart
Author

p.s. I assume you don't want to put the credentials directly in git. I imagined using github secrets for this. So the recipe could be configured to pull the secrets from a github workflow environment variable. I believe this would be secure and convenient.

I set up a repository for the GPM IMERG recipe and use GitHub secrets for the credentials. I am able to reproduce the error about cftime coordinates; see https://github.com/davidbrochart/pangeo-forge-recipes/runs/1779828142

@rabernat
Contributor

I set up a repository for the GPM IMERG recipe

🎉

I am able to reproduce the error about cftime coordinates,

Well if you have any insight on what is going on there, PR welcome!
