
Example pipeline for IMERG #5

Open · davidbrochart opened this issue Jul 30, 2020 · 30 comments
@davidbrochart

Source Dataset

IMERG is a dataset of 0.1° half-hourly precipitation estimates over the majority of the Earth's surface from 2000-present.

Transformation / Alignment / Merging

Files should be concatenated along the time dimension.

Output Dataset

1 Zarr store - chunks oriented for both time series and spatial analysis.
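
For illustration, a minimal xarray sketch of writing such a store (toy stand-in data; the variable name, chunk sizes, and output path are placeholders, not a recommendation):

import numpy as np
import pandas as pd
import xarray as xr

# Stand-in for a few concatenated half-hourly granules on the 0.1° grid
# (the values are random placeholders).
ds = xr.Dataset(
    {"precipitationCal": (("time", "lat", "lon"),
                          np.random.rand(4, 1800, 3600).astype("float32"))},
    coords={
        "time": pd.date_range("2000-06-01", periods=4, freq="30min"),
        "lat": np.linspace(-89.95, 89.95, 1800),
        "lon": np.linspace(-179.95, 179.95, 3600),
    },
)

# Moderately sized chunks along every dimension are the usual compromise
# between pure time-series access and pure spatial access.
ds.chunk({"time": 4, "lat": 900, "lon": 900}).to_zarr("imerg.zarr", mode="w")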

@abarciauskas-bgse

@davidbrochart - I'm interested in picking up this ticket but curious if you know of use cases or users I should be aware of in generating the Zarr store. This could inform how we select variables and chunk configuration for the output.

@davidbrochart
Author

Thanks @abarciauskas-bgse, I don't have much time to work on it, so I would be happy if you could pick it up.
I have already uploaded some GPM IMERG data at gs://pangeo-data/gpm_imerg/late/chunk_time, and I used this script to do so:
https://github.com/davidbrochart/pangeo_upload/blob/master/py/gpm2pangeo.py
You should find information in the Zarr store and in the script, but otherwise don't hesitate to ask me.

@davidbrochart
Author

@abarciauskas-bgse I'm curious about your progress on this recipe, and I would definitely like to contribute, if you need my help.

@abarciauskas-bgse

Hey David, thanks for reaching out. @sharkinsspatial and I are working on this together. So far I have been working on adapting the example-pipeline for IMERG, see https://github.com/developmentseed/example-pipeline/tree/abarciauskas-bgse_imerg

Fetching and storing the HDF files is working; what's not working is combine_and_write, which makes sense, as the source is now the IMERG HDF5 files and not the SST NetCDF files. So my next step (🤞 tomorrow) is to review your code for translating the HDF5 files to Zarr.

@sharkinsspatial is working on a cloud deployment of Prefect using Fargate + Dask, which I know less about.

@davidbrochart
Author

Awesome, I just remembered this issue, but that was a long time ago and maybe it's not relevant anymore.

@abarciauskas-bgse

Ah yes, thanks for pointing this out @davidbrochart - I started finagling with a URL pattern, but once I realized there are things about the URL pattern that don't easily translate from a datetime, I started just using Beautiful Soup to parse all the HDF5 file links from each of the Julian day-level parent directory pages.
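
A rough sketch of that scraping step, assuming a day-level directory URL of the form below (the exact path layout and the authentication handling are assumptions):

import requests
from bs4 import BeautifulSoup

# Hypothetical day-level directory page (year 2000, Julian day 153); in
# practice the request would also need Earthdata authentication.
day_url = "https://gpm1.gesdisc.eosdis.nasa.gov/data/GPM_L3/GPM_3IMERGHH.06/2000/153/"

soup = BeautifulSoup(requests.get(day_url).text, "html.parser")

# Grab every link that points at an HDF5 granule, de-duplicated and sorted
# so the half-hourly files come back in time order.
hdf5_links = sorted(
    {a["href"] for a in soup.find_all("a", href=True) if a["href"].endswith(".HDF5")}
)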

@davidbrochart
Author

BTW, I see you're using this URL:

https://gpm1.gesdisc.eosdis.nasa.gov/...

instead of this one in my original script:

https://jsimpsonhttps.pps.eosdis.nasa.gov/...

Are they equivalent?

@abarciauskas-bgse

Oh thanks for reminding me about the difference in sources; I had difficulty signing up for a PPS account (the registration process seems to send me in circles). Could you share an example file from https://jsimpsonhttps.pps.eosdis.nasa.gov with [email protected] so I can compare it with the file for the same datetime from gpm1.gesdisc?

From a while back I vaguely recall that you are able to get data in "real time" from the jsimpsonhttps PPS source but given I was having trouble registering I moved to the gpm1.gesdisc source since it just requires URS Earthdata credentials.

Follow-up questions about the specific product we might want to use. There are a couple of options for the precipitation product:

  • Time span: the half-hourly product is of course more granular but only comes in HDF5. As an initial go I might use the daily product since it is in netCDF but do you know of an important use case that necessitates the half-hourly product?
  • "Late Run" vs "Final Run" - do you know if there is a preference from the science community on which product is used more and why?

    The half-hourly Final Run product uses a month-to-month adjustment to the monthly Final Run product, which combines the multi-satellite data for the month with GPCC gauge analysis. The adjustment within the month in each half hour is a ratio multiplier that's fixed for the month, but spatially varying.
    The Late Run is computed about 14 hours after observation time, so sometimes a microwave overpass is not delivered in time for the Late Run, but subsequently comes in and can be used in the Final. This would affect both the half hour in which the overpass occurs, and (potentially) morphed values in nearby half hours.

@abarciauskas-bgse

Actually, looking at https://github.com/davidbrochart/pangeo_upload/blob/master/py/gpm2pangeo.py#L118-L121 might get me what I need to generate the Zarr store using the half-hour product. 🙌

@davidbrochart
Author

Oh thanks for reminding me about the difference in sources; I had difficulty signing up for a PPS account (the registration process seems to send me in circles). Could you share an example file from https://jsimpsonhttps.pps.eosdis.nasa.gov with [email protected] so I can compare it with the file for the same datetime as from gpm1.gesdisc?

I just sent you an email.

From a while back I vaguely recall that you are able to get data in "real time" from the jsimpsonhttps PPS source but given I was having trouble registering I moved to the gpm1.gesdisc source since it just requires URS Earthdata credentials.

That's weird, I have not tried recently, but for me the registration was easy.

Follow-up questions about the specific product we might want to use. There are a couple of options for the precipitation product:

  • Time span: the half-hourly product is of course more granular but only comes in HDF5. As an initial go I might use the daily product since it is in netCDF but do you know of an important use case that necessitates the half-hourly product?

I would say it's always better to have the data at the best resolution, because we can generate the other resolutions using e.g. xarray's coarsen method.
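
For example, a daily mean could be derived from the half-hourly store along these lines (a sketch, assuming gcsfs is installed so the gs:// store above opens directly):

import xarray as xr

# Open the half-hourly Zarr store mentioned earlier in this thread.
ds = xr.open_zarr("gs://pangeo-data/gpm_imerg/late/chunk_time")

# 48 half-hourly steps per day; boundary="trim" drops a trailing partial
# day before taking the daily mean precipitation rate. The variable name
# is as listed in the IMERG half-hourly product.
daily = ds["precipitationCal"].coarsen(time=48, boundary="trim").mean()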

  • "Late Run" vs "Final Run" - do you know if there is a preference from the science community on which product is used more and why?

Yeah, that's a good question. I guess I chose the Late Run because it was a trade-off between the Final Run and the Early Run 😄
It depends on the usage: some people might want to wait in order to get the best product, while others will want the data as quickly as possible even if it's not perfect. I guess the latter case (the Early Run) will have more value when we are able to continuously update the dataset, which I'm not sure pangeo-forge supports at the moment.

@davidbrochart
Author

Actually, looking at https://github.com/davidbrochart/pangeo_upload/blob/master/py/gpm2pangeo.py#L118-L121 might get me what I need to generate the Zarr store using the half-hour product.

Yes, there is everything in there, but I agree it's not that easy to read 😄

@abarciauskas-bgse

Thanks @davidbrochart, I am working on a script to generate the Zarr store using your code, but of course running into the limits of my Zarr chunking experience with how to handle the time dimension. There is the added complexity that right now the download code doesn't save files under their original names but under a hash of the source URL (so files are not necessarily going to be listed "in order" according to their datetime); one possible workaround is sketched at the end of this comment.

Perhaps we can find a time to discuss how to handle this - it looks like your code generates the time chunks more or less "from scratch", so I'm wondering if that is the only way. I will coordinate a time over email if that works for you!
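
For what it's worth, a sketch of the workaround mentioned above: sort by the timestamp embedded in each source URL rather than by the local filename. Here url_to_path is a hypothetical mapping kept by the download step:

import re
from datetime import datetime

# Hypothetical mapping kept by the download step: source URL -> hashed local path.
url_to_path = {
    "https://.../3B-HHR.MS.MRG.3IMERG.20200930-S003000-E005959.0030.V06B.HDF5": "cache/ab12",
    "https://.../3B-HHR.MS.MRG.3IMERG.20200930-S000000-E002959.0000.V06B.HDF5": "cache/ff09",
}

def granule_start(url):
    # IMERG half-hourly filenames embed the granule start time, e.g.
    # 3B-HHR.MS.MRG.3IMERG.20200930-S000000-E002959.0000.V06B.HDF5
    m = re.search(r"\.(\d{8})-S(\d{6})-E", url)
    return datetime.strptime(m.group(1) + m.group(2), "%Y%m%d%H%M%S")

# Sorting by the embedded timestamp restores time order regardless of the
# hashed local filenames.
paths_in_order = [url_to_path[u] for u in sorted(url_to_path, key=granule_start)]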

@abarciauskas-bgse

Update @davidbrochart: I still have to work out the Zarr chunking, but I think we can simplify the Zarr store generation using the code in https://github.com/developmentseed/example-pipeline/blob/abarciauskas-bgse_imerg/create_zarr.py

This seems to work for creating a Zarr store that looks like this (for 4 files):

>>> dsz = xr.open_zarr(target_zarr)
>>> dsz
<xarray.Dataset>
Dimensions:                         (lat: 1800, latv: 2, lon: 3600, lonv: 2, nv: 2, time: 4)
Coordinates:
  * lat                             (lat) float32 -89.95 -89.85 ... 89.85 89.95
  * lon                             (lon) float32 -179.95 -179.85 ... 179.95
  * time                            (time) object 2000-06-01 01:00:00 ... 200...
Dimensions without coordinates: latv, lonv, nv
Data variables:
    HQobservationTime               (time, lon, lat) timedelta64[ns] dask.array<chunksize=(1, 3600, 1800), meta=np.ndarray>
    HQprecipSource                  (time, lon, lat) float32 dask.array<chunksize=(1, 3600, 1800), meta=np.ndarray>
    HQprecipitation                 (time, lon, lat) float32 dask.array<chunksize=(1, 3600, 1800), meta=np.ndarray>
    IRkalmanFilterWeight            (time, lon, lat) float32 dask.array<chunksize=(1, 3600, 1800), meta=np.ndarray>
    IRprecipitation                 (time, lon, lat) float32 dask.array<chunksize=(1, 3600, 1800), meta=np.ndarray>
    lat_bnds                        (time, lat, latv) float32 dask.array<chunksize=(1, 1800, 2), meta=np.ndarray>
    lon_bnds                        (time, lon, lonv) float32 dask.array<chunksize=(1, 3600, 2), meta=np.ndarray>
    precipitationCal                (time, lon, lat) float32 dask.array<chunksize=(1, 3600, 1800), meta=np.ndarray>
    precipitationQualityIndex       (time, lon, lat) float32 dask.array<chunksize=(1, 3600, 1800), meta=np.ndarray>
    precipitationUncal              (time, lon, lat) float32 dask.array<chunksize=(1, 3600, 1800), meta=np.ndarray>
    probabilityLiquidPrecipitation  (time, lon, lat) float32 dask.array<chunksize=(1, 3600, 1800), meta=np.ndarray>
    randomError                     (time, lon, lat) float32 dask.array<chunksize=(1, 3600, 1800), meta=np.ndarray>
    time_bnds                       (time, nv) object dask.array<chunksize=(4, 2), meta=np.ndarray>
Attributes:
    GridHeader:  BinMethod=ARITHMETIC_MEAN;\nRegistration=CENTER;\nLatitudeRe...

... but I wanted you to take a look at this in case this method doesn't handle certain needs of this dataset that I'm not aware of.

@davidbrochart
Author

Indeed that is much simpler!
I'm wondering what's in the time coordinate though, is it out of order?
Also, what are latv, lonv and nv?

@abarciauskas-bgse

Time comes in order, although I think this required adding the argument combine="by_coords"

<xarray.DataArray 'time' (time: 95)>
array([cftime.DatetimeJulian(2000, 6, 1, 0, 0, 0, 0, 2, 153),
       cftime.DatetimeJulian(2000, 6, 1, 0, 30, 0, 13, 2, 153),
       cftime.DatetimeJulian(2000, 6, 1, 1, 0, 0, 0, 2, 153),
       cftime.DatetimeJulian(2000, 6, 1, 1, 30, 0, 0, 2, 153),
       cftime.DatetimeJulian(2000, 6, 1, 2, 0, 0, 13, 2, 153),
       cftime.DatetimeJulian(2000, 6, 1, 2, 30, 0, 0, 2, 153),
       cftime.DatetimeJulian(2000, 6, 1, 3, 0, 0, 0, 2, 153),
       cftime.DatetimeJulian(2000, 6, 1, 3, 30, 0, 13, 2, 153),

I am having trouble finding a data dictionary that explains the dimensions latv, lonv and nv, but I think they are the "bounds" dimensions for the lat, lon and time variables. If you have any success tracking down a data dictionary, let me know 😅
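
For reference, the open call I mean is roughly the following sketch (local_paths stands in for the downloaded granules; group="Grid" follows the IMERG HDF5 layout):

import xarray as xr

local_paths = ["..."]  # the downloaded granules, in whatever order they sit on disk

# combine="by_coords" orders the inputs by their coordinate values rather
# than by filename, which is why the hashed local filenames don't matter here.
ds = xr.open_mfdataset(local_paths, group="Grid", combine="by_coords")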

@rabernat
Contributor

Can these datasets be opened by xarray? If so, this might be ready to go with the latest version of pangeo forge. If not, please open an issue in pangeo-forge/pangeo-forge to describe what extra functionality is needed.

@ciaransweet

ciaransweet commented Jan 26, 2021

@rabernat the HHR dataset (example filename: 3B-HHR.MS.MRG.3IMERG.20200930-S000000-E002959.0000.V06B.HDF5) can be opened with xarray's open_dataset function, but I had to pass in the parameter group="Grid" to get it working when I was playing with the GPM IMERG data.

Complete example:

from xarray import open_dataset

path_to_file = "3B-HHR.MS.MRG.3IMERG.20200930-S000000-E002959.0000.V06B.HDF5"
dataset = open_dataset(path_to_file, group="Grid")

The same is needed for the MO (3B-MO.MS.MRG.3IMERG.20200901-S000000-E235959.09.V06B.HDF5) data, but the DAY data (3B-DAY.MS.MRG.3IMERG.20200930-S000000-E235959.V06.nc4) can be opened with just open_dataset(path_to_file).

@davidbrochart
Author

davidbrochart commented Jan 26, 2021

@CiaranEvans @abarciauskas-bgse I'm also working on this recipe. I opened an issue in pangeo-forge to report the problems I'm having. In addition to passing group="Grid" to xarray, we also need to pass credentials:

import aiohttp
# assuming the import path used by pangeo_forge at the time:
from pangeo_forge.recipes import NetCDFtoZarrSequentialRecipe

recipe = NetCDFtoZarrSequentialRecipe(
    input_urls=input_urls,                 # list of IMERG granule URLs, in time order
    sequence_dim="time",                   # concatenate inputs along the time dimension
    inputs_per_chunk=4,                    # four half-hourly files per Zarr chunk
    xarray_open_kwargs={'group': 'Grid'},  # IMERG variables live in the "Grid" group
    fsspec_open_kwargs={'client_kwargs': {'auth': aiohttp.BasicAuth('username', 'password')}}
)

I have opened pangeo-forge/pangeo-forge-recipes#59 for that.
I also have an issue with the time coordinate.
I don't think I can submit a PR for this recipe in https://github.com/pangeo-forge/staged-recipes before pangeo-forge/pangeo-forge-recipes#59 and pangeo-data/rechunker#77 are merged, but it looks like we should coordinate if we all work on the same recipe.

@ciaransweet

Hey @davidbrochart - Definitely, I don't want to re-implement something you've already done well!

We were actually hoping to convert it to COG (Cloud Optimised GeoTIFF), though I'm now wondering if that is worth another ticket.

@davidbrochart
Author

We were actually hoping to convert it to COG (Cloud Optimised GeoTIFF), though I'm now wondering if that is worth another ticket.

Good question. Is it possible for a recipe to have multiple outputs @rabernat (Zarr and COG in this case)?

@ciaransweet

From everything I've read so far, it seems the majority of Recipes are Zarr-heavy. We'd quite like COG to become a first-class output too, even if that means something like having both NetCDFToZarr and NetCDFToCog.

@rabernat
Contributor

YES to COG! We just need an issue to track that. We would like to support many different input and output formats.

@ciaransweet

@rabernat cool - should I post an issue on here, or should it go to pangeo-forge instead?

@rabernat
Contributor

The idea is that here we post ideas for specific recipes and on pangeo-forge we post specific atomic feature enhancements needed to support those recipes. Since we already have a motivating recipe (this one), I think we just need the issue for COG output.

@ciaransweet

Okay, so COG-specific features go to pangeo-forge, correct?

@rabernat
Contributor

I think this should be able to work with the latest master of pangeo-forge (now that pangeo-forge/pangeo-forge-recipes#59 is in).

We still don't have a working system for formally submitting new recipes. I would recommend making a new repo and putting a single recipe.py file in it for now.

@rabernat
Contributor

p.s. I assume you don't want to put the credentials directly in git. I imagined using github secrets for this. So the recipe could be configured to pull the secrets from a github workflow environment variable. I believe this would be secure and convenient.
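
Concretely, the recipe could read them like this (the secret and variable names here are made up):

import os
import aiohttp

# GPM_USERNAME / GPM_PASSWORD would be exposed to the workflow step from
# GitHub secrets, e.g. in the workflow yaml:
#   env:
#     GPM_USERNAME: ${{ secrets.GPM_USERNAME }}
#     GPM_PASSWORD: ${{ secrets.GPM_PASSWORD }}
auth = aiohttp.BasicAuth(os.environ["GPM_USERNAME"], os.environ["GPM_PASSWORD"])
fsspec_open_kwargs = {"client_kwargs": {"auth": auth}}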

@davidbrochart
Author

I think this should be able to work with the latest master of pangeo-forge (now that pangeo-forge/pangeo-forge-recipes#59 is in).

It would be nice to have a release of rechunker, although we can also use the latest master.

@davidbrochart
Author

p.s. I assume you don't want to put the credentials directly in git. I imagined using github secrets for this. So the recipe could be configured to pull the secrets from a github workflow environment variable. I believe this would be secure and convenient.

I set up a repository for the GPM IMERG recipe and use GitHub secrets for the credentials. I am able to reproduce the error about cftime coordinates; see https://github.com/davidbrochart/pangeo-forge-recipes/runs/1779828142

@rabernat
Contributor

I set up a repository for the GPM IMERG recipe

🎉

I am able to reproduce the error about cftime coordinates,

Well if you have any insight on what is going on there, PR welcome!
