Propagating all dataset metadata #72

Open
HeatherSavoy-USDA opened this issue May 3, 2022 · 6 comments
Labels
enhancement New feature or request

@HeatherSavoy-USDA
Collaborator

Expand metadata about datasets to include:

  1. A notes section providing dataset-specific comments to the user, e.g. DaymetV4's monthly timestamps look like YYYY-MM-DD HH, but we overwrote them to be YYYY-MM.
  2. Any metadata in native file formats that cannot be propagated to output file formats should be included in our metadata output.
@HeatherSavoy-USDA HeatherSavoy-USDA added the enhancement New feature or request label May 3, 2022
@HeatherSavoy-USDA HeatherSavoy-USDA changed the title expanding dataset metadata Propagating all dataset metadata May 18, 2022
@HeatherSavoy-USDA
Collaborator Author

HeatherSavoy-USDA commented May 18, 2022

From @stuckyb: we could probably implement format-specific, generic metadata capture methods. E.g., NetCDF, by way of xarray, can attach arbitrary properties to variables. We could capture those as generic metadata. Unit testing would check that generic metadata capture works as expected.
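
A minimal sketch of the generic capture idea, assuming only that the opened object exposes a dict-like `attrs` and a `data_vars` mapping whose values carry their own `attrs` (as `xarray.Dataset` does). The function name `capture_generic_metadata` and the stand-in objects are hypothetical and not part of this project's code.

```python
from types import SimpleNamespace


def capture_generic_metadata(dataset):
    """Collect global and per-variable attributes into plain dicts.

    Works on any object shaped like xarray.Dataset: a dict-like
    ``attrs`` plus a ``data_vars`` mapping of name -> variable,
    where each variable has its own ``attrs``.
    """
    return {
        "global": dict(dataset.attrs),
        "variables": {
            name: dict(var.attrs) for name, var in dataset.data_vars.items()
        },
    }


# Stand-in for an opened NetCDF file (hypothetical names and values).
fake_ds = SimpleNamespace(
    attrs={"title": "Example dataset", "Conventions": "CF-1.6"},
    data_vars={
        "prcp": SimpleNamespace(
            attrs={"units": "mm/day", "long_name": "precipitation"}
        ),
    },
)

captured = capture_generic_metadata(fake_ds)
```

Because the function only relies on `attrs` and `data_vars`, a unit test can exercise it with a small stand-in like the one above, without needing a real data file.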

@HeatherSavoy-USDA
Collaborator Author

HeatherSavoy-USDA commented May 18, 2022

@MelanieVeron-USDA Could you help with this issue? We currently are supporting netCDF and GeoTIFF format for rasters. Could you check what metadata entries each format can hold? Then we will see which are automatically read in by xarray and which we will need to handle ourselves (e.g. #77).

@mveron23
Collaborator

mveron23 commented Jun 23, 2022

@MelanieVeron-USDA Could you help with this issue? We currently are supporting netCDF and GeoTIFF format for rasters. Could you check what metadata entries each format can hold? Then we will see which are automatically read in by xarray and which we will need to handle ourselves (e.g. #77).

For both file formats, there does not appear to be a set limit on how many or what kinds of global attributes a file can hold (metadata such as data set title, units, data description, URLs to data source and metadata/informational websites, miscellaneous comments, other custom metadata tags, etc.). That said, there are metadata standards that sources/file creators generally follow (linked below), though the method/package used to add metadata can vary.

For GeoTIFF files:

Data Curation Network - GeoTIFF Primer - This primer gives a comprehensive summary of the GeoTIFF format and what kinds of metadata it can contain, including minimum/recommended metadata elements. Links to recommended metadata standards (ISO 19139, FGDC, OGC (Open Geospatial Consortium) GeoTIFF Standard v1.1, GeoBlacklight 1.0) are also provided.

The packages rioxarray, xarray, and gdal (from the osgeo package) can be used to load GeoTIFF files and explore their metadata in Python. Both the metadata a file contains (as defined by its source/author) and the metadata that is shown (depending on the package used to load the data) are inconsistent. In general, rioxarray retains/shows more attributes (metadata) than xarray, and gdal includes extra gdal-specific metadata that both rioxarray and xarray miss (at minimum, the AREA_OR_POINT tag, the meaning of which is briefly explained here). That said, gdal may omit certain metadata that rioxarray includes, so it may be best practice to use both rioxarray and gdal (but not xarray) to extract as much metadata as possible.
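
One way to combine the two readers could look like the sketch below. The helper names `merge_metadata` and `read_geotiff_metadata` are hypothetical, as are the example values. The sketch assumes dataset-level metadata only (`rioxarray.open_rasterio(path).attrs` and `gdal.Open(path).GetMetadata()`); band-level metadata would need separate handling.

```python
def merge_metadata(by_reader):
    """Merge {reader_name: attrs}; the first reader to report a key wins.

    Returns (merged, conflicts), where conflicts maps a key to the
    differing values seen per reader. Values are compared by repr so
    that array-like values do not raise on comparison.
    """
    merged, owner, conflicts = {}, {}, {}
    for reader, attrs in by_reader.items():
        for key, value in attrs.items():
            if key not in merged:
                merged[key] = value
                owner[key] = reader
            elif repr(merged[key]) != repr(value):
                conflicts.setdefault(key, {owner[key]: merged[key]})[reader] = value
    return merged, conflicts


def read_geotiff_metadata(path):
    """Read dataset-level metadata with both packages.

    Requires rioxarray and GDAL to be installed; not called below.
    """
    import rioxarray
    from osgeo import gdal

    rio_attrs = dict(rioxarray.open_rasterio(path).attrs)
    gdal_attrs = dict(gdal.Open(path).GetMetadata())
    return merge_metadata({"rioxarray": rio_attrs, "gdal": gdal_attrs})


# Hypothetical attribute values, standing in for what each reader returns.
merged, conflicts = merge_metadata(
    {
        "rioxarray": {"scale_factor": 1.0, "units": "mm"},
        "gdal": {"AREA_OR_POINT": "Area", "units": "millimeters"},
    }
)
```

Keeping the conflicts separate, rather than silently overwriting, makes it possible to report where the two packages disagree about the same key.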

For NetCDF files:

NetCDF Climate and Forecast (CF) Metadata Conventions - This page provides metadata standards for the NetCDF format.

The packages rioxarray, xarray, gdal (from the osgeo package), and netCDF4 can be used to load NetCDF files and explore their metadata in Python. As with GeoTIFF files, both the metadata a file contains (as defined by its source/author) and the metadata that is shown (depending on the package used to load the data) are inconsistent. Some NetCDF files fail to open with rioxarray (for unknown reasons) but do open with xarray. When a file can be opened with both rioxarray and xarray, xarray is the package that fully includes data set- and variable-level attributes/metadata (excluding gdal-specific metadata). The gdal and netCDF4 packages include all metadata from the file, including gdal-specific metadata. For extracting metadata in Python, it may be best to use xarray in conjunction with either gdal or netCDF4.
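
To see which attributes one package reports and another does not, a small comparison helper could be used. This is only a sketch: `compare_attrs` is a hypothetical name, and the attribute dictionaries below are made-up stand-ins rather than output from a real file.

```python
def compare_attrs(name_a, attrs_a, name_b, attrs_b):
    """Report which attribute keys each reader exposes.

    With real files, the dictionaries could come from, for example:
      xarray:  dict(xarray.open_dataset(path).attrs)
      netCDF4: nc = netCDF4.Dataset(path)
               {k: nc.getncattr(k) for k in nc.ncattrs()}
    """
    keys_a, keys_b = set(attrs_a), set(attrs_b)
    return {
        f"only_{name_a}": sorted(keys_a - keys_b),
        f"only_{name_b}": sorted(keys_b - keys_a),
        "shared": sorted(keys_a & keys_b),
    }


# Hypothetical global attributes as reported by two readers.
xarray_attrs = {"title": "Example dataset", "Conventions": "CF-1.6"}
netcdf4_attrs = {
    "title": "Example dataset",
    "Conventions": "CF-1.6",
    "extra_tag": "example value",
}

report = compare_attrs("xarray", xarray_attrs, "netCDF4", netcdf4_attrs)
```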

Link to GeoTIFF/NetCDF file metadata exploration python scripts & html outputs (data files not included due to size limit): scripts.zip

@HeatherSavoy-USDA HeatherSavoy-USDA self-assigned this Jul 18, 2022
HeatherSavoy-USDA added a commit that referenced this issue Sep 19, 2022
To include in metadata output when the content doesn't fit elsewhere

Per #72
@HeatherSavoy-USDA
Collaborator Author

HeatherSavoy-USDA commented Sep 19, 2022

I added the first component in 888d8d7.

For the second component, it is less straightforward:

  • We lose all metadata that rioxarray reads in when we write extracted point data to any file format.
  • The metadata that is read in can be at least partially file-specific and not just dataset-specific. So how much do we include in our metadata JSON when we have point output? If just that which is dataset-specific, do we include it in the catalog like provider info?
  • DaymetV4 has different metadata in its DataArray attributes depending on whether the file is a GeoTIFF or NetCDF.
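
One possible heuristic for the dataset-specific versus file-specific question, offered only as a sketch: treat attributes that are identical across every file of a dataset as dataset-level, and keep the rest per file. The function name `split_common_attrs`, the JSON keys, and the file names and values below are all hypothetical.

```python
import json


def split_common_attrs(attrs_by_file):
    """Separate attributes shared by every file from file-specific ones."""
    if not attrs_by_file:
        return {}, {}
    files = list(attrs_by_file)
    first = attrs_by_file[files[0]]
    # An attribute is dataset-level if every other file has the same value.
    common = {
        key: value
        for key, value in first.items()
        if all(
            key in attrs_by_file[f] and attrs_by_file[f][key] == value
            for f in files[1:]
        )
    }
    file_specific = {
        f: {k: v for k, v in attrs.items() if k not in common}
        for f, attrs in attrs_by_file.items()
    }
    return common, file_specific


# Hypothetical attributes read from two files of the same dataset.
attrs_by_file = {
    "prcp_2019.nc": {"source": "Example provider", "units": "mm/day", "start_year": 2019},
    "prcp_2020.nc": {"source": "Example provider", "units": "mm/day", "start_year": 2020},
}

common, file_specific = split_common_attrs(attrs_by_file)
metadata_json = json.dumps(
    {"dataset_attrs": common, "file_attrs": file_specific}, indent=2
)
```

This would not resolve the GeoTIFF versus NetCDF difference on its own, since attributes that differ by format would all land in the file-specific group.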

@HeatherSavoy-USDA
Collaborator Author

This may be addressed in the implementation of #34
