implement data storage directory structure #31

Closed
tomkralidis opened this issue Nov 16, 2021 · 34 comments

@tomkralidis
Collaborator

Create a storage directory structure for wis2node to support the various data/files managed within an installation.

  • data
    • local data processing
      • incoming
      • outgoing
    • downloaded data from other wis2node services
  • metadata
    • discovery
    • station
@tomkralidis
Collaborator Author

+ @petersilva

Initial proposal to feed discussion:

|____data
| |____outgoing  # for publication
| |____public  # data available via PubSub and STAC
| | |____YYYY-MM-DD
| | | |____originating-centre
| | | | |____data-category
| | | | | |____dataset-name
| |____incoming  # incoming (CSV) data to be processed
|____metadata  # to feed API publication and OSCAR/Surface caching
| |____discovery
| |____station

@tomkralidis
Collaborator Author

Based on discussions with @petersilva

Example structure: /<RFC3339>/<source>/<ISO-3166-2>/<originating-centre>

@tomkralidis
Collaborator Author

tomkralidis commented Dec 1, 2021

updated proposal:

|____data
| |____config
| |____outgoing
| |____public
| | |____YYYY-MM-DD
| | | |____source
| | | | |____country-code
| | | | | |____originating-centre
| | | | | | |____data-category
| | | | | | | |____dataset-name
| |____incoming
| | |____YYYY-MM-DD
| |____errors
| | |____YYYY-MM-DD
|____metadata
| |____discovery
| |____station

@efucile
Member

efucile commented Dec 2, 2021

my suggestion

|____data
| |____config
| |____outgoing
| |____public
| | |____YYYY-MM-DD
| | | |____source #please define source. Is it an identifier for the broker?
| | | | |____tree type = land observations #different tree structure depending on land, ocean, satellite obs or other products
| | | | | |____country-code #ocean, satellite, NWP data don't have country-code
| | | | | | |____originating-centre
| | | | | | | |____data-category
| | | | | | | | |____dataset-name
| |____incoming
| | |____YYYY-MM-DD
| |____errors
| | |____YYYY-MM-DD
|____metadata
| |____discovery
| |____station

@tomkralidis
Collaborator Author

  • /data/public/YYYY-MM-DD/source would be a fixed value of wis here
  • tree type: the thinking was that data-category would cover this. Having said this, see the update below
  • country-code: perhaps we omit in lieu of originating-centre only?

Update (narrowed to /data/public):

| |____public
| | |____YYYY-MM-DD
| | | |____source
| | | | |____data-category/tree-type
| | | | | |____originating-centre
| | | | | | |____dataset-name

One issue would be that "data-category/tree-type" could potentially have varying levels of nesting.

cc @petersilva for comments/insight

@efucile
Member

efucile commented Dec 2, 2021

The originating center is not a country. You may want to have a country-level split and also know who is generating the data.

@tomkralidis
Collaborator Author

Update based on discussion with ECMWF:

|____public
| |____YYYY-MM-DD
| | |____source
| | | |____tree-type
| | | | |____country-code
| | | | | |____originating-centre
| | | | | | |____data-category

Notes: the dataset-name (now removed) would be a composite of tree-type.country-code.originating-centre.data-category

@tomkralidis
Collaborator Author

Update:

  • there will be varying levels of hierarchy based on the data category (observations, NWP, satellite, ocean, etc.)

Building with surface observations as critical path:

  • data_category.country_code.originating_centre.station_type.representation

where:

Example: observations-surface-land.ca.cwao.landFixed.bufr
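For illustration, a minimal Python sketch of how such a composite could be assembled and mapped onto directories, one level per folder. The function names build_topic and topic_to_path are hypothetical and not part of wis2node. The level names and example values come from the proposal above.

```python
from pathlib import PurePosixPath

# Level names taken from the composite proposed above, in order.
LEVELS = (
    "data_category",
    "country_code",
    "originating_centre",
    "station_type",
    "representation",
)


def build_topic(**parts: str) -> str:
    """Join the five levels into a dot-separated topic string (hypothetical helper)."""
    missing = [name for name in LEVELS if name not in parts]
    if missing:
        raise ValueError(f"missing levels: {missing}")
    values = [parts[name] for name in LEVELS]
    for value in values:
        # A separator inside a value would silently add an extra level.
        if not value or "." in value or "/" in value:
            raise ValueError(f"invalid level value: {value!r}")
    return ".".join(values)


def topic_to_path(topic: str) -> PurePosixPath:
    """Map each topic level to one directory level."""
    return PurePosixPath(*topic.split("."))


topic = build_topic(
    data_category="observations-surface-land",
    country_code="ca",
    originating_centre="cwao",
    station_type="landFixed",
    representation="bufr",
)
print(topic)
print(topic_to_path(topic))
```

Rejecting separators inside a level value keeps the number of levels fixed, which matters if other data categories later get hierarchies of different depth.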

@petersilva
Contributor

petersilva commented Dec 2, 2021

That looks really helpful, but it does not have complete coverage: station_type only applies to observations. What about NWP outputs, forecasts, synthetic satellite imagery, or text forecasts and warnings? Also, is RADAR landFixed?

@tomkralidis
Copy link
Collaborator Author

Good points @petersilva: land surface obs is our initial iteration. The tree will evolve as we build out NWP, forecast, satellite.

@petersilva
Contributor

If that is the hierarchy you want, then we can attempt to modify the tables in GTStoWIS2 to generate them for matching WMO 386 AHLs. This is exactly the kind of feedback we have been trying to solicit on the proposed topic tree.

@petersilva
Contributor

In discussions in GTStoWIS (TT protocols team) we agreed to omit the file format (aka representation) from the topic, since files should carry appropriate file extensions, as is the universal convention outside the weather world. As the topics correspond to file folders, having the type in the topic (which is also a folder) results in every file carrying the representation twice, e.g.:

a/b/type/c/filename.type

The file type/data representation will unavoidably show up twice. Ideally, one looks up some form of information, say alerts, and the directory offers .cap files, .geojson files, and .crex files, and perhaps .txt ones. The committee considers it a bug in our proposed topic tree that data with different representations does not currently show up in the same directory; we have issues open to address it (e.g. wmo-im/GTStoWIS2#55, wmo-im/GTStoWIS2#39).
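To make the point concrete, a small Python sketch of reading the representation from the file extension instead of from a folder level. The helper name group_by_representation and the file names are made up for illustration.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def group_by_representation(paths):
    """Group file names by extension, so one folder can hold several representations."""
    groups = defaultdict(list)
    for raw in paths:
        path = PurePosixPath(raw)
        # The extension (without the dot) stands in for the representation.
        groups[path.suffix.lstrip(".")].append(path.name)
    return dict(groups)


# Hypothetical listing of a single "alerts" folder.
listing = ["alerts/a.cap", "alerts/a.geojson", "alerts/b.cap", "alerts/b.txt"]
print(group_by_representation(listing))
```

With this layout a subscriber looks in one folder and filters by extension, rather than walking a separate branch per format.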

@tomkralidis
Collaborator Author

Update standup 2021-12-03:

  • HH: example for SYNOP, do we extend the tree

  • DB: leave that for data inspection

  • DB: for ocean, we can use regions as named geographies

  • DB: collaborating networks are not originating centres, how do we include them as well?

  • PM: how do we deal with lake buoy data on Lake Malawi?

  • XC: what is the origin of https://github.com/wmo-im/GTStoWIS2/blob/main/GTStoWIS2/TableCCCC.json ? Seems 2x bigger than from the GTS manual (@petersilva comment?)

  • DB: ...public/YYYY-MM-DD: is this the publish date or date of observation? Should be the latter

  • DB/XC: output data filename convention can be more indicative: TODO in csv2bufr
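On the ...public/YYYY-MM-DD point above, a minimal Python sketch of deriving the directory name from the observation time rather than the publish time. Normalizing to UTC is an assumption here (the notes do not say which time zone applies), and observation_date_dir is a hypothetical name.

```python
from datetime import datetime, timedelta, timezone


def observation_date_dir(observed_at: datetime) -> str:
    """Return the YYYY-MM-DD folder name for an observation time (UTC assumed)."""
    if observed_at.tzinfo is None:
        # Refuse naive timestamps: the date can differ by a day across zones.
        raise ValueError("observation time must be timezone-aware")
    return observed_at.astimezone(timezone.utc).strftime("%Y-%m-%d")


# An observation taken late in the evening at UTC-5 lands in the next UTC day.
local = datetime(2021, 12, 3, 23, 30, tzinfo=timezone(timedelta(hours=-5)))
print(observation_date_dir(local))
```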

@david-i-berry
Member

For marine regions the high level geographies from https://www.marineregions.org/gazetteer.php?p=details&id=23616 may be useful.

@david-i-berry
Member

  • DB: collaborating networks are not originating centres, how do we include them as well?

In terms of my comment, one of the big advantages of WIS2.0 is that it opens up the system to observations from other communities. For example, if an oceanographic data centre wanted to set up a WIS2.0 node and make data available via that node, the proposed originating centre hierarchy may not be appropriate.

@tomkralidis
Collaborator Author

Current state of directory structure (based on initial iteration of surface weather data):

|____data
| |____config
| |____outgoing
| |____public
| | |____YYYY-MM-DD
| | | |____source
| | | | |____tree-type
| | | | | |____country-code
| | | | | | |____originating-centre
| | | | | | | |____data-category
| |____incoming
| | |____YYYY-MM-DD
| |____errors
| | |____YYYY-MM-DD
|____metadata
| |____discovery
| |____station

Remaining points:

  • file type as part of topic hierarchy (@petersilva / @efucile): based on TT-Protocols recommendation, we should not include the data representation as part of the tree per se
  • collaborating networks are not always originating centres (@petersilva / @david-i-berry). Should there be guidance on how non-originating centres can define themselves at this part of the tree?

@petersilva
Contributor

petersilva commented Dec 7, 2021

Table CCCC:

The table was built by converting the online PDF to a text file, transposing it, and cleaning it manually, followed by merging with the file from UCAR and manual additions, as the data flowed for weeks and we saw CCCCs show up that were unaccounted for. The flow is the normal bulletin flow of UCAR/UNIDATA (see the deriving_CCCC subdir in the GTStoWIS2 repo).

@petersilva
Contributor

CCCC is just a GTS transition mechanism... The mapping being done by the GTStoWIS2 module is to the "centre" field in the table, which holds simplified, lower-case, but much more readable names: "AMRF" -> "melbourne_regional_forecasting_centre" ... This use of ASCII-constrained simplified place names, far more readable for most than the four-letter CCCCs, is invented as part of, and implicit in, the GTStoWIS2 proposal. This proposal is made in the absence of a higher quality source.
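As a rough illustration only, a Python sketch of what an ASCII-constrained, lower-case simplification and a CCCC lookup could look like. The normalization rule and the names simplify_name and centre_for are assumptions, not the GTStoWIS2 implementation. The only mapping shown is the one quoted above.

```python
import re
import unicodedata

# Only the mapping quoted above; the real table lives in the GTStoWIS2 repo.
CENTRE_BY_CCCC = {"AMRF": "melbourne_regional_forecasting_centre"}


def simplify_name(name: str) -> str:
    """Reduce a place name to lower-case ASCII words joined by underscores (assumed rule)."""
    # Decompose accented characters, then drop anything outside ASCII.
    ascii_only = unicodedata.normalize("NFKD", name).encode("ascii", "ignore").decode("ascii")
    return re.sub(r"[^a-z0-9]+", "_", ascii_only.lower()).strip("_")


def centre_for(cccc: str) -> str:
    """Look up the readable centre name for a four-letter CCCC."""
    try:
        return CENTRE_BY_CCCC[cccc.upper()]
    except KeyError:
        raise KeyError(f"unknown CCCC: {cccc!r}") from None


print(centre_for("AMRF"))
print(simplify_name("Melbourne Regional Forecasting Centre"))
```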

@efucile
Member

efucile commented Dec 7, 2021

we could probably replace originating-center with originator, just to clarify that it may be an entity that is not a center and that is not in the previously existing table. I guess that we will need a controlled vocabulary for this, and I would suggest it go in codes.wmo.int

@petersilva
Contributor

@efucile are you pointing to adding to http://codes.wmo.int/common/centre ? or creating a new table?

@efucile
Member

efucile commented Dec 7, 2021

we need new tables for WIS2

@tomkralidis
Collaborator Author

cc @amilan17

In TT-WISMD, we will start working on WCMP 2.0 codelists at https://github.com/wmo-im/wcmp2-codelists, and will kick off this activity at our next meeting.

@tomkralidis
Collaborator Author

Updated tree:

|____data
| |____config
| |____outgoing
| |____public
| | |____YYYY-MM-DD
| | | |____source
| | | | |____tree-type
| | | | | |____country-code
| | | | | | |____originator
| | | | | | | |____data-category
| |____incoming
| | |____YYYY-MM-DD
| |____errors
| | |____YYYY-MM-DD
|____metadata
| |____discovery
| |____station
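For reference, a minimal Python sketch of creating the fixed part of this tree on disk. It is not the wis2node implementation. The function name create_storage_tree is hypothetical, and the date and topic levels under public, incoming and errors are assumed to be created on demand as data arrives.

```python
from pathlib import Path
from typing import List

# Fixed directories from the tree above; dated and topic levels are added later.
TOP_LEVEL_DIRS = [
    "data/config",
    "data/outgoing",
    "data/public",
    "data/incoming",
    "data/errors",
    "metadata/discovery",
    "metadata/station",
]


def create_storage_tree(root: Path) -> List[Path]:
    """Create the fixed directories under root and return their paths."""
    created = []
    for rel in TOP_LEVEL_DIRS:
        path = root / rel
        # exist_ok makes the call safe to repeat on an existing installation.
        path.mkdir(parents=True, exist_ok=True)
        created.append(path)
    return created
```

Calling create_storage_tree(Path("/some/root")) twice is harmless, which suits a setup step that may be re-run.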

Closing this issue, given the implementation in wis2node proper. Formal codelist creation will be put forth at TT-WISMD.

@tomkralidis tomkralidis reopened this Dec 17, 2021
@tomkralidis
Collaborator Author

tomkralidis commented Dec 17, 2021

Notes based on 2021-12-15 discussion:

  • mirror config and incoming directory structures with public

Next steps:

  • implement directory structure in incoming and config

@petersilva / @efucile thinking more:

| |____public
| | |____YYYY-MM-DD
| | | |____source
| | | | |____tree-type
| | | | | |____country-code
| | | | | | |____originator
| | | | | | | |____data-category

Example: observations-surface-land.ca.cwao.landFixed

Should we consider the data category higher up in the hierarchy, i.e.:

Example

| |____public
| | |____YYYY-MM-DD
| | | |____source
| | | | |____tree-type
| | | | | |____data-category
| | | | | | |____country-code
| | | | | | | |____originator

Example: observations-surface-land.landFixed.ca.cwao

Thoughts?

@petersilva
Contributor

petersilva commented Dec 17, 2021

in hierarchies, normally, each level of the hierarchy has "control" of lower levels of the hierarchy. This principle is expressed both in the OID mechanism ( discussed here: wmo-im/GTStoWIS2#37 ) and was given by @remygiraud as a constraint on the hierarchy.

The country is supposed to be top... the national authority then can permit / assign / control / refuse the next level of the tree (centre names within the country.) Each centre has control of what it publishes under their centre-id.

So country-code.centre-id was a starting point for our committee work.

This control is also a kind of permission to write in the tree. It is natural to implement permissions that align with the hierarchy, but it is hard to see how that would be done if bulletins from every originator are scattered throughout the tree.
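A minimal Python sketch of what hierarchy-aligned write permission could look like when country-code.centre-id leads the topic. The function name may_publish and the example topics are hypothetical. The check compares whole levels, not raw string prefixes.

```python
def may_publish(assigned_prefix: str, topic: str) -> bool:
    """Return True if topic sits at or below the publisher's assigned prefix."""
    owned = assigned_prefix.split(".")
    levels = topic.split(".")
    # Compare level by level so that "ca.cw" does not match "ca.cwao".
    return levels[:len(owned)] == owned


print(may_publish("ca.cwao", "ca.cwao.observations-surface-land.landFixed"))
print(may_publish("ca.cwao", "observations-surface-land.landFixed.ca.cwao"))
```

With the originator at the front, one prefix per centre covers everything it publishes. With the originator at the end of the topic, as in the second call, no single prefix can express the same rule.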

@tomkralidis
Collaborator Author

After discussion with @efucile, should we add a level to the topic hierarchy based on the WMO Unified Data Policy, so:

  • core
  • recommended
  • other ?

Example: core.observations-surface-land.ca.cwao.landFixed

Or is this a function of the discovery metadata evaluation (to assess whether to further bind to a resource)?

cc @petersilva @kaiwirt

@petersilva
Contributor

I don't know enough about the Unified Data Policy to understand the implications. I suspect that "core/recommended/other" is a distinction that is not material to most subscribers, who will not know what it means, or why it matters. Questions:

  • Are we saying that "core" roughly corresponds to traditional WMO 386/Volume C1 data?

  • I guess a particular format of land observation is considered "core" by the WMO?

  • Presumably, a BUFR ob would be in core, and a geojson for the same location in "other"?

  • If someone defines a new template, departing from the standard one, does that BUFR have to go under "other"?

  • When the policy goes through revisions, and a new type gets accepted, do we have two locations and a transition period (progression from "other" to "recommended" and later "core" in successive versions)?

@david-i-berry
Member

david-i-berry commented Jan 28, 2022

Core data is described here:

https://meetings.wmo.int/Cg-Ext-2021/_layouts/15/WopiFrame.aspx?sourcedoc=/Cg-Ext-2021/InformationDocuments/Cg-Ext(2021)-INF04-1-CATALOGUE-OF-CORE-DATA_en.docx&action=default

and resolution 1 / UDP here:

https://ane4bf-datap1.s3-eu-west-1.amazonaws.com/wmocms/s3fs-public/ckeditor/files/Cg-Ext2021-d04-1-WMO-UNIFIED-POLICY-FOR-THE-INTERNATIONAL-approved_en_0.pdf?4pv38FtU6R4fDNtwqOxjBCndLIfntWeR

The format or message type is not important by my reading and I would have thought (hoped) that the BUFR and geojson for the same observation would have the same classification, i.e. if the BUFR is classed as core data then the geojson should also be considered core. If this is not the case it then gets very messy.

I would have thought the easiest way to flag/control would be to have the classification within the topic hierarchy. If we do this do we also need to consider other data licensing models, for example provision under one of the creative commons licenses? We considered this issue when making data available through the C3S data store, code table here:

https://glamod.github.io/cdm-obs-documentation/tables/code_tables/data_policy_licence/data_policy_licence.html

@petersilva
Contributor

petersilva commented Jan 28, 2022

I don't know that format is irrelevant... I gather (perhaps wrongly) that there are discussions between aviation and meteorological people, where the aviation community tends to want the Aviation XML to be limited access, perhaps only commercially available, while the Met community circulates BUFR obs publicly with no restrictions for the same location. BUFR being harder for non-met users to deal with, both communities are happy. So I guess AvXML stuff would be non-core...

From @david-i-berry's links, the core/etc... thing is more about distribution rights and requirements (must be available at no cost, vs. potentially restricted). I don't think most users care what the IP regime for the data they are obtaining is (beyond whether they can access it or not).

Looking at the last link David provided... I guess we get a topic corresponding to "data policy license" with values like "Attribution-NonCommercial-ShareAlike-CC-BY-NC-SA" (likely shortened to CC-BY-NC-SA?) to distinguish between data set licensing, and then potentially identical trees under them, with different content depending on how each product is licensed.

hmm... Is that what people intend?

@tomkralidis
Collaborator Author

I agree with @david-i-berry. File format/representation should be independent from data management/identification.

@david-i-berry
Member

My takeaway from the WIS2node standup call today is that we only want to worry about WMO data and that other sources are out of scope. In defining the topic tree we don't want to include WIS but do want to include the data category / license.

From the WMO Unified Data Policy we have the following definitions:

  1. Members shall provide on a free and unrestricted basis the core data that are necessary for the provision of services in support of the protection of life and property and for the well-being of all nations, at a minimum those data described in Annex 1 to this resolution which are required to monitor and predict seamlessly and accurately weather, climate, water and related environmental conditions;
  2. Members should also provide the recommended data that are required to support Earth system monitoring and prediction activities at the global, regional and national levels and to further assist other Members with the provision of weather, climate, water and related environmental services in their States and Territories. Conditions may be placed on the use of recommended data;

giving rise to two branches

| |____public
| | |____YYYY-MM-DD
| | | |____source
| | | | |____tree-type
| | | | | |____core
| | | | | | |____country-code
| | | | | | | |____originator

and

| |____public
| | |____YYYY-MM-DD
| | | |____source
| | | | |____tree-type
| | | | | |____recommended
| | | | | | |____country-code
| | | | | | | |____originator

For data with restrictions in the recommended branch those restrictions would be specified in the metadata and not the tree itself.
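As a sketch of the two branches in Python: the function name public_path is hypothetical, the level order follows the trees above, and the example values are illustrative ("wis" is the fixed source value mentioned earlier in this thread). Any conditions on recommended data are left to the metadata, as described.

```python
from pathlib import PurePosixPath

# The two branches derived from the WMO Unified Data Policy definitions above.
DATA_POLICIES = ("core", "recommended")


def public_path(date: str, source: str, tree_type: str, policy: str,
                country_code: str, originator: str) -> PurePosixPath:
    """Build the directory for a published dataset under data/public."""
    if policy not in DATA_POLICIES:
        raise ValueError(f"policy must be one of {DATA_POLICIES}, got {policy!r}")
    return PurePosixPath("data", "public", date, source, tree_type,
                         policy, country_code, originator)


print(public_path("2022-01-28", "wis", "observations-surface-land",
                  "core", "ca", "cwao"))
```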

@petersilva
Contributor

petersilva commented Jan 31, 2022

@david-i-berry ... that sounds like what I heard...possible optimization: source in the above tree is "WIS" ... from the discussions, @efucile was advocating getting rid of WIS, so we could just promote 'core' and 'restricted' up three levels, and have... say... WMO-Core just under the date, and WMO-Recommended, WMO-Other... etc... we save a level in the tree... and still get the concept in there.

@petersilva
Contributor

sorry, I just noticed tree-type is separate from "recommended" or "core" ... perhaps I misunderstood... I initially thought they were the same thing... I don't know what tree-type is. I edited the previous comment... to omit discussion of tree-type.

@tomkralidis
Collaborator Author

Implemented in initial iteration. Will evolve in parallel with direction from WMO topic hierarchy efforts.
