Data Storage
This page describes the tasks related to establishing the data storage of BONSAI, including types of data to be stored, format issues, storage security, etc.
- Establish core data structure
- Specify minimum core data and metadata formats
- Clarification of minimum security issues
- Specify storage of source code
- Establish storage for calculated product footprints
Establish core data structure
- Priority: High
- Estimated person-hours: 70
- Volunteer(s)/Candidate(s): Matteo Lissandrini; Emil Riis Hansen
Functional specifications:
Establish a maintainable and expandable core triplestore that can be filled with a limited amount of preliminary data showing application examples of the core data and metadata.
The argument for using the RDF triple format (subject - predicate - object) and storing data as separate datapoints (flow-object, activity, flow) is that this provides the largest flexibility in designing data structures, a flexibility that relational or object-based data structures do not offer. At the same time, the format can be interpreted by a computer, which is not the case with raw data (plain text).

An RDF triple relates a subject to an object through a predicate, e.g. "Production of 1 kg steel - has input of - 2 kg coal". Combined, such triples create a graph of data that has no predetermined boundaries, and no predetermined structure beyond the structure of a triple. This is a crucial feature of the RDF format: the creator of the graph is not required to know the resulting data structure before starting to add information to it. Metadata can be stored as graphs and queried with property path queries, which traverse a network of facts until some terminating condition is met.
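As a minimal sketch of what this looks like in practice, the snippet below builds the steel/coal example as triples with Python's rdflib and traverses the resulting graph with a SPARQL property path. The bonsai.example namespace and all predicate names are hypothetical placeholders, not an agreed BONSAI vocabulary.

```python
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import XSD

# Hypothetical namespace and predicates, for illustration only.
B = Namespace("http://bonsai.example/ontology#")

g = Graph()
g.bind("b", B)

# "Production of 1 kg steel - has input of - 2 kg coal", split into
# separate datapoints: the activity, the flow, and its flow-object.
g.add((B.steelProduction, B.hasInput, B.coalInput))
g.add((B.coalInput, B.flowObject, B.coal))
g.add((B.coalInput, B.amount, Literal(2.0, datatype=XSD.float)))

# The graph can later be extended with triples that were not foreseen
# when the first datapoints were added.
g.add((B.coal, B.suppliedBy, B.coalMining))

# A property path query traverses the network of facts, here three
# hops from the activity to whatever supplies its inputs.
q = """
PREFIX b: <http://bonsai.example/ontology#>
SELECT ?supplier
WHERE { b:steelProduction b:hasInput/b:flowObject/b:suppliedBy ?supplier }
"""
for row in g.query(q):
    print(row.supplier)  # http://bonsai.example/ontology#coalMining
```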
For another example of RDF database development, see Bio2RDF.
To apply e.g. contribution analysis to the calculated footprints, not only the raw data but also the linked data (the Direct Requirements Matrices) needs to be stored. In addition to the RDF stores, other database formats can be used where speed of search and calculation makes this desirable.
In the longer run, it may be considered to place the triplestore on IPFS. IPFS was developed out of http://dat-data.com, where the latter focuses specifically on scientific data (see IPLD and https://github.com/ipfs/faq/issues/119 and the discussion near the end of https://github.com/ipfs/ipfs/issues/36). A parallel that focuses more on social media data is SOLID.
Technical specifications:
Since RDF is an abstract syntax, the triples need to be serialised in a concrete RDF syntax. As the default syntax, we use JSON-LD. For exchange of large amounts of RDF and for processing with line-oriented tools, N-Triples may also be relevant. RDF Translator provides a conversion tool between the different serialisations.
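A small illustration of the two serialisations, again with rdflib (version 6+, where JSON-LD support is built in); all names remain placeholders:

```python
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import XSD

B = Namespace("http://bonsai.example/ontology#")  # hypothetical namespace
g = Graph()
g.bind("b", B)
g.add((B.coalInput, B.amount, Literal(2.0, datatype=XSD.float)))

# Default exchange syntax: JSON-LD.
print(g.serialize(format="json-ld"))

# One triple per line, convenient for line-oriented tools and for
# streaming large dumps: N-Triples.
print(g.serialize(format="nt"))
```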
Several database management systems exist for RDF. The currently most widely used is Apache Jena. An alternative to Apache Jena that has been considered is the open data management service CKAN (Comprehensive Knowledge Archive Network). CKAN is built with Python on the backend and JavaScript on the frontend, uses the Pylons web framework and SQLAlchemy as its ORM, runs on a PostgreSQL database engine, and its search is powered by Solr. See the EUDAT-funded assessment of CKAN vs. Apache Jena.
There are also many converters from external data to RDF; some of the most relevant are CSV2RDF for tabular data, D2RQ for relational database data, and XSLT for XML-formatted data.
The latter allows the full use of the tools of the SDMX community, which is sponsored and used by the UN, the World Bank, EUROSTAT, the OECD, and others, and was specially developed for the exchange and sharing of open data and metadata among these organisations. The tools, which are all based on the ISO 17369 SDMX standard, are provided by the developers of this standard. Probably the most relevant in this context is the Fusion Registry, which contains all that is needed to set up a management system for distributed data. Support contracts are available from MetaData Technology Ltd. While the SDMX community currently works by default in XML (SDMX-ML), the RDF Data Cube Vocabulary is also based on SDMX, enabling smooth translation from SDMX-ML to RDF; see also the work of Sarven Capadisli.
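For the tabular case, a hand-rolled sketch of what a CSV2RDF-style conversion produces; the sign convention and all names are hypothetical, chosen only to show how an implicit schema convention becomes explicit triples:

```python
import csv
import io

from rdflib import Graph, Literal, Namespace
from rdflib.namespace import XSD

B = Namespace("http://bonsai.example/data#")  # hypothetical namespace

# Toy table: one datapoint per row; inputs carry a negative sign,
# a convention the converter turns into explicit isInputOf/isOutputOf.
raw = io.StringIO(
    "activity,flow_object,amount\n"
    "steelProduction,coal,-2.0\n"
    "steelProduction,steel,1.0\n"
)

g = Graph()
for i, row in enumerate(csv.DictReader(raw)):
    flow = B[f"flow{i}"]
    amount = float(row["amount"])
    relation = B.isInputOf if amount < 0 else B.isOutputOf
    g.add((flow, relation, B[row["activity"]]))
    g.add((flow, B.flowObject, B[row["flow_object"]]))
    g.add((flow, B.amount, Literal(abs(amount), datatype=XSD.float)))

print(g.serialize(format="nt"))
```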
Specify minimum core data and metadata formats
- Priority: High
- Estimated person-hours: 70
- Volunteer(s)/Candidate(s): Matteo Lissandrini, Agneta Ghose, Bo Weidema
Functional specifications: Provide further specifications for the formats of:
- Datapoint:
  - Number
  - Unit
  - Uncertainty (not required from data providers)
  - Input/output relation
  - Reference unit
  - The 6 identifying data dimensions of each datapoint
- Metadata:
  - Author
  - Provenance
  - Structured metadata
  - Free text
Technical specifications: As far as possible, we should use the same specifications as in the ecoSpold2 format, the OLCA schema, and generally accepted standards such as the SDMX standard, together with widely used web ontologies such as SKOS (a consolidated datapoint sketch follows this list):
- Number: Floating point numbers xsd:float. This also allows the use of "NaN" (Not a Number) to represent non-available data as distinct from zero (however, in RDF it would normally be preferable simply not to provide the corresponding triple, as it does not add any relevant information). Note also that ecoSpold2 has the option of providing a @mathematicalRelation that defines a mathematical formula, which can also contain variables, and which fills the value of the @amount if @isCalculatedAmount is TRUE. This is a very explicit and recommendable option for providing provenance information. In ecoSpold2 the formulas are defined by a sub-set of the OpenFormula standard. Other RDF-related formula standards are described on the Wikipedia page for MathML. It should be noted that the uncertainty of a calculated number is derived from the uncertainty of the contributing numbers plus any uncertainty on the mathematical relation itself (see the propagation sketch after this list). Pascal Lesage has worked on a routine for this and may provide more specifications (or code).
- Unit: A systematic, extensive and recent (2018) Comparison and Evaluation of Ontologies for Units of Measurement by Keil & Schindler found that the OM ontology contains the most unit conversion factors and the most units with specified dimensions and quantity-kinds, and has the fastest reaction to error reporting. Currency codes are maintained in ISO 4217 and are also part of the FAO geopolitical ontology. Particular attention should be paid to currency conversions, because currency values change over time (inflation correction) and currency exchange rates exist in both nominal and purchasing-power-corrected versions. For utility valuation, currencies may also be multiplied with different distributional equity weights. Different valuations (basic price, producers' price, purchasers' price) express the different prices for the same product before and after trade; this is not a unit issue per se, but can rather be treated as different (but related) properties (see also Figure 5 in Weidema, Ekvall & Heijungs 2009).
- Uncertainty: Could be adopted from the OLCA schema, except for using xsd:float instead of xsd:double to avoid storing with unnecessary precision (while recommending to perform calculations in "double" precision, see Ernerfeldt 2017). For the formula language, see under Number. OLCA limits the UncertaintyType to normal, log-normal, triangular and uniform, whereas ecoSpold2 additionally has the lesser-used beta, gamma, binomial, and "undefined", which allows storing practically any kind of uncertainty information. ecoSpold2 also has numerical fields for pedigree data quality indicators. Quantitative uncertainty should not be required from data providers, since this could limit data supply unnecessarily. When required for calculation, missing quantitative uncertainty can be inferred from other data or estimated with default values.
- Input/output relation: Whether a flow is an input or an output of an activity. In RDF, this relation must be made explicit as "isInputOf" or "isOutputOf", while in some raw data sources it is provided by a schema convention, e.g. that inputs are indicated with a negative sign and outputs with a positive sign.
- Reference unit: This is the quantity of the flow to which the quantity of a datapoint should be seen as proportional. For example, a flow of a specific quantity of CO2 from an activity may be related to another flow of this activity (e.g. 1 km) or to a time period (for example, the annual operation of this activity). In RDF, these relations must be made explicit for each datapoint, while in most raw data sources the reference unit is implicit in the underlying schema (e.g. a schema specification can be that numbers are always given per unit of the determining output of the activity).
- For Flow-objects and Activities, see further details under classifications.
- Properties: Properties are attributes/characteristics of flow-objects, activities or flows. Numerical properties should remain flexible in relation to what they refer to (e.g. mass per flow unit or mass per unit output of the activity) to avoid locking in to a specific data format. It is useful to distinguish between two types of flow-object properties: Balanceable flow-object properties (properties for which the sum for all input flows must equal the sum for all output flows, such as dry mass, wet mass, water mass, elemental mass, monetary value, person-time) and other (non-balanceable) flow-object properties (e.g. feedstock yields, lifetime, heating values, biodegradability), since the former type is an identifying dimension of the datapoints.
- Time: We follow - for the time being - the limitation of ecoinvent to only accept datapoints that cover a time period of minimum 1 year. Rather than the relatively complex OWL-Time ontology, the simplest ontological notation is probably prov:startedAtTime and prov:endedAtTime using xsd:dateTime. This should be interpreted as the period for which the datapoint is valid, in contrast to the timestamps for data collection and subsequent data handling.
- Location: Possibly use GeoNames or GeoSPARQL (a geospatial ontology). Heed the recommendations from the UNEP/SETAC sub-group on LCIA regionalization (Chris Mutel).
- Author and provenance: How was the data generated? Where did it come from? How was it processed? Use PROV.
- Structured metadata: Initially, a field is required to distinguish determining products from by-products and wastes. Since the identification of determining products is algorithm-based, it may later be possible to make this a non-obligatory field.
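The consolidated sketch referenced above: a single datapoint expressed with these conventions, combining an xsd:float amount, a unit from the OM ontology, an explicit isOutputOf relation, a reference flow, and a validity period given as prov:startedAtTime/prov:endedAtTime. The bonsai.example terms are hypothetical; only the OM, PROV, and XSD identifiers belong to real vocabularies.

```python
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import PROV, XSD

B = Namespace("http://bonsai.example/ontology#")  # hypothetical
OM = Namespace("http://www.ontology-of-units-of-measure.org/resource/om-2/")

g = Graph()
g.bind("b", B)
g.bind("om", OM)

dp = B.datapoint_001  # one CO2 output flow of a transport activity

g.add((dp, B.isOutputOf, B.transportActivity))
g.add((dp, B.flowObject, B.carbonDioxide))
g.add((dp, B.amount, Literal(0.12, datatype=XSD.float)))
g.add((dp, B.unit, OM.kilogram))

# Reference unit: the amount is proportional to 1 km of transport.
g.add((dp, B.referenceFlow, B.transportDistanceKm))

# Validity period (minimum one year), not the collection timestamp.
g.add((dp, PROV.startedAtTime,
       Literal("2018-01-01T00:00:00", datatype=XSD.dateTime)))
g.add((dp, PROV.endedAtTime,
       Literal("2018-12-31T23:59:59", datatype=XSD.dateTime)))

print(g.serialize(format="ttl"))
```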
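And the propagation sketch flagged under Number: uncertainty for a calculated amount derived from the uncertainty of the contributing numbers plus the uncertainty of the relation itself. This is a plain Monte Carlo illustration with invented distributions, not Pascal Lesage's routine.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 100_000

# Contributing numbers with their own uncertainty (log-normal, common
# for LCA amounts): fuel use per km and an emission factor.
fuel_per_km = rng.lognormal(mean=np.log(0.05), sigma=0.10, size=n)
co2_per_fuel = rng.lognormal(mean=np.log(2.4), sigma=0.05, size=n)

# Uncertainty on the mathematical relation itself, modelled as a
# correction factor centred on 1.
relation_factor = rng.normal(loc=1.0, scale=0.02, size=n)

co2_per_km = fuel_per_km * co2_per_fuel * relation_factor

print(f"mean: {co2_per_km.mean():.4f} kg CO2/km")
print(f"95% interval: {np.percentile(co2_per_km, [2.5, 97.5])}")
```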
Clarification of minimum security issues
- Priority: High
- Estimated person-hours: 5
- Volunteer(s)/Candidate(s):
Description of task: This description needs to cover topics such as access control, access monitoring, nefarious code injection, and other relevant security issues.
Specify storage of source code
- Priority: Done
Description of task: Besides the core database, BONSAI will store algorithms and software code for estimating missing data and for accessing and working with data. Storing algorithms will allow users to transparently reproduce calculations, reuse algorithms on new data, and develop derived algorithms with a clear provenance trail.
Technical specifications: The BONSAI source code should be made available from this public repository on GitHub: BONSAMURAIS
Establish storage for calculated product footprints
- Priority: Low
- Estimated person-hours: 5
- Volunteer(s)/Candidate(s): Domain experts and professional data providers
Functional specifications: Although product footprints may be produced on the fly from the Direct Requirements Matrix, the calculated footprints may need to be stored (temporarily) to allow speedier access.
Technical specifications: Provide a store for calculated product footprints in a form that makes them quickly accessible for web-queries, e.g. as csv files.
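A minimal sketch of such a store, under the assumption that footprint intensities are computed from a direct requirements matrix A and an intervention matrix B via the Leontief inverse, B(I - A)^-1; all numbers and file names are illustrative.

```python
import csv

import numpy as np

# Toy direct requirements matrix A (inputs per unit output) and
# intervention matrix B (here one indicator, kg CO2 per unit output).
A = np.array([[0.0, 0.2],
              [0.1, 0.0]])
B = np.array([[1.5, 0.8]])

# Footprint intensities per unit of final demand: B (I - A)^-1.
footprints = B @ np.linalg.inv(np.eye(A.shape[0]) - A)

# Cache the result as a CSV file for fast access from web queries.
with open("footprints.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["activity", "kg_co2_per_unit"])
    for name, value in zip(["steelProduction", "coalMining"], footprints[0]):
        writer.writerow([name, f"{value:.6f}"])
```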