Make Data Usable
This page describes the tasks of the working groups responsible for making BONSAI data usable.
Contents
Product System Algorithms
Datasets interaction algorithms
Unit converter
Uncertainty propagation
Generation of custom geographic data
Markets and marginal production mixes routines
[Data access](https://github.com/BONSAMURAIS/bonsai/wiki/Make%20Data%20Usable#data-access)
Query parser on triplestore
Natural language interface, API
Data download
User communities
The Product System algorithms – calculation rules – will be used to calculate product “footprints” from the original unlinked unit process data. The algorithms will be documented and made available as open source. The first step of this procedure consists of using linking algorithms to transform the raw data (e.g. a Supply Use Table) into a Direct Requirements Matrix (DRM) including the impact assessment vector; the second step consists of calculating the Leontief inverse of the Direct Requirements Matrix to obtain the product footprints. The Product System Algorithms guarantee mass-balanced systems and provide comparable “footprints”.
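As an illustration of the two steps, a minimal sketch in Python/NumPy; the matrix names and example values are hypothetical, not the actual BONSAI data structures:

```python
import numpy as np

# Hypothetical direct requirements matrix A (product-by-product technology
# coefficients derived from a linked supply-use table) and environmental
# extension vector b (e.g. kg CO2 emitted directly per unit of each product).
A = np.array([[0.1, 0.3],
              [0.2, 0.1]])
b = np.array([0.5, 2.0])

# Step 2: the Leontief inverse L = (I - A)^-1 gives the total (direct + indirect)
# output required per unit of final demand of each product.
L = np.linalg.inv(np.eye(A.shape[0]) - A)

# Product "footprints": total emissions per unit of final demand.
footprints = b @ L
print(footprints)
```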
Priority: Intermediate;
Estimated man-hour: ...
Volunteer(s)/Candidate(s): Stefano Merciai, Jannick Schmidt
Description of task: Algorithms handling the interaction between:
- global and local datasets, e.g. production volume conflicts
- datasets for different time periods, for now-casting and scenario-forecasting purposes, and the temporal linking of data for stock modelling (see "Storing stock data" in Big Data Harvesting)
- the temporal linking of waste treatment datasets, e.g. handles flows between activities with different time periods
- handling of temporal delays in impact pathways e.g. stocks
- propagation of flow properties (e.g. propagating wood density or heating values from the supplying activity to the receiving activity, or vice versa, and handling conflicts between such property information), including the interaction between flow-property layers (see the sketch at the end of this task)
- matching geographically specified data (handling how flows are specified between source and sink activities at different locations, "transforming activity - market - transforming activity" and "environmental mechanism - source/sink mixes - environmental mechanism")
- procedures for generating data for future macroeconomic scenarios
Technical specifications: The diagram below shows the current algorithm handling the dataset interactions for the conversion of the EXIOBASE monetary tables into hybrid units.
Questions/Discussions:
- If we want to base ourselves on supply-use data prior to any constructs, what procedures can we use for back-calculating supply-use data for countries where only direct requirement matrices are available?
- How to create more efficient algorithms? Much of machine learning relies on gradient descent algorithms that run on GPUs, and there are many packages that allow for working with matrix data directly; see the Torch packages maths.md.
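Illustrating the flow-property propagation item above, a minimal, hypothetical sketch of how a property (e.g. a heating value) could be carried from a supplying activity to a receiving activity, flagging conflicts when both ends already report a value; the data structures and tolerance are assumptions, not the actual BONSAI design:

```python
# Hypothetical flow records: each flow links a supplying and a receiving
# activity and may carry a property value (e.g. heating value in MJ/kg)
# reported on either end.
flows = [
    {"supplier": "sawmill", "receiver": "heat plant",
     "property": "heating_value", "supplier_value": 18.0, "receiver_value": None},
    {"supplier": "sawmill", "receiver": "board factory",
     "property": "heating_value", "supplier_value": 18.0, "receiver_value": 15.0},
]

REL_TOLERANCE = 0.05  # assumed threshold for flagging a conflict

def propagate_property(flow):
    s, r = flow["supplier_value"], flow["receiver_value"]
    if s is not None and r is None:
        flow["receiver_value"] = s          # propagate downstream
    elif r is not None and s is None:
        flow["supplier_value"] = r          # propagate upstream
    elif s is not None and r is not None:
        if abs(s - r) / max(abs(s), abs(r)) > REL_TOLERANCE:
            flow["conflict"] = True         # flag for later reconciliation
    return flow

for f in flows:
    print(propagate_property(f))
```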
Priority: Intermediate;
Estimated man-hour: ...
Volunteer(s)/Candidate(s):
Description of task: Algorithms handling the propagation of uncertainty from raw data to calculated/reconciled data, and handling conflicts between flow properties arising when data propagate (e.g. wood density or heating values propagated from the supplying activity to the receiving activity, or vice versa)
Technical specifications:
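A minimal sketch of one possible approach, Monte Carlo propagation of uncertainty through the footprint calculation; the lognormal parameters and matrix sizes are purely illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(42)

# Illustrative direct requirements matrix A and extension vector b, each
# coefficient given a lognormal uncertainty around its mean value.
A_mean = np.array([[0.1, 0.3],
                   [0.2, 0.1]])
b_mean = np.array([0.5, 2.0])
gsd = 1.1  # assumed geometric standard deviation for all coefficients

samples = []
for _ in range(1000):
    A = A_mean * rng.lognormal(mean=0.0, sigma=np.log(gsd), size=A_mean.shape)
    b = b_mean * rng.lognormal(mean=0.0, sigma=np.log(gsd), size=b_mean.shape)
    L = np.linalg.inv(np.eye(2) - A)
    samples.append(b @ L)

samples = np.array(samples)
print("median footprints:", np.median(samples, axis=0))
print("95% interval:", np.percentile(samples, [2.5, 97.5], axis=0))
```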
Priority: Low
Estimated man-hour: ...
Volunteer(s)/Candidate(s): Chris Mutel
Description of task: Implement a tool for handling geographical locations (e.g. how to define the Rest of the World: global, or Europe without Switzerland?).
Matching geographically specified data (handling how flows are specified between source and sink activities at different locations, "transforming activity - market - transforming activity" and "environmental mechanism - source/sink mixes - environmental mechanism")
Technical specifications:
Example of a repository that contains the scripts and data needed to build a consistent topology of the world (provinces, countries, and states). It also includes the ability to define recipes for generating custom locations. The repository is a mix of SQL, bash scripts, and Python. See the file topology-journal.rst for instructions and a journal of what was done and why.
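A minimal, hypothetical sketch of the "recipe" idea: defining custom locations such as "Rest of the World" or "Europe without Switzerland" as set operations over base location codes; the codes and recipes are illustrative, not the repository's actual format:

```python
# Illustrative base topology: a small set of ISO-like country codes.
WORLD = {"CH", "DE", "FR", "IT", "CN", "US", "BR"}
EUROPE = {"CH", "DE", "FR", "IT"}

def rest_of(whole, parts):
    """Locations in `whole` not already covered by any of `parts`."""
    return whole - set().union(*parts)

# "Rest of the World" relative to the regions explicitly modelled.
rest_of_world = rest_of(WORLD, [EUROPE, {"CN", "US"}])

# "Europe without Switzerland" as a custom recipe.
europe_without_ch = EUROPE - {"CH"}

print(sorted(rest_of_world))      # ['BR']
print(sorted(europe_without_ch))  # ['DE', 'FR', 'IT']
```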
Priority: Low
Estimated man-hour: ...
Volunteer(s)/Candidate(s): Bo, Stefan, Konstantin, Stefano
Description of task: Routines for markets and marginal production mixes. Identify marginal suppliers as a weighted average of the suppliers that can change their capacity.
Technical specifications: Implement validations required to deal with all possible co-production situations. Represent marginal suppliers in matrix form to match SUT.
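A minimal sketch of the weighted-average idea described above; the supplier names, volumes, and capacity-change flags are illustrative assumptions:

```python
# Hypothetical suppliers of a product, with production volumes and a flag for
# whether the supplier can change its capacity (i.e. can respond to a change
# in demand and therefore be part of the marginal mix).
suppliers = [
    {"name": "plant A", "volume": 100.0, "can_change_capacity": True},
    {"name": "plant B", "volume": 50.0,  "can_change_capacity": True},
    {"name": "plant C", "volume": 80.0,  "can_change_capacity": False},  # constrained
]

def marginal_mix(suppliers):
    """Marginal suppliers weighted by volume, excluding constrained suppliers."""
    unconstrained = [s for s in suppliers if s["can_change_capacity"]]
    total = sum(s["volume"] for s in unconstrained)
    return {s["name"]: s["volume"] / total for s in unconstrained}

print(marginal_mix(suppliers))  # plant A ≈ 0.67, plant B ≈ 0.33
```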
Priority: Intermediate;
Estimated man-hour: ...
Description of task: Raw data may be in mixed units (e.g. monetary, physical, energy). A units ontology is necessary to perform conversions to standard units (e.g. kg) and to recognize the types of units (mass, energy, etc.) being used.
Technical specifications: (needs to be edited)
- Use property information to convert between layers, e.g. from the monetary to the physical/energy layer; the property provides the conversion factor (e.g. 0.1 EUR/MJ for electricity from wind).
- User may submit "patch" to existing data or data that needs to be validated into the existing schema in order to be incorporated.
- Conversion factors, property relations. These could be filtered based on the level of detail: search for the industrial sector, then see if a specific country is matched, then the year. In practice, SPARQL OPTIONAL with SORT could be used for this; it would prioritize conversion factors with the most detailed data filled in.
- Attributes - like assumptions (Dry matter content - transfer coefficients, HHV, grade of ores, supplemental clarifying information).
- Extra supporting information to aid with balancing (originating activity -> consuming activity);
- Macroeconomic data that gives information about how the flows should add up e.g. Country GDP;
- In principle, the property information should be tied with knowledge about classifications.
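A minimal sketch of a property-based conversion with fallback on the level of detail (flow, then country, then year), mimicking the SPARQL OPTIONAL idea above; the factor values and lookup keys are illustrative assumptions:

```python
# Hypothetical conversion factors keyed by (flow, country, year); None acts as
# a wildcard, so more specific entries take precedence over generic ones.
FACTORS = {
    ("electricity, wind", "DK", 2016): 0.09,   # EUR per MJ
    ("electricity, wind", "DK", None): 0.10,
    ("electricity, wind", None, None): 0.12,
}

def conversion_factor(flow, country, year):
    """Return the most specific factor available for the requested flow."""
    for key in [(flow, country, year), (flow, country, None), (flow, None, None)]:
        if key in FACTORS:
            return FACTORS[key]
    raise KeyError(f"no conversion factor for {flow}")

# Convert a monetary flow (EUR) to the energy layer (MJ).
eur = 1000.0
mj = eur / conversion_factor("electricity, wind", "DK", 2016)
print(mj)
```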
The query parser on the triplestore will allow users to query the database to extract the data that will be used for the Direct Requirement Matrix. The matrix is a linked version of the Supply Use Table, representing a linear, homogeneous, steady-state model of the economy, where each activity has only one product output, and all product inputs to an activity are the products of other activities, thus providing a linked product system for each product of each activity. SPARQL (the SPARQL Protocol and RDF Query Language), the query language of the Semantic Web, will be used.
Priority: High;
Estimated man-hour: 150
Volunteer(s)/Candidate(s): Chris Davis
Functional specifications: The query parser should allow users to query the database to extract the data that will be used for the Direct Requirement Matrix.
Technical specifications: Make a JSON API available for querying and accessing the RDF store (see CKAN). As SQL queries relational databases, SPARQL queries RDF data. The result of a SPARQL query can be a result set (as in SQL), but it can also be an RDF graph – a subset of the original graph queried.
- Extract data from the SUT, using a graphical user interface with pre-defined queries.
- It should be possible to filter data by license type, in order to look only for open-source information, or only for data by a particular user (including reprocessing of data after filtering).
Test: Using R, the WIOD database files were converted to RDF and, with a Fuseki server placed in front, it is possible to query the RDF database.
Demonstrate SPARQL queries on RDF store:
- SPARQL by example by Cambridge Semantics
- Using SPARQL with Enipedia
- SPARQL Queries for Statistics
- The speed could be a limitation but there are possibilities to reduce the file size.
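A minimal sketch of querying such a Fuseki endpoint from Python with SPARQLWrapper; the endpoint URL, vocabulary prefix, and predicate names are hypothetical, not the actual BONSAI schema:

```python
from SPARQLWrapper import SPARQLWrapper, JSON

# Hypothetical Fuseki endpoint holding the supply-use data as RDF.
sparql = SPARQLWrapper("http://localhost:3030/bonsai/sparql")
sparql.setReturnFormat(JSON)

# Hypothetical vocabulary: select flows with their activity, product and amount.
sparql.setQuery("""
    PREFIX b: <http://example.org/bonsai/>
    SELECT ?activity ?product ?amount
    WHERE {
        ?flow b:hasActivity ?activity ;
              b:hasProduct  ?product ;
              b:amount      ?amount .
    }
    LIMIT 10
""")

for row in sparql.queryAndConvert()["results"]["bindings"]:
    print(row["activity"]["value"], row["product"]["value"], row["amount"]["value"])
```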
The natural language interface, also called the footprint query interface, is an Application Programming Interface (API). It should allow users to obtain footprints from a free-text search. To allow for speedier access, the pre-calculated footprints will be stored in a separate database (which could be a relational database). The user should be able to ask a simple free-text question like "What is best: a match or a lighter?" The interface will then use fuzzy text matching to find the most likely corresponding pre-calculated footprints for a match and a lighter, using default (functional) units. This implies that semi-automatic matching routines will be developed to assist users in matching a missing queried object with the closest proxy available in the database. The result page will provide a visualization of the product externalities as a percentage of the product’s price. The footprint query system will also be made available through a mobile phone application.
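A minimal sketch of the fuzzy matching step using Python's standard library difflib; the pre-calculated footprint names and values are purely illustrative:

```python
import difflib

# Hypothetical pre-calculated footprints, keyed by product name (default unit).
FOOTPRINTS = {
    "match, wooden": 0.8,          # illustrative values only
    "lighter, disposable": 15.0,
    "candle, paraffin": 10.0,
}

def lookup(query, cutoff=0.4):
    """Return the closest pre-calculated footprint for a free-text query."""
    names = list(FOOTPRINTS)
    hits = difflib.get_close_matches(query.lower(), names, n=1, cutoff=cutoff)
    if not hits:
        return None
    return hits[0], FOOTPRINTS[hits[0]]

# "What is best: a match or a lighter?" -> compare the two candidate products.
for term in ["match", "lighter"]:
    print(term, "->", lookup(term))
```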
Priority: High
Estimated man-hour: 200h according to Eaternity's "Minimum Viable Product" (MVP) (this corresponds to 25 story points (SP), each corresponding to one day)
Volunteer(s)/Candidate(s): (preliminary) Manuel Klarmann, Dominik Stefancik, Matthias Munder, Jens Hinkelmann
- Free text interface with query parser on pre-calculated footprints
- Match between user queries and the products available in the database.
- Allow users to query the database using natural language free-text search
It provides, for queries concerning a single product, externalities expressed as a percentage of the price of the product (price = 100%, externality = e.g. 43% on top of price), with uncertainty, presented graphically.
Functional specifications: Interface to show product comparisons (e.g. "match vs lighter"). For graphics:
- visualizing the two products side by side.
- visualizing one product in comparison with the average of all products of the same category (e.g. human vs average of country)
- visualizing one product in comparison with the best performing 20% of all products of the same category
- Auto-generated suggestions for improving a specific search may be proposed to the user, such as specifying units, or specifying geography and time.
- Graphical tools for the Direct Requirement Matrix and Leontief inverse matrix, providing nice visualisations, like Sankey diagrams (examples by Brandon Kuczenski, Google, or pymrio); initially use existing tools to the extent possible.
- Important challenge is to visualise the meaning of negative flows in an intuitive way, e.g. by using name changes instead of sign changes in the user interface.
- Allow users to extract data from the SUT, using a graphical user interface with pre-defined queries. Users only see the download link.
- For product comparison, if uncertainty allows a clear statement, indicate which option has the smallest footprint (understood as social cost), possibly with additional graphics as above. If uncertainty does not allow a clear statement (i.e. the uncertainty is too large to say which option has the smallest footprint), the graph should show an indication of the largest contributor to the uncertainty.
- Graphical access to the Direct Requirement Matrix and Leontief inverse matrix, using existing user-friendly visualizations, like Sankey diagrams or pymrio; visualize the meaning of negative flows in an intuitive way (a minimal sketch follows below).
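A minimal sketch of such a Sankey visualisation with plotly; the flows and labels are illustrative, and the negative flow is shown here by a name change ("avoided production of heat") rather than a sign change, as suggested above:

```python
import plotly.graph_objects as go

# Illustrative flows of a small product system; a by-product credit is renamed
# rather than drawn with a negative value.
labels = ["electricity", "wood chips", "board production", "avoided production of heat"]
fig = go.Figure(go.Sankey(
    node=dict(label=labels),
    link=dict(
        source=[0, 1, 2],
        target=[2, 2, 3],
        value=[5.0, 8.0, 2.0],   # all values kept positive; the label carries the sign
    ),
))
fig.show()
```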
- See Research Data Alliance recommendations on Data citation of evolving data.
- Introduce BONSAI core data and product footprint data into Wikipedia
Specifications: BONSAI could be linked to other general data providers like Wolfram Alpha (open for any additional data source).
Description of task: Both the raw data and the calculated product footprints shall be available to the general public in an easily accessible way (e.g. CSV). The interface should also allow downloading the entire database. Users only see the download link.
Technical specifications:
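A minimal sketch of a CSV download endpoint, here using Flask; the route, field names, and example data are assumptions for illustration only:

```python
from flask import Flask, Response
import csv, io

app = Flask(__name__)

# Hypothetical in-memory results; in practice these would come from the store
# of pre-calculated footprints.
FOOTPRINTS = [
    {"product": "match, wooden", "footprint": 0.8, "unit": "g CO2-eq/item"},
    {"product": "lighter, disposable", "footprint": 15.0, "unit": "g CO2-eq/item"},
]

@app.route("/download/footprints.csv")
def download_footprints():
    """Serve the calculated footprints as a CSV file behind a single download link."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=["product", "footprint", "unit"])
    writer.writeheader()
    writer.writerows(FOOTPRINTS)
    return Response(buf.getvalue(), mimetype="text/csv",
                    headers={"Content-Disposition": "attachment; filename=footprints.csv"})

if __name__ == "__main__":
    app.run()
```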
The purpose is to support specific user communities, both within the existing LCA/footprint community and outside it (for example in the material flow and economic modelling communities), in designing web tools for interacting with the basic data of the BONSAI database. A higher number of users increases the interest in maintaining and improving the database.
The aim is to allow users to specify additional data filters and system algorithms for specific needs (for example, using only data that have been additionally reviewed, algorithms that support specific legal requirements, and commercial add-ons to the otherwise Open Source database).
- Address CGE modellers: cross-elasticities can be interesting, dynamic aspects, scenario vs. short-term perturbation
- Address detailed IAM modellers: MARKAL, IMAGE.