Harvest Data
This page describes the tasks of the different working groups responsible for harvesting data for BONSAI.
Contents
Input Output Framework
Fill the beta version with default Supply and Use Tables
Documenting procedures
Re-balancing routines
Supply and Use Tables gap-filling & correction routines
Disaggregation routines
Big Data Harvesting
Data scraping & text mining
Stock data
Data Provider Interface
Data upload template
Data Parser
Supply and Use Tables (SUTs) are the backbone of BONSAI, guaranteeing that default data are available for any imaginable product and that the database covers the entire global economy. Economic SUTs are produced and published by national statistical agencies, typically at 5-year intervals, with interpolation and extrapolation from aggregate data on an annual basis.
The data in the BONSAI database need to be provided with a fully open license (CC-BY) that also allows commercial use.
- Priority: High;
- Estimated person-hours: 300
- Volunteer(s)/Candidate(s): Stefano, Konstantin, Marie, Michele
Functional specifications: The ideal "raw" data for BONSAI are the national SUTs with environmental and social extensions from national statistical agencies and other government and academic institutions and NGOs. While SUTs in the form supplied directly by national statistical agencies will be an important source of data for BONSAI in the longer run, for the beta version we need to fill the database with relatively high-quality data in a short time. One possibility could be to use the WIOD SUTs, since these are open also for commercial use, although this would only be relevant as a proof-of-concept.
To show the full quality potential of BONSAI it may initially be necessary to have access to a larger pool of data in the short term, which could mean working in parallel with data that are provided under a more restrictive license but are still open for non-commercial use (e.g. EXIOBASE Multi-Regional SUTs and/or Eora Multi-Regional Input Output tables, both free for universities or grant-funded research). We are also clarifying copyright issues with the IEA for energy data. Such more restricted data can be tagged with a reference to the license conditions, stored in the same database as the open data, and filtered out for specific calculations.
Technical specifications: For the beta version, an option could be to make a simple back-calculation from WIOD, with the simplifying assumption that no by-products exist. The mapping could be based on countries with more detailed data (look for a presentation on the topic from the last IO conference). The EORA database is a good example: its SUTs use different product classifications for the different countries, reflecting the detail of data collection in each country.
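A minimal sketch of such a back-calculation, assuming a product-by-product direct requirements matrix A and a total output vector x (the numbers are invented): under the no-by-products assumption the supply table is diagonal and the use table follows as U = A·diag(x).

```python
import numpy as np

# Illustrative back-calculation of supply and use tables from a direct
# requirements matrix, under the simplifying assumption that no by-products
# exist (each activity supplies exactly one product).

# A: direct requirements matrix (product inputs per unit of product output)
# x: total output per product (same product classification as A)
A = np.array([[0.10, 0.20],
              [0.05, 0.15]])
x = np.array([100.0, 50.0])

# Supply table V: diagonal, since each activity supplies only its own product
V = np.diag(x)

# Use table U: intermediate use of each product by each activity
U = A @ np.diag(x)

print("Supply table V:\n", V)
print("Use table U:\n", U)
```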
- Store license conditions with data in order to allow filtering data by license type;
- Establishing procedures for the systematic identification and filling of data gaps;
- Trade linking: What is a meaningful linking mechanism? Attributional trade linking (UN Comtrade) can inform about market scopes (size of regional markets); the identification of marginal suppliers then follows.
- Identification of data sources that can identify marginal suppliers, supporting a scientifically robust approach to consequential modelling, for example using ecoinvent technology levels and cross-elasticity information (product substitutability and complementarity).
- Priority: High;
- Estimated person-hours: 100
- Volunteer(s)/Candidate(s): Stefano, Konstantin
Functional specifications:
- Documenting procedures for back-calculating supply-use data for countries where only direct requirement matrices are available
- Document and justify procedures for data processing to produce balanced hybrid tables (experiences from EXIOBASE). Make explicit the currently applied procedures to identify and correct missing data.
Technical description:
- Analysing results (tracing back how the results were generated: which software, which monetary/physical conversion factors, which data and methodological assumptions)
- Explore and document what is already done, and refine this. (Konstantin and Stefano for exploration; coding: ??). See Wenz et al. 2015 and Wood et al. 2014 to get started.
The integration of data at many different levels of detail requires algorithms for automatic disaggregation and gap-filling, using global product and activity breakdowns based on industry data sources and data from countries with detailed data.
Priority: Low;
Estimated person-hours: ...
Volunteer(s)/Candidate(s):
Description of task: Routines to combine data from different sources, years, classifications, layers, and levels of aggregation in a common database; balancing, validation, and reprocessing occurring after filtering. Perform mass balance checks and avoid double counting (cross-check with FAO statistics).
Technical specifications: Give feedback to the filtering mechanism. All balancing (mass and monetary) can be done at the same time, but the final invertible matrix has to be a hybrid layer in natural units. Store balancing errors as separate vectors (re-balancing is only used as a last resort), as sketched below.
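As a rough illustration of the mass balance check and of storing balancing errors as separate vectors (the table contents and tolerance below are invented):

```python
import numpy as np

# Hypothetical hybrid supply (V) and use (U) tables in natural units, plus
# final demand y. Rows are products; columns are activities.
V = np.array([[100.0,  0.0],
              [  0.0, 50.0]])
U = np.array([[ 30.0, 10.0],
              [  5.0, 20.0]])
y = np.array([58.0, 24.0])   # final demand per product

# Product balance residual: total supply minus total use (intermediate + final).
# Instead of forcing the residual to zero by re-balancing, it is stored as a
# separate vector so that the original data remain traceable.
residual = V.sum(axis=1) - (U.sum(axis=1) + y)

tolerance = 0.05             # assumed threshold: 5% of total supply
relative = residual / V.sum(axis=1)
print("Balance residuals:", residual)
print("Products outside tolerance:", np.where(np.abs(relative) > tolerance)[0])
```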
Priority: Intermediate;
Estimated person-hours: ...
Volunteer(s)/Candidate(s): Stefano, Konstantin
Functional specifications: Routines for reconciling and re-balancing the disaggregated SUTs, including hSUTs. Make explicit the currently applied procedures to identify missing data and correct errors in SUTs.
Technical specifications: Gap-filling algorithms support the creation of a complete matrix (they do not have to be stored in RDF format): the breakdown of products and activities can be based on countries with detailed data; the resulting breakdown vectors should then be stored in RDF as global vectors applicable to all countries. Algorithms could be made more sector- or country-specific to improve the quality of the estimates.
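A minimal sketch of such a breakdown-vector approach, with invented product names and numbers:

```python
import numpy as np

# Detailed country: output of the disaggregated products that together make up
# an aggregate sector (e.g. a "dairy" sector split into milk, butter, cheese).
detailed_output = np.array([120.0, 30.0, 50.0])

# Global breakdown vector: shares of each detailed product within the aggregate.
shares = detailed_output / detailed_output.sum()

# Aggregate country: only the sector total is known.
aggregate_total = 80.0

# Gap-filling estimate for the aggregate country, to be refined with
# sector- or country-specific information where available.
estimate = aggregate_total * shares
print(dict(zip(["milk", "butter", "cheese"], estimate.round(2))))
```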
Questions/Discussions: If we want to base ourselves on supply-use data prior to any constructs, what procedures can we use for back-calculating supply-use data for countries where only direct requirement matrices are available?
Description of task: Build disaggregation routines (hybridization) for industries, commodities, and regions (subtracting and filling blanks). Example: Match an ILCD unit process dataset to the WIOD classification (as an illustration of the matching with a more detailed classification) and place it in the database as an additional dataset without subtracting it from the sector dataset that it falls under. Disaggregation would solve the problem of misaligned and/or partially overlapping classifications between different datasets. (Konstantin has a PIK reference for an algorithm.)
Technical specifications:
- Different options for software-assisted matching of data entries to the existing classifications exist for further investigation, both to classify data input and to query the database using natural-language free-text search. Semi-automatic matching routines facilitate matching when no correspondence table exists, e.g. the ISDATA search engine (see the sketch after this list).
- Matching is also required when databases have different levels of aggregation (e.g. industry flows and single process data). For an example of translation between numerous industrial classification codes, see the unstats example.
- To deal with synonyms, see elastic search.
- Automated tools may be supplemented by gaming to complete and validate the suggested matchings.
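A rough sketch of semi-automatic matching when no correspondence table exists, here using simple fuzzy string matching (the classification labels are invented; a real routine would combine this with correspondence tables and user confirmation):

```python
from difflib import get_close_matches

# Target classification the incoming data must be mapped to (invented labels).
target_classification = [
    "Cultivation of wheat",
    "Cultivation of rice",
    "Manufacture of dairy products",
    "Electricity production, coal",
]

# Free-text labels from an incoming dataset.
incoming_labels = ["wheat cultivation", "dairy product manufacture", "coal power"]

for label in incoming_labels:
    # Suggest up to three candidates; the data provider accepts or corrects them.
    suggestions = get_close_matches(label, target_classification, n=3, cutoff=0.3)
    print(f"{label!r} -> {suggestions}")
```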
- Priority: High
- Estimated person-hours:
- Volunteer(s)/Candidate(s): Stefano
Functional specifications: Construction of product recipes. A recipe links a product to a set of products; this is needed when there is not a one-to-one correspondence between what the user wants to analyze and the products of the datasets.
Technical specifications:
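Not part of the specification above, but as a minimal illustration (with invented product names and shares), a recipe could simply map the requested product to a weighted set of database products:

```python
# Illustrative only: a product "recipe" linking a product the user wants to
# analyze to a weighted set of products that exist in the database.
recipes = {
    "breakfast cereal, packaged": [
        ("wheat grain", 0.40),        # kg per kg of the analyzed product
        ("sugar", 0.15),
        ("cardboard packaging", 0.10),
        ("plastic film", 0.02),
    ],
}

def expand_recipe(product, amount=1.0):
    """Return database products and amounts implied by `amount` units of `product`."""
    if product not in recipes:
        return [(product, amount)]
    return [(item, share * amount) for item, share in recipes[product]]

print(expand_recipe("breakfast cereal, packaged", amount=2.0))
```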
Large amounts of footprint-relevant data are available in the public domain, but are not accessible in a ready-to-use and validated form for product footprints. It is possible to use automated procedures to harvest the data from the current disparate and incompatible databases and raw data sources, placing each piece of harvested data in the relevant database context.
- Priority: Intermediate;
- Estimated person-hours: ...
- Volunteer(s)/Candidate(s):
Task Description: Data Harvesting Scripts / Scrapers (Scraping of bulk data sources).
Harvesting data on process descriptions from existing literature.
Data such as FAOSTAT (or USDA data?) could be integrated first as an example.
Technical specifications: The DeepDive project is quite interesting for this and has already been applied to tasks such as improving paleontological databases.
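A minimal sketch of a bulk-data scraper for such sources; the URL below is a placeholder, not a verified FAOSTAT endpoint:

```python
import io
import zipfile

import pandas as pd
import requests

# Placeholder URL: FAOSTAT publishes bulk downloads as zipped CSV files,
# but the exact endpoint would need to be confirmed.
BULK_URL = "https://example.org/faostat/Production_Crops_E_All_Data.zip"

def fetch_bulk_csv(url: str) -> pd.DataFrame:
    """Download a zipped CSV bulk file and return it as a DataFrame."""
    response = requests.get(url, timeout=60)
    response.raise_for_status()
    with zipfile.ZipFile(io.BytesIO(response.content)) as archive:
        # Assume the archive contains a single CSV file.
        csv_name = archive.namelist()[0]
        with archive.open(csv_name) as handle:
            return pd.read_csv(handle)

# df = fetch_bulk_csv(BULK_URL)
# print(df.head())
```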
- Estimated person-hours: ...
- Volunteer(s)/Candidate(s):
Task Description: Text mining of specific data sources, e.g. corporate sustainability reports. Also, for understanding in a more structured way the raw text already contained in LCA data. To be considered as well: 100% manual digging and parsing of data (gaming approach).
Technical specifications:
Opportunities for machine learning and (semi-)automated procedures to replace activities currently requiring human intervention. A practical example could be implementing text mining of specific data sources, e.g. corporate sustainability reports.
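A toy sketch of extracting quantities from report text with regular expressions (the sample sentence and pattern are invented; real reports would need far more robust parsing or NLP):

```python
import re

# Invented sample sentence standing in for a corporate sustainability report.
text = (
    "In 2019 the group emitted 1.2 million tonnes of CO2 "
    "and consumed 350 GWh of electricity."
)

# Very rough pattern: a number, an optional magnitude word, and a unit.
pattern = re.compile(
    r"(?P<value>\d+(?:\.\d+)?)\s*(?P<magnitude>million|thousand)?\s*"
    r"(?P<unit>tonnes of CO2|GWh)",
    re.IGNORECASE,
)

for match in pattern.finditer(text):
    print(match.groupdict())
```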
Data updates and adding new data points: potential to assign tasks to Master's students, group work, and classroom projects. One of the flow-property layers has to be defined as the natural unit for each product. The natural unit is the one that the product cannot lose without losing its meaning.
- Priority: Low;
- Estimated person-hours:
Task Description: Storing stock information (economic and environmental), separate from the accumulation of stocks, to serve the material flow community.
Technical specifications:
This working group should allow for data upload. A data format template will be specified to allow users to add data to the RDF store. To further integrate the new data into the existing RDF store, the template will have to specify the six dimensions (Activity, Flow-object, Flow-property, Time, Location, Macro-economic scenario) corresponding to each datapoint. This way, for example, if the data provided are region-specific while the database only contains country-specific data, the parser should be able to “relocate” the dataset within the RDF store using the pre-established correspondence tables.
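A minimal sketch of how one datapoint could carry the six dimensions in RDF, using rdflib; the namespace and term names are invented placeholders, not the actual BONSAI ontology:

```python
from rdflib import Graph, Literal, Namespace, RDF
from rdflib.namespace import XSD

# Invented placeholder namespace; the real BONSAI ontology terms would differ.
B = Namespace("http://example.org/bonsai/")

g = Graph()
dp = B["datapoint/0001"]

g.add((dp, RDF.type, B.Datapoint))
g.add((dp, B.activity, B["activity/steel_production"]))
g.add((dp, B.flowObject, B["flow-object/co2"]))
g.add((dp, B.flowProperty, B["flow-property/mass"]))
g.add((dp, B.time, Literal("2016", datatype=XSD.gYear)))
g.add((dp, B.location, B["location/DE"]))
g.add((dp, B.scenario, B["scenario/baseline"]))
g.add((dp, B.value, Literal(1.85, datatype=XSD.double)))

print(g.serialize(format="turtle"))
```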
- Priority: High;
- Estimated person-hours: 100
- Volunteer(s)/Candidate(s):
Functional specifications: The template facilitates data uploads. Users should be able to upload new data by using a standardized template for data addition. Data will then be integrated into calculations if they fulfil the validation requirements of the Data Parser. We can hint at how editing will be allowed in the future (versioning and review implementation).
The template should provide the requirements regarding the formatting of the raw data, allowing uploading of incomplete or conflicting datasets.
The template should keep comments and annotations to encourage experts to supply extra information (RDF can handle this). Three options: 1) take extra data, with the user providing a label (suggestions are provided from a drop-down); 2) just store additional data without user refinement; 3) discard additional information. Only accept datasets that cover a time period of at least one year, as in ecoinvent.
- Priority: High;
- Estimated person-hours: 100
- Volunteer(s)/Candidate(s):
Functional specifications: The browser-based interface should allow users to upload and parse xls files with process information. The six identifiers (activity, flow, flow-property, time, space, scenario) are parsed, and the DB parser suggests a machine-made interpretation of the data, which the data provider can accept, modify, or use to refine their process description. This helps the data provider by giving feedback on the readability of the data before uploading. Three options:
- Take extra data, with the user providing a label (suggestions are provided from a drop-down);
- Just store additional data without user refinement;
- Discard additional information.
A quality score could be attached to conflicting data from competing sources. Only accept datasets that cover a time period of at least one year, as in ecoinvent.
Technical specifications: All formats are text-based. Example of a tool for format conversions: the format converter should turn any entry into a Resource Description Framework (RDF) format. Data format conversion is also required to convert RDF to CSV format for matrix inversion.
XML schemas can be translated to RDF/OWL via XSLT conversion of ILCD to RDF
- Excel template --> parsing to RDF store (Stefan)
- Ecospold2 files --> parsing to RDF store (Chris Mutel)
- ILCD files --> parsing to RDF store? (start with https://github.com/cbdavis/ILCD-to-RDF)
Browser integration of the above is required.
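A minimal sketch of applying such an XSLT conversion with lxml; the file names are placeholders, and an actual stylesheet would come from the repository linked above:

```python
from lxml import etree

# Placeholder file names; an actual stylesheet could be taken from the
# ILCD-to-RDF repository referenced above.
xslt_doc = etree.parse("ILCD-to-RDF.xsl")
transform = etree.XSLT(xslt_doc)

# Apply the stylesheet to an ILCD process dataset and write the RDF result.
source = etree.parse("ilcd_process_dataset.xml")
rdf_result = transform(source)

with open("ilcd_process_dataset.rdf", "wb") as out:
    out.write(etree.tostring(rdf_result, pretty_print=True))
```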
- Priority: Low;
- Estimated person-hours: 100
- Volunteer(s)/Candidate(s):
To allow detailed modelling of very specific activities, the core database also needs to be open for storing data at any level of disaggregation:
- For very specific industrial or household activities,
- For very specific products and environmental flow-objects,
- For complete production functions of an activity,
- At any level of geographical detail,
- As well as for single datapoints (flows) for activities or flow-objects for which this datapoint is the only one available at that level of detail.
- Supply and Use Tables
Technical specifications:
Interface for batch editing of the RDF store, using difference files (see the sketch at the end of this page)? Needs use cases and stress testing.
- Expanding Excel template parsing to cover more data fields and less well-structured files (machine-made interpretation of the data, which the data provider can accept, modify, or use to refine the description)
- Expanding parsing to cover other file types than Excel
- Allowing comments/edits on wikidata query interface to be stored as RDF difference files?
- What sort of provenance checks need to be done on public edits?
Flagged revisions is an extension used for some Wikipedia language versions so that people can contribute, but their edits do not show up until verified. We could also think of a semi-automatic way of correcting edits that do not meet the quality requirements, e.g. allowing an error margin for mass balance issues.
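On the difference-file idea above, a minimal sketch of computing the triples added and removed between two versions of a graph with rdflib (the graph contents are invented):

```python
from rdflib import Graph
from rdflib.compare import graph_diff, to_isomorphic

# Two invented versions of a small RDF store.
before = Graph().parse(data="""
@prefix b: <http://example.org/bonsai/> .
b:dp1 b:value 1.0 .
""", format="turtle")

after = Graph().parse(data="""
@prefix b: <http://example.org/bonsai/> .
b:dp1 b:value 1.2 .
b:dp2 b:value 3.0 .
""", format="turtle")

# graph_diff returns (triples in both, only in first, only in second);
# the last two sets are exactly the content of a "difference file".
in_both, removed, added = graph_diff(to_isomorphic(before), to_isomorphic(after))

print("Removed triples:", len(removed))
print("Added triples:", len(added))
```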