providers.txt

==========
Fri Jan 19 10:13:42 -0800 2018

The New Face of LcResource Providers, after the Great Context Refactor:

0. Archives have a source, reference, and namespace uuid

A central authority ties reference to namespace uuid, and delivers these out to clients after authentication (or for free, in the case of public resources).  Knowing the namespace uuid allows a client to correctly map a key to a UUID and retrieve information about the entity.  Alternately, clients that simply know the UUID can query for it but may not have access to retrieve the entity.
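
A minimal sketch of the key-to-UUID mapping, assuming the namespace scheme is name-based in the style of uuid3 (the notes do not say whether uuid3 or uuid5 is used). The namespace value and the function name key_to_uuid are hypothetical.

```python
import uuid

# Hypothetical namespace uuid, standing in for one delivered by the central authority.
NS_UUID = uuid.UUID('12345678-1234-5678-1234-567812345678')


def key_to_uuid(ns_uuid, key):
    """Deterministically map a key to a UUID within a namespace (uuid3 is an assumption)."""
    return uuid.uuid3(ns_uuid, key)


# Any client holding the same namespace uuid derives the same UUID for the same key.
first = key_to_uuid(NS_UUID, 'carbon dioxide')
second = key_to_uuid(NS_UUID, 'carbon dioxide')
```

A client that lacks the namespace uuid cannot reproduce this mapping, but can still query by a UUID it already knows.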

An LcResource-- contained in the interfaces package-- is a spec for creating an archive.

The interfaces package also contains a factory for creating a small subset of archives-- those that do not require the resources package-- to wit, anything that only depends on the BasicArchive: Qdb, Antelope clients, and Foregrounds.

0.1 Foregrounds

The purpose of a foreground is to collect (or create) observations from both the real world and from catalogued resources.

Contents of a foreground:
 - flows
 - contexts
 - quantities
 - fragments

LcForeground inherits from BasicArchive and adds fragments and ForegroundImplementation; cannot store processes.  Foregrounds do not have an nsuuid ((??))-- they require a catalog and use UUIDs from catalog references. However, they have a bespoke list of fragment names that allow users to give non-uuid external refs to their fragments. AND they have the Qdb upstream so the user can still __getitem__ using canonical names.

When new flows are created, they may be renamed in the foreground (a uuid derived from the name would then no longer match it), which is why nsuuid should not be used to generate uuids; instead uuid4 should be used.

0.1.1 Transitioning Foregrounds to use link instead of uuid

In order for foregrounds to store entities from multiple origins with the same uuid or external ref, foregrounds will start using links as the default keys.  This should not be too hard to deal with, with the proviso that if multiple entities have the same external ref, one will be picked nondeterministically.

Maybe we should transform the entire system to use links instead of uuids.  What is the advantage of uuids, if we consider them to no longer be unique? why do we even have uuids?

 - they are unique within an origin; plus, lots of people already use them.  foregrounds ARE special in that they are designed to hold information from many sources, and in fact they NEED to be able to distinguish between differently-versioned entities in ways that other BasicArchive subclasses don't.


The proposed plan is:

 * create a _uuid_map defaultdict that maps uuid to a set of links
 * _key_to_id returns a valid key to _entities
   - in LcForeground, it accommodates links, uuids (nondeterministically), or fragment names
 * change _add signature to _add(entity, key)
   - update BasicArchive.add() to use entity.uuid as key
   - update LcForeground.add() to use entity.link as key
 * some minor plumbing to ensure that client code's expectation that _key_to_id returns a uuid is not violated

and that's it!
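
A rough sketch of the plan, under stated assumptions: the Entity stand-in, the 'origin/external_ref' link format, the _frag_names dict and the name_fragment() helper are all hypothetical, and the real BasicArchive / LcForeground classes do more than this.

```python
from collections import defaultdict


class Entity:
    """Hypothetical stand-in for an entity with an origin, uuid and external ref."""
    def __init__(self, origin, uuid, external_ref=None):
        self.origin = origin
        self.uuid = uuid
        self.external_ref = external_ref or uuid

    @property
    def link(self):
        # Assumed link format: origin/external_ref
        return '%s/%s' % (self.origin, self.external_ref)


class BasicArchiveSketch:
    def __init__(self):
        self._entities = {}

    def _key_to_id(self, key):
        """Return a valid key to _entities, or None."""
        return key if key in self._entities else None

    def _add(self, entity, key):
        if key in self._entities:
            raise KeyError('Entity already exists: %s' % key)
        self._entities[key] = entity

    def add(self, entity):
        self._add(entity, entity.uuid)  # basic archives key on uuid

    def __getitem__(self, key):
        return self._entities.get(self._key_to_id(key))


class ForegroundSketch(BasicArchiveSketch):
    def __init__(self):
        super().__init__()
        self._uuid_map = defaultdict(set)  # uuid -> set of links
        self._frag_names = {}              # user-assigned name -> link

    def _key_to_id(self, key):
        if key in self._entities:          # already a link
            return key
        if key in self._frag_names:        # fragment name
            return self._frag_names[key]
        links = self._uuid_map.get(key)    # uuid: arbitrary pick among matches
        if links:
            return next(iter(links))
        return None

    def add(self, entity):
        self._add(entity, entity.link)     # foregrounds key on link
        self._uuid_map[entity.uuid].add(entity.link)

    def name_fragment(self, name, link):
        self._frag_names[name] = link


fg = ForegroundSketch()
e1 = Entity('local.a', 'u1')
e2 = Entity('local.b', 'u1')  # same uuid, different origin
fg.add(e1)
fg.add(e2)
```

Here fg['local.a/u1'] always returns e1, while fg['u1'] returns whichever of e1 or e2 the set iteration happens to yield first.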


0.2 Archives

LcArchive is contained in the resources package-- it provides the ability to handle reference processes, allocations, and compute background systems (with partial ordering and foreground generation).  The main purpose of an archive is to extract tagged entities from a data source, and to store and make available two composite data types:

 = Exchange
   - relates one process to one flowable in one direction, with an optional termination
   - Exchange value is a proportionality between the magnitude of the one flow and the process's reference

 = Characterization
   - relates one flowable to one quantity in one context, with an optional locale
   - Characterization factor is a proportionality between the magnitude of the quantity and the flow's reference 

Concisely, foregrounds are for collecting observed exchanges; the Qdb is for collecting observed characterizations.
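
The two composite types could be sketched roughly as follows. Field names, the use of plain strings for entity references, and the 'GLO' default locale are assumptions for illustration, not the actual class definitions.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass(frozen=True)
class Exchange:
    """One process, one flowable, one direction, optional termination."""
    process: str
    flowable: str
    direction: str            # 'Input' or 'Output'
    value: float              # flow magnitude per unit of the process's reference
    termination: Optional[str] = None


@dataclass
class Characterization:
    """One flowable, one quantity, one context; factors keyed by locale."""
    flowable: str
    quantity: str
    context: str
    factors: Dict[str, float] = field(default_factory=dict)

    def factor(self, locale='GLO'):
        # Fall back to the assumed default locale when none is recorded for `locale`.
        return self.factors.get(locale, self.factors.get('GLO'))


x = Exchange('electricity production', 'carbon dioxide', 'Output', 0.85)
cf = Characterization('carbon dioxide', 'global warming potential', 'air', {'GLO': 1.0})
```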

LcArchive inherits from BasicArchive and adds Inventory, Background, and Configure; cannot store fragments.

LcArchives (with background) can produce exchange lists that can be used to generate fragments in foreground systems; similarly, foreground systems (once constructed) can be "archived" into JSON archives in which fragment systems (under a given scenario specification) are hardened into unit processes. (this still TODO)


I. Archives store contexts, flowables, quantities, and processes (and sometimes fragments)

All entities have a UUID. 

Entity uuids are either supplied directly via LcEntity(), or computed from the namespace via LcArchive.new(etype, name).  Do away with LcEntity.new()

In the ILCD (and ecospold2) case, UUIDs identify flows (or intermediate or elementary exchanges), which are flowable + context.  We want to store these too, right? so ILCD archives (and maybe ecoinvent archives too) have a _flow_list dict, which maps uuid to a flow+context pair for quick retrieval.  We can investigate whether this should be a more general purpose feature, but I think it should only be used in instances where archives already have stable UUIDs that we want to preserve-- but these will not be serialized.

I.0 Accessing root data

Archives are also the framework that should be used to provide generalized access to source data (e.g. execute XML queries).  Still need to figure this out, but there should be some way to ask an archive to perform an operation on the source file for a given entity.  The implementation would have to be subclass-specific, since flows and quantities don't have a universal representation across archive subclasses.

Well, the xml-based ones [unevenly] implement .objectify() so that at least gets us the raw info.

I.1 Characteristics of Entity Types:

Context:
 - Name is unique ID
 - Reference is nominal direction. Not sure how to figure this out. One idea:
   = _sink: +1 for every time the context is observed at the end of an output flow;
   = _source: +1 for every time the context is observed at the end of an input flow;
   = direction: Input if (_sink - _source) < 0 else Output
   = this works fine as long as an archive has 0 or few inverted processes
 - optional parent (Adding a parent will update parent's internal list of children)
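
A small sketch of the direction-counting idea for contexts. The class name ContextSketch and the observe() method are hypothetical; only the _sink / _source counters and the direction rule come from the notes.

```python
class ContextSketch:
    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.children = []
        self._sink = 0    # times seen at the end of an output flow
        self._source = 0  # times seen at the end of an input flow
        if parent is not None:
            parent.children.append(self)  # adding a parent updates its child list

    def observe(self, exchange_direction):
        """Record one exchange that terminates in this context."""
        if exchange_direction == 'Output':
            self._sink += 1
        elif exchange_direction == 'Input':
            self._source += 1
        else:
            raise ValueError('Unknown direction: %r' % exchange_direction)

    @property
    def direction(self):
        return 'Input' if (self._sink - self._source) < 0 else 'Output'


emissions = ContextSketch('emissions')
air = ContextSketch('air', parent=emissions)
for d in ('Output', 'Output', 'Output', 'Input'):  # one inverted process
    air.observe(d)

resources = ContextSketch('resources')
resources.observe('Input')
resources.observe('Input')
```

Note that with this rule a tie (including a context that has never been observed) reports Output.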

Quantity: (no changes)
 - Name or unitstring is unique ID (unitstring only in deficient data systems e.g. ecospold v1)
 - Reference Unit
 - optional UnitConversion property as implemented

Flowable: (deprecate Compartment property in favor of context entity)
 - Name is unique ID
 - reference quantity-- has characterization factor 1 in all contexts
 - required CasNumber (may be blank)
 - optional Classification (e.g. 'halogenated organic compound')
 - Characterizations: unordered set of
    (quantity, flowable, context) tuples with an included dict of [location]->factor

Process: (no changes) (only permitted in LcArchive subclasses)
 - UUID is unique ID (could be derived from name)
 - required SpatialScope, TemporalScope
 - optional Classification (e.g. 'plastics and rubber manufacturing')
 - Exchanges: unordered set of
    (process, flowable, direction, termination) tuples, as implemented
 
Fragment: (no changes) (only permitted in LcForeground / subclasses)
 - UUID4 or user-set name is unique ID
 - required flowable + direction
 - optional parent fragment is reference entity
 - optional balance setting, background setting, exchange_value
 - constructed exchange values + terminations by scenario


I.2 


Below is crufty-- or already implemented


==========
Tue Jan 09 10:16:59 -0800 2018

The classes in the 'providers' subdirectory all function to open different data sources and extract information into the archive format (or into the interface specification)

This helps me clarify what exactly is required to subclass an LcArchive-- I think it's mainly two things:

 - _fetch(uid)
 - _load_all()

_load_all() is easy-- just list the processes and load them.

_fetch() is where all the complexity is: need to recursively fetch quantities for flows for exchanges, just like others.
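
To illustrate the two-method contract, here is a toy provider whose "data source" is a nested dict. Everything except the names _fetch and _load_all is hypothetical (ArchiveSketch, DictProvider, retrieve_or_fetch, and the 'requires' field), and the sketch assumes the dependency graph has no cycles.

```python
class ArchiveSketch:
    def __init__(self):
        self._entities = {}

    def _fetch(self, uid):
        raise NotImplementedError

    def _load_all(self):
        raise NotImplementedError

    def retrieve_or_fetch(self, uid):
        """Return a stored entity, or fall back to the subclass's _fetch."""
        if uid in self._entities:
            return self._entities[uid]
        return self._fetch(uid)


class DictProvider(ArchiveSketch):
    def __init__(self, source):
        super().__init__()
        self._source = source

    def _fetch(self, uid):
        record = self._source[uid]
        # Recursively resolve dependencies first: process -> flows -> quantities
        deps = [self.retrieve_or_fetch(d) for d in record.get('requires', [])]
        entity = {'uid': uid, 'type': record['type'], 'requires': deps}
        self._entities[uid] = entity
        return entity

    def _load_all(self):
        # Just list the processes and load them; everything else comes along.
        for uid, record in self._source.items():
            if record['type'] == 'process':
                self.retrieve_or_fetch(uid)


source = {
    'p1': {'type': 'process', 'requires': ['f1']},
    'f1': {'type': 'flow', 'requires': ['q1']},
    'q1': {'type': 'quantity'},
    'q2': {'type': 'quantity'},  # not referenced by any process
}
provider = DictProvider(source)
provider._load_all()
loaded = sorted(provider._entities)
```

Loading only the processes pulls in f1 and q1 through _fetch, while the unreferenced q2 stays unloaded.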


Thu 2018-01-11 15:34:51 -0800

Now that we've [finally] gotten the resource generation working right, we need to think about a couple things:

 1- modifying archives.  It would be really slick if making alterations to an archive automatically assigned it a new semantic reference.  We could do that with archive tools, but there is no real way for an archive-- let alone a resource-- to detect when entities stored within it are modified.  What we can do is create an export function that spins out a new archive by appending a new semantic tag or by incrementing / modding the trailing tag.

 right now we can create archives, with arbitrary names, and they get stored in the catalog directory.  This is just a wrapper around create_source_cache() which is a bit indirect.

 To do above, what we want to do is essentially change the resource's reference-- which means relocating it in the resolver's index.

 The problem is: non-JSON-type resources should not be writable.  The only kinds of things you should be able to save from a non-writable resource are indices and archives.  (but the save function needs to still be enabled for these resources for that reason).  When you want to make a new archive, you should do so explicitly.

We should somehow log the provenance of the archive? I guess by enforcing similarity in the semantic ref.

So what should we be allowed to do? only append.  And the append

When we create a NEW resource, we should append an 8-character date code to it; spinoff references should have the date removed, additional segment appended, new date appended.
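
The naming rule can be sketched as below, using the local.uslci.ecospold case from the recap in these notes. The function names new_ref and spinoff_ref are hypothetical, and the date code is assumed to be YYYYMMDD.

```python
import re
from datetime import date

_DATE_CODE = re.compile(r'^\d{8}$')  # assumed format: YYYYMMDD


def new_ref(base, when):
    """Semantic reference for a NEW resource: base plus a date code."""
    return '%s.%s' % (base, when.strftime('%Y%m%d'))


def spinoff_ref(ref, segment, when):
    """Spinoff reference: drop any trailing date, append segment, append new date."""
    parts = ref.split('.')
    if _DATE_CODE.match(parts[-1]):
        parts = parts[:-1]
    parts.append(segment)
    parts.append(when.strftime('%Y%m%d'))
    return '.'.join(parts)


clean = spinoff_ref('local.uslci.ecospold', 'clean', date(2018, 1, 12))
# clean == 'local.uslci.ecospold.clean.20180112'
```

Because only the trailing date is stripped, the earlier segments survive in every descendant, which gives the provenance-by-similar-ref behavior described in these notes.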

Just to recap-- what we are doing to local.uslci.ecospold is creating local.uslci.ecospold.clean.20180112, which we do by the following steps:
 + user supplies new semantic name segment
 * construct new semantic name
 * load_all()
 * alter archive's source
 * make_cache(_archive_dir/new.semantic.name.json.gz)
 * resource from existing archive (source = cache, ref = new.semantic.name, dstype = json; interfaces, privacy, priority, args kept)
 *

Fri 2018-01-12 13:16:11 -0800
In fact, almost all of this should happen inside the archive itself.  [done]

Now, what we must do is:
 + user supplies new semantic name segment
 * archive creates descendant
 * resource from existing archive
 * remove archive from old resource

==========
Mon Jan 15 15:22:59 -0800 2018

That's all done-- now the problem is persistent configuration.  How should this work?

 - configuration options are established by the archive providers that are being configured.

 - The four core configuration options, established in LcArchive, are:
    - ConfigSetReference -- specify a reference exchange for a process
    - ConfigBadReference -- remove the reference designation for an exchange
    - ConfigFlowCharacterization -- specify a quantity + value (and optional location) for a flow
      : note: this will get complicated, requiring context + quantity + value + optional location after refactor
    - ConfigAllocation -- specify a quantity to be used for partitioning allocation
 
 - a resource author has to manually configure the resource-- this should be done interactively with something that knows all the various config options available.

 - config options are:
   = named tuples? just sequences of values to be assigned to a particular config method
   = subclasses? hold their own serialization + validation checking properties

The usage I want to demonstrate is:
 - configuration is already set and stored in the resource
 - configuration gets deserialized + added by the archive
 - resource must be saved manually
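
One way the named-tuple option might look, as a sketch only. The four option names come from the list in these notes; the field names, the JSON layout, and the serialize_config / deserialize_config helpers are assumptions.

```python
import json
from collections import namedtuple

# Field names are illustrative guesses, not the real signatures.
ConfigSetReference = namedtuple('ConfigSetReference', ('process_ref', 'flow_ref', 'direction'))
ConfigBadReference = namedtuple('ConfigBadReference', ('process_ref', 'flow_ref', 'direction'))
ConfigFlowCharacterization = namedtuple('ConfigFlowCharacterization',
                                        ('flow_ref', 'quantity_ref', 'value', 'location'))
ConfigAllocation = namedtuple('ConfigAllocation', ('quantity_ref',))

CONFIG_TYPES = {t.__name__: t for t in (ConfigSetReference, ConfigBadReference,
                                        ConfigFlowCharacterization, ConfigAllocation)}


def serialize_config(options):
    """Turn a list of config tuples into a JSON string for storage in the resource."""
    return json.dumps([{'option': type(o).__name__, 'args': list(o)} for o in options])


def deserialize_config(text):
    """Rebuild config tuples from the stored JSON, as the archive would on load."""
    return [CONFIG_TYPES[d['option']](*d['args']) for d in json.loads(text)]


stored = serialize_config([
    ConfigSetReference('p1', 'f1', 'Output'),
    ConfigFlowCharacterization('f1', 'mass', 1.5, None),
    ConfigAllocation('mass'),
])
restored = deserialize_config(stored)
```

The subclass alternative would move validation into each option class; the tuple version keeps the stored form trivially simple but leaves validation to whatever applies the options.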