Creating and running new modules
To run a reduction routine, a template and a configuration must be defined. The Template class is defined in /dataflow/core.py, along with other useful classes such as Module, Datatype, and Instrument. To define a template, the modules, wires, and instrument must be specified. An example data reduction can be seen in /dataflow/sample_reduction.py, where random values under a certain limit are added to or subtracted from the data. If you follow along in the sample reduction routine, you will see that the following steps are needed to run the reduction:
- define the module according to the fields of the Module class
- have a data file to manipulate (in the sample reduction, the data is hardcoded into the program; in reality the file won't be needed because the template will be sent over the web as a JSON object)
- create the instrument and datatype
- register the instrument, which in turn registers the modules
- create a template and configuration (I believe the template will be sent over the web, right? Converting between the template and the wireit formats can be seen in /dataflow/wireit.py and at the end of /dataflow/sample_reduction.py)
- call run_template(template, config) to perform the reduction
The first step in defining a new module is to write a method that can create instances of that type of module. This extra method is needed, rather than calling the Module constructor directly, because all modules of this type share certain characteristics. For example, if you look at the method random_module, you can see that all of these so-called random modules have the same name, icon, and terminals.
To walk through this method: an icon is created first. The icon specifies the location of the image as well as the offset positions and directions of the terminals, formatted as (x, y, dx, dy). Next, the properties of the terminals are given. Each terminal (input and output) has an id, datatype, use, and description, but the input terminal has two additional descriptors: 'required', which is set to True if input is needed, and 'multiple', which is set to True if multiple inputs can be handled. Lastly, any fields that are going to be used should be defined in this method. In the example, random_field is defined as an integer named max_change. This max_change is the maximum amount of displacement from the original data: |transformed_data - data| <= |max_change|.
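As a rough illustration of this factory pattern, a sketch is given below. It is modeled on the description above rather than copied from /dataflow/sample_reduction.py, so the exact keyword names and the Module/icon structures should be treated as assumptions.

```python
# Sketch of a factory for "random" modules; constructor arguments are assumed
# from the description above and may not match /dataflow/core.py exactly.
from dataflow.core import Module

def random_module(id=None, datatype=None, action=None, version='0.0', fields=[]):
    """Create a module that randomly perturbs its input data."""
    # Icon: image location plus terminal offsets and directions as (x, y, dx, dy).
    icon = {
        'URI': '/static/img/random.png',   # assumed image path
        'terminals': {
            'input': (0, 10, -1, 0),
            'output': (20, 10, 1, 0),
        },
    }

    terminals = [
        dict(id='input', datatype=datatype, use='in',
             description='data to be randomized',
             required=True,     # input is needed
             multiple=False),   # only a single input is handled
        dict(id='output', datatype=datatype, use='out',
             description='randomized data'),
    ]

    # Field used by the action: |transformed_data - data| <= |max_change|
    random_field = {
        'type': 'int',
        'label': 'maximum change',
        'name': 'max_change',
        'value': 0,
    }

    return Module(id=id,
                  name='Random',
                  version=version,
                  description='add or subtract a bounded random amount',
                  icon=icon,
                  terminals=terminals,
                  fields=[random_field] + fields,
                  action=action)
```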
Now the actual module itself must be created, which means an action and a datatype must be defined. Right now the datatype only needs an id, which is initialized to 'data1d.rowan'. The 'action' parameter of the Module constructor takes a method, so the method random_action is created. As can be seen from its helper methods, the action just modifies the data randomly and returns the result.
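A matching sketch of the datatype and the action follows. The convention that an action receives its inputs and field values as keyword arguments and returns a dict keyed by output terminal id is assumed here, not taken from the source.

```python
import random
from dataflow.core import Datatype

# For now the datatype only needs an id.
data1d = Datatype(id='data1d.rowan')

def random_action(input=None, max_change=0, **kwargs):
    """Randomly add or subtract up to |max_change| from each data point."""
    result = [value + random.randint(-abs(max_change), abs(max_change))
              for value in input]
    return dict(output=result)   # keyed by the output terminal id (assumed)

# Build the module with the factory sketched above.
rowan_random = random_module(id='rowan.random',
                             datatype='data1d.rowan',
                             version='1.0',
                             action=random_action)
```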
Two other methods, load and save, are also defined; they provide an interface to the "fake" data store. Specifically, this data store provides two files that can be loaded and transformed. When the code is actually run, f1.rowan26 is loaded, randomly transformed, and then saved with a new extension.
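For illustration, the load/save pair might look something like the sketch below, with the "fake" data store reduced to an in-memory dictionary; the real sample code may structure both differently.

```python
# Stand-in for the "fake" data store: two files' worth of hard-coded data.
FAKE_STORE = {
    'f1.rowan26': [1.0, 2.0, 3.0, 4.0, 5.0],
    'f2.rowan26': [10.0, 20.0, 30.0, 40.0, 50.0],
}

def load_action(files=None, **kwargs):
    """Load the named files from the fake store."""
    return dict(output=[FAKE_STORE[name] for name in files])

def save_action(input=None, ext=None, **kwargs):
    """'Save' the transformed data under a new extension (here, just print it)."""
    for data in input:
        print('saving with extension %r: %r' % (ext, data))
    return {}
```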
At this point, you should have reached the "Data and instrument definitions" section. In order to use this module, some instrument must be associated with it. In this case, the instrument ROWAN26 is created with its allowable datatypes, required scripts, and a menu of modules; for the module to be usable, the menu must contain it. The instrument is then registered, which in turn registers the modules it uses.
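A sketch of the instrument definition and registration is shown below; the id, name, script path, and menu grouping are illustrative, and rowan_load/rowan_save stand for load/save modules built with factories analogous to random_module.

```python
from dataflow.core import Instrument, register_instrument

# rowan_load and rowan_save are assumed to be Module instances built with
# factories analogous to random_module; rowan_random and data1d come from
# the sketches above.
ROWAN26 = Instrument(
    id='ncnr.rowan26',                       # illustrative id
    name='ROWAN26',
    datatypes=[data1d],                      # allowable datatypes
    requires=['/static/js/rowan26.js'],      # required scripts (assumed path)
    menu=[('Input', [rowan_load, rowan_save]),
          ('Reduction', [rowan_random])],    # the menu must contain the module
)

# Registering the instrument registers the modules it uses.
register_instrument(ROWAN26)
```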
To run the reduction, a Template (from /dataflow/core.py) must be created, along with a configuration. The template consists of a name, description, modules, wires, and instrument id. The modules entry is a list of dictionaries, each representing a module used in the reduction; each dictionary contains the module id, position, and any configuration. For example, the second module, rowan.random, has the configuration "'max_change': 50", which means the largest difference between the transformed data and the original data is 50. Module settings can be given either in the config array or where they are currently, which is why the config array is empty here (config[i] is the configuration for modules[i]). The wires array represents each of the wires in the reduction routine: in the example, the output of the load module (module 0) is fed into the input of the random module (module 1), and the output of the random module is fed into the input of the save module (module 2). After this, it is simple; the reduction is run with run_template(template, config) and the result is printed.
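Putting the pieces together, the template, wires, and final call might look roughly like this; the dictionary keys for the module and wire entries are inferred from the description (and from the wireit conversion), so treat them as assumptions rather than the exact sample code.

```python
from dataflow.core import Template, run_template

modules = [
    dict(module='rowan.load',   position=(5, 20),   config={'files': ['f1.rowan26']}),
    dict(module='rowan.random', position=(160, 20), config={'max_change': 50}),
    dict(module='rowan.save',   position=(280, 40), config={'ext': 'random'}),
]

# Wires: load (module 0) output -> random (module 1) input,
#        random (module 1) output -> save (module 2) input.
wires = [
    dict(source=[0, 'output'], target=[1, 'input']),
    dict(source=[1, 'output'], target=[2, 'input']),
]

template = Template(
    name='test rowan',
    description='example reduction',
    modules=modules,
    wires=wires,
    instrument=ROWAN26.id,
)

# Per-module settings could go here instead of in the module dicts above;
# config[i] configures modules[i], so it is left empty in this example.
config = [{}, {}, {}]

result = run_template(template, config)
print(result)
```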
We should also have an "activated" and perhaps a "busy" state for a module. The active flag would denote whether the module should be used as part of the chain, or whether values should just "flow through". An example would be the SANS pipeline: sometimes users may or may not want to put their data on an absolute scale. I expect this will be a common motif when we provide "standard" templates for users; the template could still be used, but a particular module could simply be turned "off". This could fit into the current architecture as-is, but let's promote it to a required part of a template (with a default of True) so that we can rely on it for calculations, diagramming, etc., rather than leaving it as a suggested practice.
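Purely as a sketch of the proposal (none of this exists yet), an 'active' flag on each template module might be handled along these lines; run_one and apply_action are hypothetical helpers used only to show the pass-through behaviour.

```python
# Hypothetical: each module entry carries an 'active' flag (default True).
modules = [
    dict(module='sans.load',           config={}, active=True),
    dict(module='sans.absolute_scale', config={}, active=False),  # turned "off"
    dict(module='sans.save',           config={}, active=True),
]

def run_one(module_entry, inputs):
    """If the module is inactive, let values flow through unchanged."""
    if not module_entry.get('active', True):
        return inputs                           # pass-through
    return apply_action(module_entry, inputs)   # hypothetical helper
```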
Another needed addition is a "stored" property, which refers to data that has already been calculated. For expensive filters, we can then cache the results in a key-value store (using Redis) with an expiration of one week, unless the results are revisited (which resets the expiration). The filter should also have a "dirty" flag associated with it to determine whether the cached data is still valid. For ourselves, if our version changes, then our dirty flag should be set. Then, in the calculation pipeline, we check our immediate upstream ancestors to see if their dirty flag is set; this check should trigger their own check of their ancestors, and so forth, until we reach a leaf. The results are then reported back along the chain as a series of callback functions. Thus, if anything changes upstream, we know to invalidate the cache and recalculate ourselves instead of just referring to data in the cache.
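The dirty-flag check might be sketched as below. The module graph, the dirty and cache_key attributes, and the compute callable are placeholders; the recursion stands in for the chain of callbacks described above, and redis-py is used for the key-value store.

```python
import redis

ONE_WEEK = 7 * 24 * 60 * 60           # expiration in seconds

cache = redis.Redis()                 # key-value store for expensive filter results

def is_dirty(node, upstream):
    """Return True if this node or anything upstream of it has changed.

    `upstream[node]` is assumed to list the node's immediate ancestors;
    `node.dirty` is assumed to be set when, e.g., the module version changes.
    """
    if node.dirty:
        return True
    # Ask each immediate ancestor; the check recurses until it reaches a
    # leaf (a module with no inputs), and the answers propagate back.
    return any(is_dirty(parent, upstream) for parent in upstream[node])

def cached_result(node, upstream, compute):
    """Reuse the cached value unless something upstream invalidates it."""
    key = node.cache_key                        # hypothetical attribute
    if not is_dirty(node, upstream):
        stored = cache.get(key)
        if stored is not None:
            cache.expire(key, ONE_WEEK)         # revisiting resets the expiration
            return stored
    result = compute(node)                      # recalculate ourselves
    cache.setex(key, ONE_WEEK, result)          # cache with a one-week expiration
    node.dirty = False
    return result
```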