
NOTICE: This page is out-of-date! The general principles are correct but some of the details (exact directory structure) are inaccurate.

DeepForge Pipeline Design

Pipelines (or workflows) in DeepForge allow users to diagram batches of operations to perform on data. This may include data normalization, data transformation, model ensembling, etc.

Component Overview

Pipelines are composed of three main components:

  • Start Operations (DataRetrievers)
  • Operations
  • Data

Start Operations

Start operations (or data retrievers) are operations which retrieve data. That is, start operations start the workflow and are the only nodes which do not require input data.
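
Since a start operation takes no inputs, its script only needs to produce output. The following is a minimal sketch under the execution conventions described later on this page; the output name data, the count attribute, and the use of Torch serialization are all assumptions for illustration.

-- init.lua for a hypothetical start operation (data retriever)
local torch = require 'torch'
local paths = require 'paths'
local attributes = require('./attributes')  -- see the Execution section below

local count = attributes.count or 100  -- hypothetical 'count' attribute
paths.mkdir('./output/data')           -- one directory per named output
-- A trivial retriever: generate the data rather than fetching it
torch.save('./output/data/data.t7', torch.randn(count, 10))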

Operations

Operations are a generic concept in the DeepForge pipeline. Operations are simply Lua scripts which, given the input data (and any specified attributes/references), perform some operation on the data and return a number of Lua objects (and, potentially, resource files).

Data

Data is visualized as connections in the pipeline. Data can have associated attributes, such as dimensionality. When a pipeline is executed, each data connection also references the output data of its source operation.

Execution

When DeepForge executes a pipeline, a snapshot of the pipeline is taken and executed.

Data

The associated data of a data node is expected to contain an init.lua file and any other files required by the init.lua file. The init.lua file should not define any globals.
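
For example, a data node's init.lua might look like the following minimal sketch (the payload is hypothetical); note that everything is local and the value is returned rather than assigned to a global:

-- init.lua for a hypothetical data node
local value = {
  classes = {'dog', 'cat', 'fish'},  -- example payload
  count = 3,
}
return value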

Operations

Operations produce a known (though operation-specific) number of outputs. If an operation has multiple return values, each return value must be named.

Operations execute in the following environment:

init.lua
attributes.lua
references.lua
res/
input/
output/

The input/ directory is populated with all incoming data, organized by connection name (which matches the incoming argument names). For example, an operation with two arguments, say a and b, would have the following structure:

input/a/init.lua
input/a/res.yml
input/a/res2.yml
input/b/init.lua
input/b/res.xml

In the above example, the values of a and b are retrieved by loading the respective init.lua files:

local a = require('./input/a')  -- requiring './input/a/init.lua'
local b = require('./input/b')  -- requiring './input/b/init.lua'

-- rest of the file...

The output/ directory is the target directory containing the files returned on the output connections displayed in the DeepForge UI (combined with the respective init.lua files associated with each return value). For example, an image classifier operation grouping images into the classes dog, cat, and fish might have the following structure for output/:

output/dog/dogs.t7
output/cat/cats.t7
output/fish/fish.t7

The operation will also specify init files for each of the outputs. When the operation completes, these init files and the contents of each directory are zipped and associated with the respective outgoing connection(s) in the DeepForge UI. When a subsequent operation uses these values, the zip files are unzipped to input/<ARG_NAME>, where <ARG_NAME> is the name of the argument expected by that operation.
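
For instance, if a downstream operation declared an argument named dogs and were connected to the dog output above, unzipping would (hypothetically) produce:

input/dogs/init.lua
input/dogs/dogs.t7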

The attributes.lua file returns a table associating the given operation node's attributes with their values. If the attribute is an asset, then the value for the key is a path to the asset in res/.
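
A sketch of what attributes.lua might return for a hypothetical operation with one plain attribute and one asset attribute (all names and values are illustrative):

-- attributes.lua: maps attribute names to their values
return {
  learningRate = 0.001,             -- plain attribute
  pretrained = './res/weights.t7',  -- asset attribute, resolved to a path in res/
}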

The references.lua file contains a table associating the operation node's pointer names with generated artifacts of the target of each pointer. For example, a training node may have a reference, say network, to the architecture it is using for training. In this case, references.lua will have a key network associated with the path to the Lua code that creates the given network.
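
Continuing that example, references.lua for such a training node might look like the following sketch (the path is an assumption):

-- references.lua: maps pointer names to generated artifacts
return {
  network = './res/network.lua',  -- Lua code that constructs the referenced architecture
}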

As mentioned above, the res/ directory contains any resources acquired from the resolution of an attribute or reference.

Combining all these parts: operations are executed by running init.lua with th, retrieve their input arguments from input/<ARG_NAME> (autogenerated from the model), serialize their outputs to output/<NAME>, and can access operation attributes and references using attributes.lua, references.lua, and res/.
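
Putting this together, the following is a minimal sketch of an operation's init.lua for the image-classifier example above. The input name images, the label field on each image, and the use of Torch serialization are all assumptions for illustration:

-- init.lua for a hypothetical classifier operation
local torch = require 'torch'
local paths = require 'paths'
local images = require('./input/images')  -- input arg 'images'

-- Group the inputs by their (hypothetical) label field
local groups = { dog = {}, cat = {}, fish = {} }
for _, image in ipairs(images) do
  table.insert(groups[image.label], image)
end

-- Serialize each group to its named output directory
for class, members in pairs(groups) do
  paths.mkdir('./output/' .. class)  -- output/ itself already exists
  torch.save('./output/' .. class .. '/' .. class .. '.t7', members)
end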

Custom Operations

When defining custom operations, the following must be specified:

  • input args (and types!)
  • output(s) (and types!)
    • Each output will use a standard init.lua for its return type; therefore, the operation must serialize its results in a standard way (see the sketch after this list).
  • init.lua script
    • The actual Lua code for the operation
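
As a sketch, such a standard init.lua for a Torch-serialized output might look like the following (the file name and the use of the paths package are assumptions). It locates the serialized file relative to itself, since it may be unzipped into any input/<ARG_NAME> directory:

-- standard init.lua generated for a serialized output (hypothetical)
local torch = require 'torch'
local paths = require 'paths'
local dir = paths.dirname(paths.thisfile())  -- the directory this file was unzipped into
return torch.load(dir .. '/dogs.t7')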

Additional Comments

Saving data for future pipelines

  • Special Operations allow data to be saved to DeepForge libraries for future reuse.
    • When data is saved, a 'data' node can be created with the required content stored. This node can then be referenced by data retrievers in other pipelines.
      • These data retrievers would simply return the value stored by their target reference.
    • This would effectively give us a directory of all the models we have trained.