“This ain’t my first rodeo.” (American slang for “I’m prepared for what comes next.”)
The DerivativeRodeo "moves" files from one storage location (e.g. input) to one or more storage locations (e.g. output) via a generator.
- Storage Location :: where we can expect to find a file.
- Generator :: a process to transform a file into another file.
In the case of an input storage location (e.g. input_location), we expect that the underlying file pointed at by the input storage location exists. After all, we can't move what we don't have.
In the case of an output storage location (e.g. output_location), we expect that the underlying file will exist after the generator has completed. The file at the output storage location could already exist, or we might need to generate it.
There is also the concept of the pre_processed storage location: when a pre_processed file exists for the given input, we copy that pre_processed file to the output location and skip running the derivative generator on the input storage location. In other words, if we've already done the derivation elsewhere, use that.
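The decision described above can be sketched as follows. This is a minimal illustration and not the gem's implementation: ensure_output and its keyword arguments are hypothetical names, local paths stand in for storage locations, the block stands in for a generator, and the order of the first two checks is an assumption.

```ruby
require "fileutils"
require "tmpdir"

# Hypothetical helper (not part of the DerivativeRodeo API). Ensures the
# output file exists, preferring an existing or pre-processed copy over
# running the generator. The block plays the role of the generator.
def ensure_output(input_path:, output_path:, preprocessed_path: nil)
  # 1. Nothing to do when the output already exists.
  return :already_exists if File.exist?(output_path)

  FileUtils.mkdir_p(File.dirname(output_path))

  # 2. Reuse a derivative that was already produced elsewhere.
  if preprocessed_path && File.exist?(preprocessed_path)
    FileUtils.cp(preprocessed_path, output_path)
    return :copied_from_preprocessed
  end

  # 3. Otherwise derive the output from the input, which must exist.
  raise ArgumentError, "missing input: #{input_path}" unless File.exist?(input_path)

  File.write(output_path, yield(File.read(input_path)))
  :generated
end
```

The returned symbol only reports which branch ran; a caller could ignore it.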
During the generator's process, we need to have a working copy of both the input and output file. This is done by creating a temporary file.
In the case of the input, the creation of that temporary file involves getting the file from the input storage location. In the case of the output, we create a temporary file that the output storage location then knows how to move to the resulting place.
The above Storage Lifecycle diagram is as follows: input location to input tmp file to generator to output tmp file to output location.
Note: We've designed and implemented the data life cycle to automatically clean up the temporary files as the generator completes. In this way we can use the smallest working space possible, a design decision that helps run DerivativeRodeo within distributed clusters (e.g. AWS Serverless).
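As a rough sketch of that life cycle, the following uses Ruby's standard Tempfile so that both working copies are removed as soon as the work completes. The method name move_via_generator is hypothetical, local paths stand in for storage locations, and the block stands in for a generator; none of this is the gem's own code.

```ruby
require "tempfile"
require "fileutils"
require "tmpdir"

# Hypothetical sketch: input location -> input tmp file -> generator ->
# output tmp file -> output location. The block receives the two tmp paths
# and is expected to write its result to the second one.
def move_via_generator(input_path, output_path)
  # Tempfile.create with a block closes and removes the file on block exit.
  Tempfile.create("input") do |input_tmp|
    # input location -> input tmp file
    IO.copy_stream(input_path, input_tmp)
    input_tmp.flush

    Tempfile.create("output") do |output_tmp|
      # generator: reads the input tmp file, writes the output tmp file
      yield(input_tmp.path, output_tmp.path)

      # output tmp file -> output location
      FileUtils.mkdir_p(File.dirname(output_path))
      FileUtils.cp(output_tmp.path, output_path)
    end
  end
  output_path
end
```

Because each temporary file lives only inside its block, the working space never holds more than one input copy and one output copy at a time.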
The PlantUML Text for the Overview Diagram
@startuml
!theme amiga
cloud "Source 1" as S1
cloud "Source 2" as S2
cloud "Source 3" as S3
storage "IMAGEs" as IMAGEs
storage "HOCRs" as HOCRs
storage "TXTs" as TXTs
control Preprocess as G1
S1 -down-> G1
S2 -down-> G1
S3 -down-> G1
G1 -down-> IMAGEs
G1 -down-> HOCRs
G1 -down-> TXTs
control Import as I1
IMAGEs -down-> I1
HOCRs -down-> I1
TXTs -down-> I1
package FileSet as FileSet1 {
file Image1
file Hocr1
file Txt1
}
package FileSet as FileSet2 {
file Image2
file Hocr2
file Txt2
}
I1 -down-> FileSet1
I1 -down-> FileSet2
@enduml
In this case, common storage could mean the storage where we're writing all pre-processing of files. Or it could mean the storage where we're writing for application access (e.g. Fedora Commons for a Hyrax application).
In other words, the DerivativeRodeo is part of moving files from one location to another, and ensuring that at each step we have all of the expected files we want.
This is not strictly related to Hyrax's FileSet; that is, a set of files in which one is considered the original and all others are derivatives of the original.
However, it is helpful to think in those terms: files that have a significant relation to each other, one derived from the other. For example, an original PDF and its extracted text would be two significantly related files.
The PlantUML Text for the Sequence Diagram
@startuml
!theme amiga
actor Instigator
database S3
control AWS
queue SQS
control SpaceStone
control DerivativeRodeo
collections From
collections To
Instigator -> S3 : "Upload bucket\nof files associated\n with FileSet"
S3 -> AWS : "AWS enqueues\nthe bucket"
AWS -> SQS : "AWS adds to SQS"
SQS -> SpaceStone : "SQS invokes\nSpaceStone method"
SpaceStone -> DerivativeRodeo : "SpaceStone calls\n DerivativeRodeo"
DerivativeRodeo --> S3 : "Request file for\ntemporary processing"
S3 --> From : "Write requested\n file to\ntemporary storage"
DerivativeRodeo <-- From
DerivativeRodeo -> To : "Generate derivative\n writing to local\n processing storage."
To --> S3 : "Write file\n to S3 Bucket"
DerivativeRodeo <-- To : "Return to DerivativeRodeo\n with generated URIs"
SpaceStone <- DerivativeRodeo : "Return generated\n URIs"
SpaceStone -> SQS : "Optionally enqueue\nfurther work"
@enduml
Given a single original file in a previous home, we are copying that original file (and derivatives) to various locations:
- From previous home to S3.
- From S3 to local temporary storage (for processing).
- Create a derivative temporary file based on the existing file.
- Copy the derivative temporary file to S3.
Add this line to your application's Gemfile:
gem 'derivative-rodeo'
For historical reasons the gem name is derivative-rodeo even though the repository is derivative_rodeo. The following "require" statements will all work:
require 'derivative_rodeo'
require 'derivative-rodeo'
require 'derivative/rodeo'
And then execute: $ bundle install
Be aware that you need the pdfinfo command line tool installed to run this gem's specs or to use its PDF functionality.
TODO
Generators are responsible for ensuring that we have the file associated with the generator. For example, the HocrGenerator is responsible for ensuring that we have the .hocr file in the expected storage location.
Generators must have an initializer and build command:
- .new(array_of_file_urls, output_location_template, preprocessed_location_template)
- #generated_files (executes the generator's actions) and returns an array of files
- #generated_uris (executes the generator's actions) and returns an array of output URIs
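To make the shape of that interface concrete, here is a minimal sketch of a class with the same three entry points. SketchGenerator, its GeneratedFile struct, and the "{basename}" placeholder are all hypothetical: the placeholder is not the gem's template syntax, and the sketch only computes URIs rather than producing real derivatives.

```ruby
# Hypothetical class illustrating the generator interface; it is not one of
# the gem's generators and does not inherit from any of the gem's classes.
class SketchGenerator
  # Hypothetical value object pairing an input URI with its output URI.
  GeneratedFile = Struct.new(:input_uri, :output_uri)

  def initialize(input_uris, output_location_template, preprocessed_location_template)
    @input_uris = input_uris
    @output_location_template = output_location_template
    # Stored for parity with the documented signature; unused in this sketch.
    @preprocessed_location_template = preprocessed_location_template
  end

  # Runs the (stand-in) generator action once and memoizes the result.
  def generated_files
    @generated_files ||= @input_uris.map do |input_uri|
      basename = File.basename(input_uri, ".*")
      # "{basename}" is an assumed placeholder, not the gem's template syntax.
      GeneratedFile.new(input_uri, @output_location_template.sub("{basename}", basename))
    end
  end

  # Returns only the output URIs of the generated files.
  def generated_uris
    generated_files.map(&:output_uri)
  end
end
```

Memoizing generated_files means that calling both methods runs the generator's actions only once, which is one reasonable reading of the two-method contract.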
Below is the current list of generators.
- HocrGenerator :: generates tesseract files from images; also creates monochrome files as a pre-step
- MonochromeGenerator :: converts images to monochrome
- CopyGenerator :: sends a set of URIs to another location, for example from S3 to SQS or from the filesystem to S3.
- PdfSplitGenerator :: split a PDF into one image per page
- WordCoordinatesGenerator :: create a JSON file representing the words and coordinates (derived from the .hocr file).
TODO: We want to expose a list of registered generators
Storage locations are where we put things. Each location has a specific implementation but is expected to inherit from DerivativeRodeo::StorageLocation::BaseLocation. The DerivativeRodeo::StorageLocation::BaseLocation.locations method tracks the registered locations.
The location represents where the file should be.
Storage locations follow a URI pattern:
- file:// :: “local” file system storage
- s3:// :: AWS’s S3 storage system
- sqs:// :: AWS’s SQS
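One way to picture how a URI selects a storage location is to dispatch on its scheme. The sketch below uses Ruby's standard URI library; location_kind_for and the symbols it returns are hypothetical and do not reflect how the gem registers or looks up its location classes.

```ruby
require "uri"

# Hypothetical helper: maps a location URI to a symbol naming the kind of
# storage it points at, based only on the URI scheme.
def location_kind_for(uri_string)
  scheme = URI.parse(uri_string).scheme
  case scheme
  when "file" then :file # "local" file system storage
  when "s3"   then :s3   # AWS's S3 storage system
  when "sqs"  then :sqs  # AWS's SQS
  else
    raise ArgumentError, "no storage location known for scheme #{scheme.inspect}"
  end
end
```

For example, location_kind_for("file:///tmp/GUID/file.jpg") selects the local file system branch, while an unrecognized scheme raises an ArgumentError.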
Throughout the code you'll see references to the following concepts:
- input_location_template
- output_location_template
- preprocessed_location_template
In Process Life Cycle we discussed the input_location, output_location, and preprocessed_location. The concept of the template provides flexibility in mapping one location to another.
Examples of mapping one file path to another are:
- I want to copy https://hello.com/world/GUID/file.jpg to file:///tmp/GUID/file.jpg.
- I want to transform file:///tmp/GUID/file.jpg to file:///tmp/GUID/file.hocr; that is, run OCR on an image and write a .hocr file.
- I want to use the file:///tmp/GUID/file.hocr to generate a file:///tmp/GUID/file.coordinates.json; that is, convert the HOCR file to a coordinates.json file.
See DerivativeRodeo::Service::ConvertUriViaTemplateService for more details.
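The three mappings above can be reproduced with a small helper. This is only an illustration of the idea of mapping one location to another: map_location, to_root, and extension are hypothetical names, and the helper does not use or imitate the template syntax of the service mentioned above.

```ruby
# Hypothetical helper: keeps the last directory (the GUID) and the base name
# of the source URI, then swaps in a new root and a new extension.
def map_location(from_uri, to_root:, extension:)
  parts = from_uri.split("/")
  guid = parts[-2]
  # File.basename with ".*" strips only the final extension.
  basename = File.basename(parts[-1], ".*")
  "#{to_root}/#{guid}/#{basename}.#{extension}"
end

# Copy a remote image to local storage.
map_location("https://hello.com/world/GUID/file.jpg", to_root: "file:///tmp", extension: "jpg")
# Name the .hocr file derived from the image.
map_location("file:///tmp/GUID/file.jpg", to_root: "file:///tmp", extension: "hocr")
# Name the coordinates file derived from the .hocr file.
map_location("file:///tmp/GUID/file.hocr", to_root: "file:///tmp", extension: "coordinates.json")
```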
- Check out the repository: git clone https://github.com/scientist-softserv/derivative_rodeo
- Install dependencies: cd derivative_rodeo; bundle install
- Install git hooks: rake install_hooks
- Install binaries:
  - pdfinfo: provided by poppler (e.g. brew install poppler)
  - GhostScript (e.g. gs): run brew install gs
Then go about writing your code and documentation.
The git hooks call rake default which will:
- Amend the table of contents of this file
- Run rubocop
- Validate yard documentation (see http://rubydoc.info/gems/yard/file/docs/Tags.md#List_of_Available_Tags for help correcting warnings)
- Run rspec with simplecov
Throughout the DerivativeRodeo we log some activity. For a typical test run, the full logs are overly chatty. If you want the more chatty logs, run the following: DEBUG=t rspec.
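A common way to get this behavior is to pick the log level from the environment. The sketch below is an assumption about how such a switch could be wired using Ruby's standard Logger; build_logger is a hypothetical name, the choice of WARN as the quiet level is a guess, and the gem's actual logging setup may differ.

```ruby
require "logger"

# Hypothetical helper: returns a logger that is verbose only when the DEBUG
# environment variable is set to a non-empty value (e.g. DEBUG=t).
def build_logger(io = $stderr, env = ENV)
  logger = Logger.new(io)
  debug = !env["DEBUG"].to_s.empty?
  # Assumed levels: DEBUG when requested, WARN otherwise.
  logger.level = debug ? Logger::DEBUG : Logger::WARN
  logger
end
```

Passing the environment in as an argument keeps the helper easy to exercise without changing the real process environment.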
Bug reports and pull requests are welcome on GitHub at https://github.com/scientist-softserv/derivative_rodeo.