Processing tasks in the ecosystem of the OCR-D project.
A REST API to define workflows of OCR-D processors and to run tasks with them.
- Free software: MIT License
- This software is still in alpha state, so don't expect it to work properly. Support is currently not guaranteed.
We rely on the excellent installation repository ocrd_all. Please check it out for installation instructions.
Installation is currently tested on Debian 10 and Ubuntu 18.04. Be aware that on more recent systems with Python >= 3.8 there is currently a problem installing tensorflow==1.15.x, so you have to use Python 3.7 at most.
Installation for development:
Install Redis Server (needed as backend for Celery and Flower)
user@server:/ > sudo apt install redis
user@server:/ > sudo service redis start
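To verify that Redis is up, ping it with redis-cli (installed alongside the server package):
user@server:/ > redis-cli ping
PONG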
Follow the installation for ocrd_all
/home/ocrd > git clone --recurse-submodules https://github.com/OCR-D/ocrd_all.git && cd ocrd_all
/home/ocrd/ocrd_all > make all
... -> download appropriate modules...
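If your system Python is newer than 3.7 (see the note above), you may be able to point the build at an older interpreter. Whether ocrd_all's Makefile honors a PYTHON variable is an assumption here, so check its documentation first:
/home/ocrd/ocrd_all > python3 --version
/home/ocrd/ocrd_all > make all PYTHON=python3.7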
Install German language files for Tesseract OCR:
user@server:/ > sudo apt install tesseract-ocr-deu
Install ocrd-butler in the virtual environment created by ocrd_all:
/home/ocrd > git clone https://github.com/StaatsbibliothekBerlin/ocrd_butler.git && cd ocrd_butler
/home/ocrd/ocrd_butler > source ../ocrd_all/venv/bin/activate
(venv) /home/ocrd/ocrd_butler > pip install -e .[dev]
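As a quick sanity check, the package should now be importable from the activated virtual environment:
(venv) /home/ocrd/ocrd_butler > python -c "import ocrd_butler"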
For some modules in ocrd_all, further files are necessary, e.g. trained models for the OCR itself. The folders on the server can be overridden in every single task.
sbb_textline_detector (i.e. make textline-detector-model):
> mkdir -p /data && cd /data; \
> ocrd resmgr download ocrd-sbb-textline-detector default -al cwd
ocrd_calamari (i.e. make calamari-model):
> mkdir -p /data && cd /data; \
> ocrd resmgr download ocrd-calamari-recognize qurator-gt4histocr-1.0 -al cwd
ocrd_tesserocr (i.e. make tesseract-model):
> mkdir -p /data/tesseract_models && cd /data/tesseract_models
> wget https://qurator-data.de/tesseract-models/GT4HistOCR/models.tar
> tar xf models.tar
> cp GT4HistOCR_2000000.traineddata /usr/share/tesseract-ocr/4.00/tessdata/
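To verify that Tesseract picks up the new model, list the available languages; GT4HistOCR_2000000 should appear in the output:
> TESSDATA_PREFIX=/usr/share/tesseract-ocr/4.00/tessdata tesseract --list-langs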
ocrd-sbb-binarize (i.e. make sbb-binarize-model):
> mkdir -p /data && cd /data; \
> ocrd resmgr download ocrd-sbb-binarize default -al cwd
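In the calls above, -al cwd is short for --allow-uninstalled --location cwd, i.e. the resources end up in the current working directory (/data). To review which resources are already present, ocrd resmgr offers a listing command:
> ocrd resmgr list-installed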
Start the Celery worker (i.e. make run-celery):
╰─$ TESSDATA_PREFIX=/usr/share/tesseract-ocr/4.00/tessdata celery worker -A ocrd_butler.celery_worker.celery -E -l info
Start the Flower monitor (i.e. make run-flower):
╰─$ flower --broker redis://localhost:6379 --persistent=True --db=flower [--log=debug --url_prefix=flower]
Flower monitor: http://localhost:5555
Run the app (i.e. make run-flask):
╰─$ FLASK_APP=ocrd_butler/app.py flask run
Flask frontend: http://localhost:5000
Swagger interface: http://localhost:5000/api
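During development, Flask's debug mode with automatic reloading can be enabled via the standard FLASK_ENV variable (Flask 1.x):
╰─$ FLASK_ENV=development FLASK_APP=ocrd_butler/app.py flask run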
Run the tests:
╰─$ make test
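Individual tests can also be run with pytest directly; that pytest is part of the dev extras and that the tests live in a tests/ directory are assumptions here, so check the repository layout:
(venv) /home/ocrd/ocrd_butler > pytest tests/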
For API documentation, open the Swagger API user interface at /api/. A complete list of all routes mapped by the OCRD Butler application is available under the /api/_util/routes endpoint.
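For example, listing all routes with HTTPie:
╰─$ http :/api/_util/routes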
A Butler workflow consists of a name and one or more OCRD processor invocations.
Use the /api/workflows POST endpoint to create a new workflow (all examples given using HTTPie):
╰─$ http POST :/api/workflows < workflow.json
...where the content of workflow.json looks something like this:
{ "name": "binarize && segment to regions", "processors": [ { "name": "ocrd-olena-binarize", "input_file_grp": "DEFAULT", "output_file_grp": "OCR-D-IMG-BIN" }, { "name": "ocrd-tesserocr-segment-region", "input_file_grp": "OCR-D-IMG-BIN", "output_file_grp": "OCR-D-SEG-REGION" } ] }
The response body will contain the ID of the newly created workflow. Use this ID for retrieval of the newly created workflow:
╰─$ http :/api/workflows/1 # or whatever ID obtained in previous step
A Butler task is an invocation of a workflow with a specific METS file as its input. A task consists of at least the location of such a METS source file and a workflow ID. Use the /api/tasks POST endpoint to create a new task using an existing workflow:
╰─$ http POST :/api/tasks src=https://content.staatsbibliothek-berlin.de/dc/PPN718448162.mets.xml workflow_id=1
The response body will contain the ID of the newly created task.
In order to execute an existing Butler task, call the /api/tasks/{id}/run endpoint, with the placeholder replaced by the actual task ID obtained in the previous step:
╰─$ http POST :/api/tasks/1/run
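Tasks are executed by the Celery worker, so their progress can be watched in the Flower monitor at http://localhost:5555.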
ModuleNotFoundError: No module named 'tensorflow.contrib'
The tensorflow.contrib module was removed in TensorFlow 2.x, but some OCR-D modules still depend on TensorFlow 1.15. If you hit this error, reinstall TensorFlow 1.15 inside the virtual environment:
. venv/bin/activate
pip install --upgrade pip
pip uninstall tensorflow
pip install tensorflow-gpu==1.15.*
- Input and output file groups are not always those of the previous processor; support more complicated input/output file group scenarios. Check the information we get from ocrd-tool.json.
- dinglehopper: If there is ground truth data, it could be placed in a configured folder on the server, with the data as PAGE XML files inside a folder named after the work ID. Then we show a button to start a run against this data. Otherwise we can search for all other tasks with the same work_id and present a UI to run against the chosen one.
- Use processor groups to be able to build forms presenting them.
- Check whether ocrd-olena-binarize fails when the METS file in a workspace has a name other than mets.xml.
- Refactor the ocrd_tool information collection to use https://ocr-d.de/en/spec/cli#-j---dump-json
This package was created with Cookiecutter and the elgertam/cookiecutter-pipenv project template, based on audreyr/cookiecutter-pypackage.