Merge pull request #357 from OCR-D/update-guide
user guide: update and improve
lena-hinrichsen authored Jun 30, 2023
2 parents 6952d0d + 3842c6f commit d90e24e
Showing 2 changed files with 355 additions and 340 deletions.
145 changes: 85 additions & 60 deletions site/en/models.md
@@ -10,38 +10,39 @@ title: Models for OCR-D processors

OCR engines rely on pre-trained models for their recognition. Every engine has
its own internal format(s) for models. Some support central storage of models
at a specific location (tesseract, ocropy, kraken) while others require the full
path to a model (calamari).
at a specific location (Tesseract, Ocropy, Kraken) while others require the full
path to a model (Calamari).

Moreover, many processors provide other file resources like configuration files or presets.

Since [v2.22.0](https://github.com/OCR-D/core/releases/v2.22.0), OCR-D/core
comes with a framework for managing processor resources uniformly. This means
comes with a framework for managing **file resources** uniformly. This means
that processors can delegate to OCR-D/core to resolve specific file resources by name,
looking in well-defined places in the filesystem. This also includes downloading and caching
file parameters passed as a URL. Furthermore, OCR-D/core comes with a bundled database
of known resources, such as models, dictionaries, configurations and other
processor-specific data files. This means that OCR-D users should be able to
concentrate on fine-tuning their OCR workflows and not bother with implementation
details like "where do I get models from and where do I put them".
processor-specific data files. Processors can add their own specifications to that.

This means that OCR-D users should be able to concentrate on fine-tuning their OCR workflows
and not bother with implementation details like "where do I get models from and where do I put them".
In particular, users can reference file parameters by name now.

All of the above mentioned functionality can be accessed using the `ocrd
resmgr` command line tool.
All of the above mentioned functionality can be accessed using the `ocrd resmgr` command line tool.

## What models are available?

To get a list of the resources that the OCR-D/core [is aware
of](https://github.com/OCR-D/core/blob/master/ocrd/ocrd/resource_list.yml):
To get a list of the (available or installed) file resources that OCR-D/core
[is aware of](https://github.com/OCR-D/core/blob/master/ocrd/ocrd/resource_list.yml):

```
ocrd resmgr list-available
# alternatively, using Docker:
mkdir -p $PWD/models/ocrd-tesserocr-recognize
docker run --volume $PWD:/data --volume $PWD/models:/usr/local/share -w /data -- ocrd/all:maximum ocrd resmgr list-available```
docker run --volume ocrd-models:/models -- ocrd/all:maximum ocrd resmgr list-available
```

The output will look similar to this:

```
ocrd-calamari-recognize
- qurator-gt4hist-0.3 (https://qurator-data.de/calamari-models/GT4HistOCR/2019-07-22T15_49+0200/model.tar.xz)
Calamari model trained with GT4HistOCR
@@ -64,54 +65,64 @@ The second line of each entry contains a short description of the resource.

## Installing resources

On installing resources in OCR-D, read the sections [Installing known resources](#installing-known-resources) and [Installing unknown resources](#installing-unknown-resources).
On installing resources in OCR-D, read the follow-up sections
[Installing known resources](#installing-known-resources) and
[Installing unknown resources](#installing-unknown-resources).

*Known* resources are resources that are provided by processor developers [in the `ocrd-tool.json`](/en/spec/ocrd_tool#file-parameters)
and are available by name to `ocrd resmgr download`.

*Known resources* are resources that are provided by processor developers [in the `ocrd-tool.json`](/en/spec/ocrd_tool#file-parameters) and are available by name to `ocrd resmgr download`, whereas *unknown* resources are models, configurations, parameter sets etc. you provide yourself or found elsewhere on the Internet, which require passing a URL to `ocrd resmgr download`.
*Unknown* resources, in contrast, are models, configurations, parameter sets etc. that you provide yourself
or found elsewhere on the Internet, which require passing a URL (or local path) to `ocrd resmgr download`.

**If you installed OCR-D via Docker,** read the section [Models and Docker](#models-and-docker) *additionally*.

### Installing known resources

You can install resources with the `ocrd resmgr download` command. It expects
the name of the processor as the first argument and either the name or URL of a
resource as a second argument.
the name of the processor as the 1st argument and the name of a resource as a 2nd argument.

Although model distribution is not currently centralised within OCR-D, we
are working towards a central model repository.
Since model distribution is decentralised within OCR-D, every processor can advertise its
own known resources, which the resource manager then picks up.

For example, to install the `LatinHist.pyrnn.gz` resource for `ocrd-cis-ocropy-recognize`:

```
ocrd resmgr download ocrd-cis-ocropy-recognize LatinHist.pyrnn.gz
# or
ocrd resmgr download ocrd-cis-ocropy-recognize https://github.com/chreul/OCR_Testdata_EarlyPrintedBooks/raw/master/LatinHist-98000.pyrnn.gz
```

This will look up the resource in the [bundled resource and user databases](#user-database), download,
unarchive (where applicable) and store it in the [proper location](#where-is-the-data).


**NOTE:** The special name `*` can be used instead of a resource name/url to
download *all* known resources for this processor. To download all tesseract models:
> **Note**: The special name `*` can be used instead of a resource name/url to
> download *all* known resources for this processor. To download all tesseract models:
```sh
ocrd resmgr download ocrd-tesserocr-recognize '*'
```

**NOTE:** Equally, the special processor `*` can be used instead of a processor and a resource
to download *all* known resources for *all* installed processors:
> **Note**: Equally, the special processor `*` can be used instead of a processor and a resource
> to download *all* known resources for *all* installed processors:
```sh
ocrd resmgr download '*'
```

(In either case, `*` must be in quotes or escaped to avoid wildcard expansion by the shell.)
> (In either case, `*` must be in quotes or escaped to avoid wildcard expansion by the shell.)
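
The quoting requirement in the note above can be checked without OCR-D installed. The following is a minimal sketch: `show_args` is a hypothetical stand-in for `ocrd resmgr download`, used only to show what the shell passes to a command when `*` is unquoted, quoted or escaped.

```sh
# show_args is a hypothetical stand-in for any command: it prints the number
# of arguments it received, followed by the first argument.
show_args() { printf '%s:%s\n' "$#" "$1"; }

# Scratch directory containing exactly two files.
demo_dir=$(mktemp -d)
touch "$demo_dir/a.txt" "$demo_dir/b.txt"

# Unquoted: the shell expands * to the file names before the command runs.
unquoted=$(cd "$demo_dir" && show_args *)    # 2:a.txt
# Quoted or escaped: the literal asterisk reaches the command.
quoted=$(cd "$demo_dir" && show_args '*')    # 1:*
escaped=$(cd "$demo_dir" && show_args \*)    # 1:*

rm -r "$demo_dir"
printf '%s\n' "$unquoted" "$quoted" "$escaped"
```

In a directory that happens to contain files, an unquoted `*` would therefore hand file names to the resource manager instead of the wildcard.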
### Installing unknown resources

If you need to install a resource which OCR-D doesn't know of, that can be achieved by passings its URL in combination with the `--any-url/-n` flag to `ocrd resmgr download`:
If you need to install a resource which OCR-D does not know of, that can be achieved by passing
its URL in combination with the `--any-url/-n` flag to `ocrd resmgr download`.

To install a model for `ocrd-tesserocr-recognize` that is located at `https://my-server/mymodel.traineddata`.
For example, to install the same model for `ocrd-cis-ocropy-recognize` as above:

```
ocrd resmgr download -n https://github.com/chreul/OCR_Testdata_EarlyPrintedBooks/raw/master/LatinHist-98000.pyrnn.gz ocrd-cis-ocropy-recognize LatinHist.pyrnn.gz
```

Or to install a model for `ocrd-tesserocr-recognize` that is located at `https://my-server/mymodel.traineddata`:

```
ocrd resmgr download -n https://my-server/mymodel.traineddata ocrd-tesserocr-recognize mymodel.traineddata
@@ -126,33 +137,44 @@ ocrd-tesserocr-recognize -P model mymodel

### Models and Docker

If you are using OCR-D with Docker, we recommend keeping all downloaded resources in a persistent host directory,
separate of the OCR-D Docker container(s) and data directory, and mounting that
resource directory into a specific path in the container alongside the data directory.
The host resource directory can be empty initially. Each time you run the Docker container,
your processors will access the host directory to resolve resources, and you can download
additional models into that location using `ocrd resmgr`.
If you are using OCR-D with Docker, we recommend keeping all downloaded resources **persistently**
in a host directory, independent of both:
- the Docker container's internal storage (which is transient, i.e. any change over the image
gets lost with each new `docker run`),
- the host's data directory (which may be on a different filesystem).

That resource directory needs to be mounted into a specific path in the container, as does the data directory:
- `/models`: resource files (to be mounted as a **named volume**, e.g. `-v ocrd-models:/models`),
- `/data`: input/output files (to be mounted any way you like, probably a **bind mount**, e.g. `-v $PWD:/data`),
- `/tmp`: temporary files (ideally as **tmpfs**, e.g. `--tmpfs /tmp`)

Initially, (if you use a named volume, not a bind mount,) the host resource directory will contain only
those resources that have been **pre-installed** into the processors' module directories. Each time you run
the Docker container, the Resource Manager and the processors will access that directory from the inside
to resolve resources, so you can **download additional** models into that location using `ocrd resmgr`, and
later **use them** in workflows.

The following will assume (without loss of generality) that your host-side data
path is under `./data`, and the host-side resource path is under `./models`:
path is under `./data`, and the host-side volume is called `ocrd-models`:

To download models to `./models` in the host FS and `/usr/local/share/ocrd-resources` in the container FS:
To download models to `ocrd-models` in the host FS and `/models` in the container FS:

```sh
docker run --user $(id -u) \
--volume $PWD/models:/usr/local/share/ocrd-resources \
ocrd/all \
ocrd resmgr download ocrd-tesserocr-recognize eng.traineddata\; \
ocrd resmgr download ocrd-calamari-recognize default\; \
...
--volume ocrd-models:/models \
ocrd/all \
ocrd resmgr download ocrd-tesserocr-recognize eng.traineddata\; \
ocrd resmgr download ocrd-calamari-recognize default\; \
...
```

To run processors, as usual do:
To run processors, then as usual do:

```sh
docker run --user $(id -u) --workdir /data \
docker run --user $(id -u) \
--tmpfs /tmp \
--volume $PWD/data:/data \
--volume $PWD/models:/usr/local/share/ocrd-resources \
--volume ocrd-models:/models \
ocrd/all ocrd-tesserocr-recognize -I IN -O OUT -P model eng
```

@@ -166,37 +188,42 @@ resources and lists URL and description if a database entry exists.

## User database

Whenever the OCR-D/core resource manager encounters an unknown resource in the filesystem or when you install
a resource with `ocrd resmgr download`, it will create a new stub entry in the user database, which is found at
`$HOME/.config/ocrd/resources.yml` and created if it doesn't exist.
Whenever the OCR-D/core resource manager encounters an unknown resource in the filesystem, or when you install
a resource with `ocrd resmgr download`, it will add a new stub entry in the user database, which is found at
`$XDG_CONFIG_HOME/ocrd/resources.yml` (where `$XDG_CONFIG_HOME` defaults to `$HOME/.config` if unset) and
gets created if it does not exist.
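
To see where the user database would live on a given machine, the default described above can be reproduced in the shell. This is only a sketch: `ocrd_user_db_path` is a hypothetical helper, not part of OCR-D, and it treats an empty `$XDG_CONFIG_HOME` the same as an unset one.

```sh
# Hypothetical helper (not an OCR-D command): print the expected location
# of the user database, falling back to $HOME/.config when
# XDG_CONFIG_HOME is unset or empty.
ocrd_user_db_path() {
    printf '%s\n' "${XDG_CONFIG_HOME:-$HOME/.config}/ocrd/resources.yml"
}

db=$(ocrd_user_db_path)
if [ -f "$db" ]; then
    echo "user database found: $db"
else
    echo "no user database yet at: $db"
fi
```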

This allows you to use the OCR-D/core resource manager mechanics, including
lookup of known resources by name or URL, without relying (only) on the
database maintained by the OCR-D/core developers.

**NOTE:** If you produced or found resources that are interesting for the wider
OCR(-D) community, please tell us in the [OCR-D gitter
chat](https://gitter.im/OCR-D/Lobby) so we can add it to the database.
> **Note**: If you produced or found resources that are interesting for the wider
> OCR(-D) community, please tell us in the [OCR-D gitter chat](https://gitter.im/OCR-D/Lobby)
> or open an issue in the respective Github repository, so we can add it to the database.
## Where is the data

The lookup algorithm is [defined in our specifications](https://ocr-d.de/en/spec/ocrd_tool#file-parameters)

In order of preference, a resource `<name>` for a processor `ocrd-foo` is searched at:

* `$PWD/ocrd-resources/ocrd-foo/<name>`
* `$PWD/<name>`
* `$XDG_DATA_HOME/ocrd-resources/ocrd-foo/<name>`
* `/usr/local/share/ocrd-resources/ocrd-foo/<name>`
* `$VIRTUAL_ENV/lib/python3.6/site-packages/ocrd-foo/<name>` or `$VIRTUAL_ENV/share/ocrd-foo/<name>`
* `/usr/local/share/ocrd-resources/ocrd-foo/<name>`
* `$VIRTUAL_ENV/lib/python3.6/site-packages/ocrd-foo/<name>` or `$VIRTUAL_ENV/share/ocrd-foo/<name>`
(or whatever the processor's internal module location is)

(where `XDG_DATA_HOME` defaults to `$HOME/.local/share` if unset).
(where `$XDG_DATA_HOME` defaults to `$HOME/.local/share` if unset).

We recommend using the `$XDG_DATA_HOME` location, which is also the default. But
you can override the location to store data with the `--location` option, which can
be `cwd`, `data`, `system` and `module` resp.
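
As a rough illustration of how the `--location` keywords relate to the directories listed above, the mapping can be written as a small shell function. This is a sketch under assumptions: `ocrd_location_dir` is a hypothetical helper that mirrors the updated list in this guide (with `cwd` meaning `$PWD` itself), it is not how OCR-D/core implements the lookup, and it leaves out `module` because that path depends on the processor.

```sh
# Hypothetical helper (not part of OCR-D): print the directory in which a
# resource for processor $2 would be stored for the location keyword $1.
ocrd_location_dir() {
    case $1 in
        cwd)    printf '%s\n' "$PWD" ;;
        data)   printf '%s\n' "${XDG_DATA_HOME:-$HOME/.local/share}/ocrd-resources/$2" ;;
        system) printf '%s\n' "/usr/local/share/ocrd-resources/$2" ;;
        *)      echo "location '$1' is processor-specific or unknown" >&2
                return 1 ;;
    esac
}

# Example: where would the default (data) location for ocrd-foo be?
ocrd_location_dir data ocrd-foo
```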

In Docker though, `$XDG_CONFIG_HOME=$XDG_DATA_HOME/ocrd-resources=/usr/local/share/ocrd-resources`
gets symlinked to `/models` for easier volume handling (and persistency).

```sh
# will download to $PWD/ocrd-resources/ocrd-anybaseocr-dewarp/latest_net_G.pth
# will download to $PWD/latest_net_G.pth
ocrd resmgr download --location cwd ocrd-anybaseocr-dewarp latest_net_G.pth
# will download to /usr/local/share/ocrd-resources/ocrd-anybaseocr-dewarp/latest_net_G.pth
ocrd resmgr download --location system ocrd-anybaseocr-dewarp latest_net_G.pth
@@ -228,7 +255,7 @@ To use a specific model with OCR-D's ocropus wrapper in
ocrd-cis-ocropy-recognize -I OCR-D-SEG-LINE -O OCR-D-OCR-OCRO -P model fraktur-jze.pyrnn.gz
```

**NOTE:** Model must be downloade before with
> **Note**: The model must have been downloaded before with
```sh
ocrd resmgr download ocrd-cis-ocropy-recognize fraktur-jze.pyrnn.gz
@@ -252,8 +279,6 @@ the `ocrd-calamari-recognize` processor, use the `checkpoint_dir` parameter:
ocrd-calamari-recognize -I OCR-D-SEG-LINE -O OCR-D-OCR-CALA
# To use your own trained model
ocrd-calamari-recognize -I OCR-D-SEG-LINE -O OCR-D-OCR-CALA -P checkpoint_dir /path/to/modeldir
# or, to be able to control which checkpoints to use:
ocrd-calamari-recognize -I OCR-D-SEG-LINE -O OCR-D-OCR-CALA -P checkpoint '/path/to/modeldir/*.ckpt.json'
```

## Tesseract / ocrd_tesserocr
@@ -262,7 +287,7 @@ Tesseract models are single files with a `.traineddata` extension.

Since Tesseract only supports model lookup in a single directory,
and we want to share the tessdata directory with the standalone CLI,
ocrd_tesserocr resources must be stored in the `module` location.
`ocrd_tesserocr` resources must be stored in the `module` location.
If the default path of that location is not the place you want to use for Tesseract models,
then either recompile Tesseract with the `tessdata` path you had in mind,
or use the `TESSDATA_PREFIX` environment variable to override the `module` location at runtime.
@@ -307,4 +332,4 @@ you will still have to install it, first. For information on the setup and the t

# Further reading

If you just installed OCR-D and want to know how to process your own data, please see the [user guide](/en/user_guide).
If you just installed OCR-D and want to know how to process your own data, please see the [user guide](/en/user_guide).