From 65918c659aa66fe925822b76d5b032dc447adc20 Mon Sep 17 00:00:00 2001 From: Robert Sachunsky <38561704+bertsky@users.noreply.github.com> Date: Thu, 22 Jun 2023 18:56:29 +0200 Subject: [PATCH 1/4] user guide: update and improve --- site/en/user_guide.md | 429 +++++++++++++++++++----------------------- 1 file changed, 197 insertions(+), 232 deletions(-) diff --git a/site/en/user_guide.md b/site/en/user_guide.md index 5edb99ad2..f491788e3 100644 --- a/site/en/user_guide.md +++ b/site/en/user_guide.md @@ -23,15 +23,14 @@ This guide always states native calls first and then provides the respective com ### Docker installation: -If you are using the Installation via Docker, we recommend to run: +If you are using the Installation via Docker, we recommend running an interactive shell session in the container: ```sh -mkdir -p $PWD/models/ocrd-tesserocr-recognize -docker run --user $(id -u) --workdir /data --volume $PWD:/data --volume $PWD/models:/usr/local/share/ocrd-resourcese $PWD/models/ocrd-tesserocr-recognize:/usr/local/share/tessdata --volume $PWD/models:/usr/local/share/ocrd-resources -it ocrd/all bash +docker run --user $(id -u) --volume $PWD:/data --volume ocrd-models:/usr/local/share/ocrd-resources -it ocrd/all bash ``` After spinning up the container, you can use the installation and call the processors the same way as in the native installation. @@ -63,13 +62,13 @@ your venv. ### Preparing a workspace -OCR-D processes digitized images in so-called workspaces, special directories -which contain the images to be processed and their corresponding METS file. Any -files generated while processing these images with the OCR-D-software will also -be stored in this directory. +OCR-D processes digitized images in so-called [workspaces](spec/glossary#workspace), +i.e. special directories which contain the images to be processed and their corresponding +METS file. 
Any files generated while processing these images with the OCR-D software +will also be stored in this directory. -How you prepare a workspace depends on whether you already have or don't have a -METS file with the paths to the images you want to process. For usage within +How you prepare a workspace depends on whether or not you already have a METS file +with the paths (or URLs) to the images you want to process. For usage within OCR-D your METS file should look similar to [this example](example_mets). #### Already existing METS @@ -78,12 +77,17 @@ If you already have a METS file as indicated above, you can create a workspace and load the pictures to be processed with the following command: ```sh -ocrd workspace -d /path/to/workspace clone URL_OF_METS -## alternatively using docker -docker run --rm -u $(id -u) -v $PWD:/data -w /data -- ocrd/all:maximum ocrd workspace clone -d /data URL_OF_METS +ocrd workspace [-d path/to/workspace] clone URL_OF_METS +## alternatively, using docker: +docker run --rm -u $(id -u) -v $PWD:/data -- ocrd/all:maximum ocrd workspace clone [-d path/to/your/workspace] URL_OF_METS ``` -This will create a directory (called workspace in OCR-D) with your specified name which contains your METS file. +(Where `path/to/your/workspace` is but an example. You can **omit +the directory argument** if you want to use the current working directory +as target. For repeated use, we recommend a `cd path/to/your/workspace` once, +so in subsequent operations, the argument can be omitted.) + +This will create a file `mets.xml` within the target directory. In most cases, METS files indicate several picture formats. For OCR-D you will only need one format. 
We strongly recommend using the format with the highest @@ -93,8 +97,8 @@ List all existing groups: ```sh ocrd workspace [-d /path/to/your/workspace] list-group -## alternatively using docker -docker run --rm -u $(id -u) -v $PWD:/data -w /data -- ocrd/all:maximum ocrd workspace -d /data list-group +## alternatively, using docker: +docker run --rm -u $(id -u) -v $PWD:/data -- ocrd/all:maximum ocrd workspace [-d path/to/your/workspace] list-group ``` This will provide you with the names of all the different file groups in your METS, e.g. THUMBNAILS, @@ -103,9 +107,9 @@ PRESENTATION, MAX. Download all files of one group: ```sh -ocrd workspace [-d /path/to/your/workspace] find --file-grp [selected file group] --download -## alternatively using docker -docker run --rm -u $(id -u) -v $PWD:/data -w /data -- ocrd/all:maximum ocrd workspace -d /data find --file-grp [selected file group] --download +ocrd workspace [-d path/to/your/workspace] find --file-grp [selected file group] --download +## alternatively, using docker: +docker run --rm -u $(id -u) -v $PWD:/data -- ocrd/all:maximum ocrd workspace [-d path/to/your/workspace] find --file-grp [selected file group] --download ``` This will download all images in the specified file group and save them in a folder named accordingly @@ -113,105 +117,118 @@ in your workspace. You are now ready to start processing your images with OCR-D. #### Non-existing METS -If you don't have a METS file or it doesn't suffice the OCR-D-requirements you -can generate it with the following commands. First, you have to create a +If you don't have a METS file or it does not comply with the OCR-D requirements, +then you can generate one with the following commands. 
First, create an empty workspace: ```sh -ocrd workspace [-d /path/to/your/workspace] init # omit `-d` for current directory -## alternatively using docker -mkdir -p [/path/to/your/workspace] -docker run --rm -u $(id -u) -v [/path/to/your/workspace]:/data -w /data -- ocrd/all:maximum ocrd workspace -d /data init +ocrd workspace [-d path/to/your/workspace] init +## alternatively, using docker: +docker run --rm -u $(id -u) -v $PWD:/data -- ocrd/all:maximum ocrd workspace [-d path/to/your/workspace] init ``` -This should create a directory (called workspace in OCR-D) which contains a `mets.xml`. +(Where `path/to/your/workspace` is but an example. You can **omit +the directory argument** if you want to use the current working directory +as target. For repeated use, we recommend a `cd path/to/your/workspace` once, +so in subsequent operations, the argument can be omitted.) -Then you can change into your workspace directory and set a unique ID +This will create a file `mets.xml` within the target directory. + +Then you can set a unique `mods:identifier` … ```sh -cd /path/to/your/workspace # if not already there -ocrd workspace set-id 'unique ID' -## alternatively using docker -docker run --rm -u $(id -u) -v [/path/to/your/workspace]:/data -w /data -- ocrd/all:maximum ocrd workspace set-id 'unique ID' +ocrd workspace [-d path/to/your/workspace] set-id 'unique ID' +## alternatively, using docker: +docker run --rm -u $(id -u) -v $PWD:/data -- ocrd/all:maximum ocrd workspace [-d path/to/your/workspace] set-id 'unique ID' ``` -and copy the folder containing your pictures to be processed into the workspace: +… and copy the directory containing the images to be processed into the workspace directory: ```sh -cp -r [/path/to/your/pictures] . +cp -r path/to/your/images [path/to/your/workspace/]. ``` -**Note:** All pictures must have the same format (tif, jpg, ...) - -In OCR-D we name the image folder OCR-D-IMG, which is used throughout the documentation. 
Naming your image folder differently is -certainly possible, but you should be aware that you need to adapt the name of the image folder if copy and paste the sample -calls provided on this website. - -You should now have a workspace which contains the aforementioned `mets.xml` and a folder OCR-D-IMG with your images. -Now you can add your pictures to the METS. When creating the workspace, a blank -METS file was created, too, to which you can add the pictures to be processed. +Now you can add images to the empty METS created above by adding +references for their path names. -You can do this manually with the following command: +You can do this in a number of ways. Either with the following simple command: ```sh -ocrd workspace add -g [ID of the physical page, has to start with a letter] -G [name of picture folder in your workspace] -i [ID of the scanned page, has to start with a letter] -m image/[format of your picture] [/path/to/your/picture/in/workspace] -## alternatively using docker -docker run --rm -u $(id -u) -v [/path/to/workspace]:/data -w /data -- ocrd/all:maximum ocrd workspace add -g [ID of the physical page, has to start with a letter] -G [name of picture folder in your workspace] -i [ID of the scanned page, has to start with a letter] -m image/[format of your picture] [relative/path/to/your/picture/in/workspace] +ocrd workspace [-d path/to/your/workspace] add -g {ID of the physical page (must start with a letter)} -G {name of image fileGrp} -i {ID of the image file (must start with a letter)} -m image/{MIME format of that image} {path/to/that/image/file/in/workspace} +## alternatively, using docker: +docker run --rm -u $(id -u) -v $PWD:/data -- ocrd/all:maximum ocrd workspace [-d path/to/your/workspace] add -g {ID of the physical page (must start with a letter)} -G {name of image fileGrp} -i {ID of the image file (must start with a letter)} -m image/{MIME format of that image} {path/to/that/image/file/in/workspace} ``` -Your command could e.g. 
look like this:
+For example, your simple commands could look like this:

```sh
-ocrd workspace add -g P_00001 -G OCR-D-IMG -i OCR-D-IMG_00001 -m image/tif OCR-D-IMG/00001.tif
-## alternatively using docker
-docker run --rm -u $(id -u) -v $PWD:/data -w /data -- ocrd/all:maximum ocrd workspace add -g P_00001 -G OCR-D-IMG -i OCR-D-IMG_00001 -m image/tif OCR-D-IMG/00001.tif
+ocrd workspace add -g P_00001 -G OCR-D-IMG -i OCR-D-IMG_00001 -m image/tiff images/00001.tif
+ocrd workspace add -g P_00002 -G OCR-D-IMG -i OCR-D-IMG_00002 -m image/tiff images/00002.tif
+...
+## alternatively, using docker:
+docker run --rm -u $(id -u) -v $PWD:/data -- ocrd/all:maximum ocrd workspace add -g P_00001 -G OCR-D-IMG -i OCR-D-IMG_00001 -m image/tiff images/00001.tif
+docker run --rm -u $(id -u) -v $PWD:/data -- ocrd/all:maximum ocrd workspace add -g P_00002 -G OCR-D-IMG -i OCR-D-IMG_00002 -m image/tiff images/00002.tif
```

-If you have many pictures to be added to the METS, you can do this automatically with a for-loop:
+Or, if you have lots of images to be added to the METS, you can do this automatically with a `for` loop:
+> **Note:** For this method, all images must have the same format (tiff, jpeg, ...)
```sh -FILEGRP="YOUR-FILEGRP-NAME" +FILEGRP="OCR-D-IMG" # name of fileGrp to use EXT=".tif" # the actual extension of the image files -MEDIATYPE='image/tiff' # the actual media type of the image files -## using local ocrd CLI -for i in /path/to/your/picture/folder/in/workspace/*$EXT; do - base=`basename ${i} $EXT`; - ocrd workspace add -G $FILEGRP -i ${FILEGRP}_${base} -g P_${base} -m $MEDIATYPE ${i}; -done -## alternatively using docker -for i in /path/to/your/picture/folder/in/workspace/*$EXT; do - base=`basename ${i} $EXT`; - docker run --rm -u $(id -u) -v $PWD:/data -w /data -- ocrd/all:maximum ocrd workspace add -G $FILEGRP -i ${FILEGRP}_${base} -g P_${base} -m $MEDIATYPE ${i}; +MEDIATYPE="image/tiff" # the actual MIME type of the images +cd path/to/your/workspace +for path in images/*$EXT; do + base=`basename $path $EXT`; + ## using local ocrd CLI: + ocrd workspace add -G $FILEGRP -i ${FILEGRP}_${base} -g P_$base -m $MEDIATYPE $path + ## alternatively, using docker: + docker run --rm -u $(id -u) -v $PWD:/data -- ocrd/all:maximum ocrd workspace add -G $FILEGRP -i ${FILEGRP}_${base} -g P_$base -m $MEDIATYPE $path done ``` -warning If the file names of the images start with a number, at least one of the following characters must be placed in front of its name for parameter 'i': [a-z,A-Z,_,-] (e.g.: 'OCR-D-IMG_\_') - -Your for-loop could e.g. 
look like this: +For example, your `for` loop could look like this: ```sh -for i in OCR-D-IMG/*.tif; do base=`basename ${i} .tif`; ocrd workspace add -G OCR-D-IMG -i OCR-D-IMG_${base} -g P_${base} -m image/tif ${i}; done -## alternatively using docker -for i in OCR-D-IMG/*.tif; do base=`basename ${i} .tif`;docker run --rm -u $(id -u) -v $PWD:/data -w /data -- ocrd/all:maximum ocrd workspace add -G OCR-D-IMG -i OCR-D-IMG_${base} -g P_${base} -m image/tif ${i}; done +for path in images/*.tif; do base=`basename $path .tif`; ocrd workspace add -G OCR-D-IMG -i OCR-D-IMG_$base -g P_$base -m image/tiff $path; done +## alternatively, using docker: +for path in images/*.tif; do base=`basename $path .tif`; docker run --rm -u $(id -u) -v $PWD:/data -- ocrd/all:maximum ocrd workspace add -G OCR-D-IMG -i OCR-D-IMG_$base -g P_$base -m image/tiff $path; done ``` -The log information should inform you about every image which was added to the `mets.xml`. -In the end, your METS file should look like this [example METS](example_mets). You are now ready to start processing your images with OCR-D. +The log information should inform you about every image which was added to the METS file. +In the end, your `mets.xml` should look like this [example METS](example_mets). +You are now ready to start processing your images with OCR-D. -Alternatively, `ocrd-import` from [workflow-configuration](#workflow-configuration) is a shell script which does all of the above (and can also convert arbitrary image formats) automatically. For usage options, see: +Finally, the shell script `ocrd-import` from [workflow-configuration](#workflow-configuration) +is a tool which does all of the above (and can also convert arbitrary image formats and extract from PDFs) +automatically. 
For usage options, see:

```sh
ocrd-import -h
```

-For example, to search for all files under `path/to/your/pictures/` recursively, and add all image files under file group `OCR-D-IMG`, keeping their filename stem as page ID, and converting all unaccepted image file formats like JPEG2000, XPS or PDF (the latter rendered to bitmap at 300 DPI) to TIFF on the fly, and also add any PAGE-XML file of the same filename stem under file group `OCR-D-SEG-PAGE`, while ignoring other files, and finally write everything to `path/to/your/pictures/mets.xml`, do:
+For example, to search for all files under `path/to/your/images/` recursively,
+and add all image files under fileGrp `OCR-D-IMG`, keeping their filename stem as page ID,
+while converting all unsupported image file formats like JPEG2000, XPS or PDF
+(the latter rendered to bitmap at 300 DPI) to TIFF on the fly,
+and also add any PAGE-XML file of the same filename stem under fileGrp `OCR-D-SEG-PAGE`,
+while ignoring other files, and finally write everything to `path/to/your/images/mets.xml`, do:

```sh
-ocrd-import --nonnum-ids --ignore --render 300 path/to/your/pictures
+ocrd-import --nonnum-ids --ignore --render 300 path/to/your/images
## alternatively using docker
-docker run --rm -u $(id -u) -v [/path/to/your/data]:/data -w /data -- ocrd/all:maximum ocrd-import -P -i -r 300 path/to/your/pictures
+docker run --rm -u $(id -u) -v $PWD:/data -- ocrd/all:maximum ocrd-import -P -i -r 300 path/to/your/images
```

+You should now have a workspace which contains the aforementioned `mets.xml` that has
+a fileGrp `OCR-D-IMG` referencing your local image files.
+
+> **Note**: In OCR-D, we typically name the image fileGrp `OCR-D-IMG`, which is used
+> throughout the documentation. Naming your image fileGrp differently is of course possible,
+> but you should be aware that you then need to adapt the name of the image or input fileGrp
+> when copying and pasting from the sample calls provided on this website.
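Before moving on to processing, it can help to verify that the workspace is consistent. A minimal sketch, assuming the `ocrd` CLI is on your PATH and you are inside the workspace directory (the guard degrades it to a log message elsewhere):

```sh
# check the workspace for missing files, broken fileGrp references etc.
if command -v ocrd >/dev/null 2>&1; then
  ocrd workspace validate || echo "workspace has validation issues"
else
  echo "ocrd CLI not found - skipping validation" | tee validate.log
fi
```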
+
+
## Using the OCR-D-processors

### OCR-D-Syntax

@@ -232,64 +249,61 @@ For some processors parameters are purely optional, other processors as e.g. `oc
### Calling a single processor
If you just want to call a single processor, e.g. for testing purposes, you can go into your workspace and use the following command:
```sh
-ocrd-[processor needed] -I [Input-Group] -O [Output-Group] -P [parameter]
-## alternatively using docker
-docker run --rm -u $(id -u) -v $PWD:/data -w /data -- ocrd/all:maximum ocrd-[processor needed] -I [Input-Group] -O [Output-Group] -P [parameter]'
+ocrd-{processor needed} -I {Input-Group} -O {Output-Group} [-p {parameter file}] [-P {parameter} {value}]
+## alternatively, using docker:
+docker run --rm -u $(id -u) -v $PWD:/data -- ocrd/all:maximum ocrd-{processor needed} -I {Input-Group} -O {Output-Group} [-p {parameter file}] [-P {parameter} {value}]
```

-Your command could e.g. look like this:
+For example, your processor call command could look like this:

```sh
ocrd-olena-binarize -I OCR-D-IMG -O OCR-D-BIN -P impl sauvola
-## alternatively using docker
-docker run --rm -u $(id -u) -v $PWD:/data -w /data -- ocrd/all:maximum ocrd-olena-binarize -I OCR-D-IMG -O OCR-D-BIN -P impl sauvola
+## alternatively, using docker:
+docker run --rm -u $(id -u) -v $PWD:/data -- ocrd/all:maximum ocrd-olena-binarize -I OCR-D-IMG -O OCR-D-BIN -P impl sauvola
```

-The specified processor will take the files in your Input-Group `-I`, process them and save the results in your Ouput-Group `-O`. It will also add
-the information about this processing step and its results to METS file in your workspace.
+The specified processor will read the files in fileGrp `Input-Group`,
+binarize them and write the results in fileGrp `Output-Group` in your workspace
+(i.e. both as files on the filesystem and referenced in the `mets.xml`).
+It will also add information about this processing step in the METS metadata.
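Instead of repeating `-P` key-value pairs on every call, the same settings can be kept in a JSON file and passed with `-p`. A sketch reusing the `impl` parameter from the example above (the `command -v` guard simply skips the call on systems where the processor is not installed):

```sh
# store the parameters once ...
cat > params.json <<'EOF'
{"impl": "sauvola"}
EOF
# ... then reference the file instead of individual -P pairs:
if command -v ocrd-olena-binarize >/dev/null 2>&1; then
  ocrd-olena-binarize -I OCR-D-IMG -O OCR-D-BIN -p params.json
fi
```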
-**Note:** For processors using multiple input-, or output groups you have to use a comma separated list. +> **Note:** For processors using multiple input- or output fileGrps you have to use a comma-separated list. E.g.: ```sh -ocrd-anybaseocr-crop -I OCR-D-IMG -O OCR-D-CROP,OCR-D-IMG-CROP -## alternatively using docker -docker run --rm -u $(id -u) -v $PWD:/data -w /data -- ocrd/all:maximum ocrd-anybaseocr-crop -I OCR-D-IMG -O OCR-D-CROP,OCR-D-IMG-CROP +ocrd-cor-asv-ann-align -I OCR-D-OCR1,OCR-D-OCR2,OCR-D-OCR3 -O OCR-D-OCR4 +## alternatively, using docker: +docker run --rm -u $(id -u) -v $PWD:/data -- ocrd/all:maximum ocrd-cor-asv-ann-align -I OCR-D-OCR1,OCR-D-OCR2,OCR-D-OCR3 -O OCR-D-OCR4 ``` -**Note:** If multiple parameter key-value pairs are necessary, each of them has to be preceded by `-P` - -E.g.: - +> **Note:** If multiple parameter key-value pairs are necessary, each of them has to be preceded by `-P` as in ```sh --P param1 value1 -P param2 value2 -P param3 value3 +... -P param1 value1 -P param2 value2 -P param3 value3 ``` -**Note:** If a value consists of several words with whitespaces, they have to be enclosed in quotation marks - -E.g.: - +> **Note:** If a value consists of several words with whitespaces, they have to be enclosed in quotation marks +> (to prevent the shell from splitting them up) as in ```sh -P param "value value" ``` ### Calling several processors -#### ocrd-process +#### ocrd process If you quickly want to specify a particular workflow on the CLI, you can use -ocrd-process, which has a similar syntax as calling single processors. +`ocrd process`, which has a similar syntax as calling single processor CLIs. 
```sh ocrd process \ - '[processor needed without prefix 'ocrd-'] -I [Input-Group] -O [Output-Group]' \ - '[processor needed without prefix 'ocrd-'] -I [Input-Group] -O [Output-Group] -P [parameter]' -## alternatively using docker -docker run --rm -u $(id -u) -v $PWD:/data -w /data -- ocrd/all:maximum ocrd process \ - '[processor needed without prefix 'ocrd-'] -I [Input-Group] -O [Output-Group]' \ - '[processor needed without prefix 'ocrd-'] -I [Input-Group] -O [Output-Group] -P [parameter]' -``` + '{processor needed without prefix 'ocrd-'} -I {Input-Group} -O {Output-Group}' \ + '{processor needed without prefix 'ocrd-'} -I {Input-Group} -O {Output-Group} -P {parameter} {value}' +## alternatively, using docker: +docker run --rm -u $(id -u) -v $PWD:/data -- ocrd/all:maximum ocrd process \ + '{processor needed without prefix 'ocrd-'} -I {Input-Group} -O {Output-Group}' \ + '{processor needed without prefix 'ocrd-'} -I {Input-Group} -O {Output-Group} -P {parameter} {value}' + ``` -Your command could e.g. 
look like this: +For example, your command could look like this: ```sh ocrd process \ @@ -297,103 +311,59 @@ ocrd process \ 'tesserocr-segment-region -I OCR-D-SEG-PAGE -O OCR-D-SEG-BLOCK' \ 'tesserocr-segment-line -I OCR-D-SEG-BLOCK -O OCR-D-SEG-LINE' \ 'tesserocr-recognize -I OCR-D-SEG-LINE -O OCR-D-OCR-TESSEROCR -P model Fraktur' -## alternatively using docker -docker run --rm -u $(id -u) -v $PWD:/data -w /data -- ocrd/all:maximum ocrd process \ +## alternatively, using docker: +docker run --rm -u $(id -u) -v $PWD:/data -- ocrd/all:maximum ocrd process \ 'cis-ocropy-binarize -I OCR-D-IMG -O OCR-D-SEG-PAGE' \ 'tesserocr-segment-region -I OCR-D-SEG-PAGE -O OCR-D-SEG-BLOCK' \ 'tesserocr-segment-line -I OCR-D-SEG-BLOCK -O OCR-D-SEG-LINE' \ 'tesserocr-recognize -I OCR-D-SEG-LINE -O OCR-D-OCR-TESSEROCR -P model Fraktur' ``` -Each specified processor will take all the files in your files in the respective Input-Group `-I`, process them and save the -results in the respective Ouput-Group `-O`. It will also add the information about this processing step and its results to the METS file in your workspace. -The processors work on the files sequentially. So at first all files will be processed with the first processor (e.g. binarized), then all files -will be processed by the second processor (e.g. segmented) etc. In the end your workspace should contain a folder for each Output-Group -O specified -in your workflow, which contains the (intermediate) processing results. - -**Note:** In contrast to calling a single processor, for `ocrd process` you leave -out the prefix `ocrd-` before the name of a particular processor. - -#### Taverna - -Taverna is a more sophisticated workflow-software which allows you to specify a -particular workflow in a file and call this workflow, or rather its file, on -several workspaces. - -Note that Taverna is not included in your [`ocrd_all`](https:/github.com/OCR-D/ocrd_all) installation. 
Therefore, you still might have to install it following this [setup guide](setup.md).

Taverna comes with several predefined workflows which can help you getting started. These are stored in the `/conf` directory.

1. parameters.txt (best results without gpu)
2. parameters_fast.txt (good results for slower computers)
3. parameters_gpu.txt (best results with gpu)

**Note:** Those workflows are only tested with a limited set of pages of the 17./18. century. Results may be worse for other prints.

For every workflow at least two files are needed: A `workflow_configuration` file contains a particular workflow which is invoked by a `parameters` file. For calling a workflow via Taverna, change into the `Taverna` folder and use the following command:

```sh
bash startWorkflow.sh [particular parameters.txt] [/path/to/your/workspace]
## alternatively using docker
docker run --rm --network="host" -v $PWD:/data -- ocrd/taverna process [particular parameters.txt] [relative/path/to/your/workspace]
```

The images in your indicated workspace will be processed and the respective
output will be saved into the same workspace.

When you want to adjust a workflow for better results on your particular
images, you should start off by copying the original `workflow_configuration`
and `parameters` files. To this end, change to the `/conf` subdirectory of
`Taverna` and use the following commands:

```sh
conf$ cp workflow_configuration.txt [name of your new workflow_configuration.txt]
conf$ cp parameters.txt [name of your new parameters.txt]
```

Open the new `parameters.txt` file with an editor like e.g. Nano and change the
name of the old `workflow_configuration.txt` specified in this file to the name
of your new `workflow_configuration.txt` file:

```sh
nano [name of your new workflow_configuration.txt]
```

Then open your new `workflow_configuration.txt` file respectively and adjust it to your needs by exchanging or adding the specified processors of parameters. The first column contains the name of the processor, the following two columns indicate the names of the input and the output filegroups. The forth column for group-ID can be left blank. In the last column you can indicate the log level.

If your processor requires a parameter, it has to be specified in the fith column. As with parameters when calling processors directly on the CLI, there are two ways how to specify them. You can either call a `json` file which should be stored in Taverna's subdirectory `models`. See [Calling a single processor](TODO) on how to create `json` files. Alternatively, you can directly write down the parameter needed using the following syntax:

```sh
{\"[param1]\":\"[value1]\",\"[param2]\":\"[value2]\",\"[param3]\":\"[value3]\"}
e.g.
{\"level-of-operation\":\"page\"}
```

**Note:** Avoid white spaces and escape double quotes with backslash.

+Each specified processor will read the files in the respective fileGrp `Input-Group`,
+process them accordingly, and write the results in the respective fileGrp `Output-Group`
+in your workspace (i.e. both as files on the filesystem and referenced in the `mets.xml`).
+It will also add information about this processing step in the METS metadata.

The processors work on the files sequentially.
So at first, all pages will be processed
+with the first processor (e.g. binarized), then all pages will be processed
+by the second processor (e.g. segmented) etc.

In the end, your workspace should contain a directory (and fileGrp) with (intermediate)
processing results for each output fileGrp specified in the workflow.

> **Note:** In contrast to calling a single processor, for `ocrd process` you leave
out the prefix `ocrd-` before the name of a particular processor.

For information on the available processors, see [section at the end](#get-more-information-about-processors).
- #### workflow-configuration -workflow-configuration is another tool for specifying OCR-D workflows and running them. It uses GNU make as workflow engine, treating document processing like software builds (including incremental and parallel computation). Configurations are just makefiles, targets are workspaces and their file groups. +`ocrd-make` is another tool for specifying OCR-D workflows and running them. It combines GNU `parallel` with GNU `make` +as workflow engine, treating document processing like software builds (including incremental and parallel computation). +Configurations are just makefiles, targets are workspaces and their file groups. -In contrast to Taverna it is included in ocrd_all, therefore you most likely already installed it with the other OCR-D-processors. +It is included in [ocrd_all](https://github.com/OCR-D/ocrd_all), therefore you most likely already installed it along +with the other OCR-D processors. -The `workflow-configuration` directory already contains several workflows, which were tested against the Ground Truth provided by OCR-D. For the CER of those workflows in our tests see [the table on GitHub](https://github.com/bertsky/workflow-configuration#usage). +The `workflow-configuration` directory already contains several example workflows, which were tested against the +Ground Truth provided by OCR-D. For CER results of those workflows in our tests see [the table on GitHub](https://github.com/bertsky/workflow-configuration#usage). -**Note:** Most workflows are configured for GT data, i.e. they expect preprocessed images which were already segmented at least down to line level. If you want to run them on raw images, you have to add some preprocessing and segmentation steps first. Otherwise they will fail. +> **Note:** Most workflows are configured for GT data, i.e. they expect preprocessed images which were already segmented +> at least down to line level. 
If you want to run them on raw images, you have to add some preprocessing and segmentation steps first. +> Otherwise they will fail. -In order to run a workflow, change into your data directory (that contains the workspaces) and call the desired configuration file on your workspace(s): +In order to run a workflow, change into your data directory (that contains the workspaces) and call the desired configuration file +on your workspace(s): ```sh -ocrd-make -f [name_of_your_workflow.mk] [/path/to/your/workspace1] [/path/to/your/workspace2] +ocrd-make -f {name_of_your_workflow.mk} [/path/to/your/workspace1] [/path/to/your/workspace2] ... ``` -As indicated in the command above, you can run a workflow on several workspaces by listing them after one another. Or use the special target `all` for all the workspaces in the current directory. -The documents in those workspaces will be processed and the respective -output along with the log files will be saved into the same workspace(s). +As indicated in the command above, you can run a workflow on several workspaces by listing them after one another. +Or use the special target `all` for all the workspaces in the current directory. +The documents in those workspaces will be processed and the respective output +along with the log files will be saved into the same workspace(s). For an overview of all available targets and workspaces: @@ -411,74 +381,66 @@ images, you should start off by copying the original `workflow.mk` file: ```sh -cp workflow.mk [name_of_your_new_workflow_configuration.mk] +cp workflow.mk {name_of_your_new_workflow_configuration.mk} ``` -Then open the new file with an editor which understands `make` syntax like e.g. `nano`, and exchange or add the processors or parameters to your needs: +Then open the new file with an editor which understands `make` syntax like e.g. 
`nano`, +and exchange or add the processors or parameters to your needs: ```sh -nano [name_of_your_new_workflow_configuration.mk] +nano {name_of_your_new_workflow_configuration.mk} ``` -You can write new rules by using file groups as prerequisites/targets in the normal GNU make syntax. The first target defined must be the default goal that builds the very last file group for that configuration. Alternatively a variable `.DEFAULT_GOAL` pointing to that target can be set anywhere in the makefile. +You can write new rules by using file groups as prerequisites/targets in the normal GNU make syntax. +The first target defined must be the default goal that builds the very last file group for that configuration. +Alternatively a variable `.DEFAULT_GOAL` pointing to that target can be set anywhere in the makefile. -**Note:** Also see the [extensive Readme of workflow-configuration](https://bertsky.github.io/workflow-configuration) on how to adjust the preconfigured workflows to your needs. +> **Note:** Also see the [extensive Readme of workflow-configuration](https://bertsky.github.io/workflow-configuration) +> on how to adjust the preconfigured workflows to your needs. -Each specified processor will take all the files in your files in the respective Input-Group `-I`, process them and save the -results in the respective Ouput-Group `-O`. It will also add the information about this processing step and its results to the METS file in your workspace. -The processors work on the files sequentially. So at first all files will be processed with the first processor (e.g. binarized), then all files -will be processed by the second processor (e.g. segmented) etc. In the end your workspace should contain a folder for each Output-Group -O specified -in your workflow, which contains the (intermediate) processing results. 
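To make the rule syntax above concrete, here is a sketch of a minimal single-step configuration. The `TOOL`/`PARAMS` target variables follow the conventions described in the workflow-configuration Readme; treat this as an illustration and check the Readme for the exact syntax:

```sh
# write a hypothetical one-step configuration (illustrative sketch only):
cat > my-binarize.mk <<'EOF'
# binarize fileGrp OCR-D-IMG into fileGrp OCR-D-BIN
OCR-D-BIN: OCR-D-IMG
OCR-D-BIN: TOOL = ocrd-olena-binarize
OCR-D-BIN: PARAMS = "impl": "sauvola"
.DEFAULT_GOAL = OCR-D-BIN
EOF
```

It could then be run as usual with `ocrd-make -f my-binarize.mk path/to/your/workspace`.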
- -#### Translating native commands to docker calls -The native calls presented above are simple to translate to commands based on the -docker images by prepending the boilerplate telling Docker which image to use, +#### Translating native commands to Docker calls +The command calls presented above are easy to translate for use in our +Docker images – simply by prepending the boilerplate telling Docker which image to use, which user to run as, which files to bind to a container path etc. For example a call to -[`ocrd-tesserocr-binarize`](https://github.com/OCR-D/tesserocr) might natively -look like this: +[`ocrd-tesserocr-segment`](https://github.com/OCR-D/ocrd_tesserocr) might natively +look like this … ```sh -ocrd-tesserocr-segment-region -I OCR-D-IMG -O OCR-D-SEG-BLOCK +ocrd-tesserocr-segment -I OCR-D-IMG -O OCR-D-SEG ``` -To run it with the [`ocrd/all:maximum`] Docker container: +… to run it with the [`ocrd/all:maximum`] Docker container … ```sh -docker run -u $(id -u) -v $PWD:/data -w /data -- ocrd/all:maximum ocrd-tesserocr-segment-region -I OCR-D-IMG -O OCR-D-SEG-BLOCK - \_________/ \___________/ \______/ \_________________/ \___________________________________________________________/ - (1) (2) (3) (4) (5) +docker run -u $(id -u) -v $PWD:/data -v ocrd-models:/usr/local/share/ocrd-resources -- ocrd/all:maximum ocrd-tesserocr-segment -I OCR-D-IMG -O OCR-D-SEG + \_________/ \___________/ \_____________________________________________/ \________________/ \______________________________________________/ + (1) (2) (3) (4) (5) ``` - -* (1) tells Docker to run the container as the calling user instead of root -* (2) tells Docker to bind the current working directory as the `/data` folder in the container -* (3) tells Docker to change the container's working directory to `/data` -* (4) tells docker which image to run -* (5) is the unchanged call to `ocrd-tesserocr-segment-region` +* (1) tells Docker to run the container as the calling user (who should have write 
access to the CWD) instead of root
+* (2) tells Docker to bind-mount the current working directory (CWD) under `/data` in the container
+* (3) tells Docker to mount the **named volume** `ocrd-models` at `/usr/local/share/ocrd-resources` in the container (i.e. the location for all models)
+* (4) tells Docker which image to spawn a container for
+* (5) is the unchanged call to the processor

-**Note:** Add `-v $PWD/models:/usr/local/share/tessdata` when using `ocrd-tesserocr-recognize` and/or add
-`-v $PWD/models:/usr/local/share/ocrd-resources` when using processors which need models to run in general.
+> **Note**: You can replace the host-side path in (2) with any absolute directory path.

-It can also be useful to delete the container after creation with the `--rm`
-parameter.
+> **Note**: Make sure to keep re-using the same named volume for models and other file resources under (3).
+> For details, see [models and Docker](models#models-and-docker).

-### Specifying New OCR-D-Workflows
+> **Note**: It can also be useful to have Docker automatically delete the container after termination
+> by adding the `--rm` option.

-When you want to specify a new workflow adapted to the features of particular
-images, we recommend using an existing workflow as specified in `Taverna` or
-`workflow-configuration` as starting point. You can adjust it to your needs by
-exchanging or adding the specified processors of parameters. For an overview on
-the existing processors, their tasks and features, see the [next section](#get-more-information-about-processors) and our [workflow guide](workflows.html).
+### Specifying new OCR-D workflows
+When you want to specify a new workflow adapted to the features of particular
+images, we recommend using an existing workflow as specified in the [Workflow Guide](workflows)
+as a starting point. You can adjust it to your needs by exchanging or adding the specified parameters
+and/or processors.
For an overview on the existing processors, their tasks and features, see the +[next section](#get-more-information-about-processors) and our [workflow guide](workflows.html). @@ -486,23 +448,26 @@ the existing processors, their tasks and features, see the [next section](#get-m To get all available processors you might use the autocomplete in your preferred console. -**Note:** If you installed OCR-D via Docker make sure you run the bash shell in the ocrd docker image as described in the section [Preparations](#docker-installation). -If you installed OCR-D natively, activate the virtual environment first as described in the section [Preparations](#native-installation-activate-virtual-environment). +> **Note**: If you installed OCR-D via Docker, make sure you run the interactive bash shell +> on the ocrd/all Docker image as described in the section [Preparations](#docker-installation). +> If you installed OCR-D natively, activate the virtual environment first as described in the section +> [Preparations](#native-installation-activate-virtual-environment). -Type 'ocrd-' followed by `TAB` to get a list of all available processors. +Type `ocrd-` followed by a tab character (for autocompletion proposals) to get a list of all available processors. -To get further information about a particular processor, you can call `--help` +To get further information about a particular processor, call it with `--help` or `-h`: ```sh [name_of_selected_processor] --help -## alternatively using docker -docker run --rm -u $(id -u) -v $PWD:/data -w /data -- ocrd/all:maximum [name_of_selected_processor] --help +## alternatively, using docker: +docker run --rm -u $(id -u) -v $PWD:/data -- ocrd/all:maximum [name_of_selected_processor] --help ``` ### Using models -Several processors rely on models which have to be downloaded beforehand. 
An overview on the existing model repositories and short
-descriptions on the most important models can be found [in our models documentation](https://ocr-d.de/en/models).
-We strongly recommend to use the [OCR-D resource manager](https://ocr-d.de/en/models) to download the models, as this makes it
-easier to both download and use them.
+Several processors rely on models, which usually have to be downloaded beforehand.
+An overview on the existing model repositories and short descriptions on the most important models
+can be found [in our models documentation](https://ocr-d.de/en/models).
+We strongly recommend using the [OCR-D resource manager](https://ocr-d.de/en/models) to download the models,
+as this makes it easy to both download and use them.

From 068933d9343125ed34e0d785e5d87bd7e58a6a6f Mon Sep 17 00:00:00 2001
From: Robert Sachunsky <38561704+bertsky@users.noreply.github.com>
Date: Thu, 22 Jun 2023 19:20:33 +0200
Subject: [PATCH 2/4] models: update, too

---
 site/en/models.md | 82 ++++++++++++++++++++++++++++-------------------
 1 file changed, 49 insertions(+), 33 deletions(-)

diff --git a/site/en/models.md b/site/en/models.md
index b2117e0e1..dbccf2474 100644
--- a/site/en/models.md
+++ b/site/en/models.md
@@ -10,33 +10,35 @@ title: Models for OCR-D processors

OCR engines rely on pre-trained models for their recognition. Every engine has
its own internal format(s) for models. Some support central storage of models
-at a specific location (tesseract, ocropy, kraken) while others require the full
-path to a model (calamari).
+at a specific location (Tesseract, Ocropy, Kraken) while others require the full
+path to a model (Calamari).
+
+Moreover, many processors provide other file resources like configuration files or presets.

Since [v2.22.0](https://github.com/OCR-D/core/releases/v2.22.0), OCR-D/core
-comes with a framework for managing processor resources uniformly.
+comes with a framework for managing **file resources** uniformly.
This means that processors can delegate to OCR-D/core to resolve specific file resources by name, looking in well-defined places in the filesystem. This also includes downloading and caching file parameters passed as a URL. Furthermore, OCR-D/core comes with a bundled database of known resources, such as models, dictionaries, configurations and other -processor-specific data files. This means that OCR-D users should be able to -concentrate on fine-tuning their OCR workflows and not bother with implementation -details like "where do I get models from and where do I put them". +processor-specific data files. Processors can add their own specifications to that. + +This means that OCR-D users should be able to concentrate on fine-tuning their OCR workflows +and not bother with implementation details like "where do I get models from and where do I put them". In particular, users can reference file parameters by name now. -All of the above mentioned functionality can be accessed using the `ocrd -resmgr` command line tool. +All of the above mentioned functionality can be accessed using the `ocrd resmgr` command line tool. ## What models are available? -To get a list of the resources that the OCR-D/core [is aware -of](https://github.com/OCR-D/core/blob/master/ocrd/ocrd/resource_list.yml): +To get a list of the (available or installed) file resources that OCR-D/core +[is aware of](https://github.com/OCR-D/core/blob/master/ocrd/ocrd/resource_list.yml): ``` ocrd resmgr list-available # alternatively, using Docker: -mkdir -p $PWD/models/ocrd-tesserocr-recognize -docker run --volume $PWD:/data --volume $PWD/models:/usr/local/share -w /data -- ocrd/all:maximum ocrd resmgr list-available``` +docker run --volume $PWD:/data --volume ocrd-models:/usr/local/share/ocrd-resources -- ocrd/all:maximum ocrd resmgr list-available +``` The output will look similar to this: @@ -64,9 +66,15 @@ The second line of each entry contains a short description of the resource. 
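Complementing the listing of available resources, you can also inspect what is already present locally. A sketch (the `-e` filter option is an assumption — verify against `ocrd resmgr --help` for your OCR-D/core version):

```sh
# list resources that are already downloaded/installed, with their locations
ocrd resmgr list-installed
# hypothetically narrow the listing to a single processor
ocrd resmgr list-installed -e ocrd-tesserocr-recognize
```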
## Installing resources -On installing resources in OCR-D, read the sections [Installing known resources](#installing-known-resources) and [Installing unknown resources](#installing-unknown-resources). +On installing resources in OCR-D, read the follow-up sections +[Installing known resources](#installing-known-resources) and +[Installing unknown resources](#installing-unknown-resources). + +*Known resources* are resources that are provided by processor developers [in the `ocrd-tool.json`](/en/spec/ocrd_tool#file-parameters) +and are available by name to `ocrd resmgr download`. -*Known resources* are resources that are provided by processor developers [in the `ocrd-tool.json`](/en/spec/ocrd_tool#file-parameters) and are available by name to `ocrd resmgr download`, whereas *unknown* resources are models, configurations, parameter sets etc. you provide yourself or found elsewhere on the Internet, which require passing a URL to `ocrd resmgr download`. +In contrast, *unknown* resources are models, configurations, parameter sets etc. that you provide yourself +or found elsewhere on the Internet, which require passing a URL to `ocrd resmgr download`. **If you installed OCR-D via Docker,** read the section [Models and Docker](#models-and-docker) *additionally*. @@ -126,33 +134,42 @@ ocrd-tesserocr-recognize -P model mymodel ### Models and Docker -If you are using OCR-D with Docker, we recommend keeping all downloaded resources in a persistent host directory, -separate of the OCR-D Docker container(s) and data directory, and mounting that -resource directory into a specific path in the container alongside the data directory. -The host resource directory can be empty initially. Each time you run the Docker container, -your processors will access the host directory to resolve resources, and you can download -additional models into that location using `ocrd resmgr`. 
+If you are using OCR-D with Docker, we recommend keeping all downloaded resources **persistently** +in a host directory, independent of both, +- the OCR-D Docker container(s) internal storage (which is transient, i.e. any change over the image + gets lost with each new `docker run`), and +- the data directory (which may be on a different filesystem). + +That resource directory needs to be mounted into a specific path in the container, as does the data directory: +- `/usr/local/share/ocrd-resources`: resource files (to be mounted as a **named volume**), +- `/data`: input/output files (to be mounted any way you like, probably a **bind mount**). + +Initially, (if you use a named volume, not a bind mount) the host resource directory will contain only +those resources that have been pre-installed into the processors' module directories. Each time you run +the Docker container, the Resource Manager and the processors will access that directory from the inside +to resolve resources, so you can download additional models into that location using `ocrd resmgr`, and +later use them in workflows. The following will assume (without loss of generality) that your host-side data -path is under `./data`, and the host-side resource path is under `./models`: +path is under `./data`, and the host-side volume is called `ocrd-models`: -To download models to `./models` in the host FS and `/usr/local/share/ocrd-resources` in the container FS: +To download models to `ocrd-models` in the host FS and `/usr/local/share/ocrd-resources` in the container FS: ```sh docker run --user $(id -u) \ - --volume $PWD/models:/usr/local/share/ocrd-resources \ + --volume ocrd-models:/usr/local/share/ocrd-resources \ ocrd/all \ ocrd resmgr download ocrd-tesserocr-recognize eng.traineddata\; \ ocrd resmgr download ocrd-calamari-recognize default\; \ ... 
``` -To run processors, as usual do: +To run processors, then as usual do: ```sh docker run --user $(id -u) --workdir /data \ --volume $PWD/data:/data \ - --volume $PWD/models:/usr/local/share/ocrd-resources \ + --volume ocrd-models:/usr/local/share/ocrd-resources \ ocrd/all ocrd-tesserocr-recognize -I IN -O OUT -P model eng ``` @@ -184,19 +201,20 @@ The lookup algorithm is [defined in our specifications](https://ocr-d.de/en/spec In order of preference, a resource `` for a processor `ocrd-foo` is searched at: -* `$PWD/ocrd-resources/ocrd-foo/` +* `$PWD/` * `$XDG_DATA_HOME/ocrd-resources/ocrd-foo/` * `/usr/local/share/ocrd-resources/ocrd-foo/` -* `$VIRTUAL_ENV/lib/python3.6/site-packages/ocrd-foo/` or `$VIRTUAL_ENV/share/ocrd-foo/` +* `$VIRTUAL_ENV/lib/python3.6/site-packages/ocrd-foo/` or `$VIRTUAL_ENV/share/ocrd-foo/` + (or whatever the processor's internal module location is) -(where `XDG_DATA_HOME` defaults to `$HOME/.local/share` if unset). +(where `$XDG_DATA_HOME` defaults to `$HOME/.local/share` if unset). We recommend using the `$XDG_DATA_HOME` location, which is also the default. But you can override the location to store data with the `--location` option, which can be `cwd`, `data`, `system` and `module` resp. 
```sh -# will download to $PWD/ocrd-resources/ocrd-anybaseocr-dewarp/latest_net_G.pth +# will download to $PWD/latest_net_G.pth ocrd resmgr download --location cwd ocrd-anybaseocr-dewarp latest_net_G.pth # will download to /usr/local/share/ocrd-resources/ocrd-anybaseocr-dewarp/latest_net_G.pth ocrd resmgr download --location system ocrd-anybaseocr-dewarp latest_net_G.pth @@ -252,8 +270,6 @@ the `ocrd-calamari-recognize` processor, use the `checkpoint_dir` parameter: ocrd-calamari-recognize -I OCR-D-SEG-LINE -O OCR-D-OCR-CALA # To use your own trained model ocrd-calamari-recognize -I OCR-D-SEG-LINE -O OCR-D-OCR-CALA -P checkpoint_dir /path/to/modeldir -# or, to be able to control which checkpoints to use: -ocrd-calamari-recognize -I OCR-D-SEG-LINE -O OCR-D-OCR-CALA -P checkpoint '/path/to/modeldir/*.ckpt.json' ``` ## Tesseract / ocrd_tesserocr @@ -262,7 +278,7 @@ Tesseract models are single files with a `.traineddata` extension. Since Tesseract only supports model lookup in a single directory, and we want to share the tessdata directory with the standalone CLI, -ocrd_tesserocr resources must be stored in the `module` location. +`ocrd_tesserocr` resources must be stored in the `module` location. If the default path of that location is not the place you want to use for Tesseract models, then either recompile Tesseract with the `tessdata` path you had in mind, or use the `TESSDATA_PREFIX` environment variable to override the `module` location at runtime. @@ -307,4 +323,4 @@ you will still have to install it, first. For information on the setup and the t # Further reading -If you just installed OCR-D and want to know how to process your own data, please see the [user guide](/en/user_guide). \ No newline at end of file +If you just installed OCR-D and want to know how to process your own data, please see the [user guide](/en/user_guide). 
From a07f613bf6e875549e7d3d0e6cad47426ee22eea Mon Sep 17 00:00:00 2001 From: Robert Sachunsky Date: Sat, 24 Jun 2023 00:15:36 +0200 Subject: [PATCH 3/4] model and user guide: /models instead of /usr/local/share/ocrd-resources --- site/en/models.md | 89 ++++++++++--------- site/en/user_guide.md | 193 +++++++++++++++++++++++------------------- 2 files changed, 153 insertions(+), 129 deletions(-) diff --git a/site/en/models.md b/site/en/models.md index dbccf2474..61a4c4c1e 100644 --- a/site/en/models.md +++ b/site/en/models.md @@ -37,13 +37,12 @@ To get a list of the (available or installed) file resources that OCR-D/core ``` ocrd resmgr list-available # alternatively, using Docker: -docker run --volume $PWD:/data --volume ocrd-models:/usr/local/share/ocrd-resources -- ocrd/all:maximum ocrd resmgr list-available +docker run --volume ocrd-models:/models -- ocrd/all:maximum ocrd resmgr list-available ``` The output will look similar to this: ``` - ocrd-calamari-recognize - qurator-gt4hist-0.3 (https://qurator-data.de/calamari-models/GT4HistOCR/2019-07-22T15_49+0200/model.tar.xz) Calamari model trained with GT4HistOCR @@ -70,56 +69,60 @@ On installing resources in OCR-D, read the follow-up sections [Installing known resources](#installing-known-resources) and [Installing unknown resources](#installing-unknown-resources). -*Known resources* are resources that are provided by processor developers [in the `ocrd-tool.json`](/en/spec/ocrd_tool#file-parameters) +*Known* resources are resources that are provided by processor developers [in the `ocrd-tool.json`](/en/spec/ocrd_tool#file-parameters) and are available by name to `ocrd resmgr download`. -In contrast, *unknown* resources are models, configurations, parameter sets etc. that you provide yourself -or found elsewhere on the Internet, which require passing a URL to `ocrd resmgr download`. +*Unknown* resources, in contrast, are models, configurations, parameter sets etc. 
that you provide yourself +or found elsewhere on the Internet, which require passing a URL (or local path) to `ocrd resmgr download`. **If you installed OCR-D via Docker,** read the section [Models and Docker](#models-and-docker) *additionally*. ### Installing known resources You can install resources with the `ocrd resmgr download` command. It expects -the name of the processor as the first argument and either the name or URL of a -resource as a second argument. +the name of the processor as the 1st argument and the name of a resource as a 2nd argument. -Although model distribution is not currently centralised within OCR-D, we -are working towards a central model repository. +Since model distribution is decentralised within OCR-D, every processor can advertise its +own known resources, which the resource manager then picks up. For example, to install the `LatinHist.pyrnn.gz` resource for `ocrd-cis-ocropy-recognize`: ``` ocrd resmgr download ocrd-cis-ocropy-recognize LatinHist.pyrnn.gz -# or -ocrd resmgr download ocrd-cis-ocropy-recognize https://github.com/chreul/OCR_Testdata_EarlyPrintedBooks/raw/master/LatinHist-98000.pyrnn.gz ``` This will look up the resource in the [bundled resource and user databases](#user-database), download, unarchive (where applicable) and store it in the [proper location](#where-is-the-data). -**NOTE:** The special name `*` can be used instead of a resource name/url to -download *all* known resources for this processor. To download all tesseract models: +> **NOTE:** The special name `*` can be used instead of a resource name/url to +> download *all* known resources for this processor. 
To download all tesseract models: ```sh ocrd resmgr download ocrd-tesserocr-recognize '*' ``` -**NOTE:** Equally, the special processor `*` can be used instead of a processor and a resource -to download *all* known resources for *all* installed processors: +> **NOTE:** Equally, the special processor `*` can be used instead of a processor and a resource +> to download *all* known resources for *all* installed processors: ```sh ocrd resmgr download '*' ``` -(In either case, `*` must be in quotes or escaped to avoid wildcard expansion by the shell.) +> (In either case, `*` must be in quotes or escaped to avoid wildcard expansion by the shell.) ### Installing unknown resources -If you need to install a resource which OCR-D doesn't know of, that can be achieved by passings its URL in combination with the `--any-url/-n` flag to `ocrd resmgr download`: +If you need to install a resource which OCR-D does not know of, that can be achieved by passing +its URL in combination with the `--any-url/-n` flag to `ocrd resmgr download`. -To install a model for `ocrd-tesserocr-recognize` that is located at `https://my-server/mymodel.traineddata`. +For example, to install the same model for `ocrd-cis-ocropy-recognize` as above: + +``` +ocrd resmgr download -n https://github.com/chreul/OCR_Testdata_EarlyPrintedBooks/raw/master/LatinHist-98000.pyrnn.gz ocrd-cis-ocropy-recognize LatinHist.pyrnn.gz +``` + +Or to install a model for `ocrd-tesserocr-recognize` that is located at `https://my-server/mymodel.traineddata`: ``` ocrd resmgr download -n https://my-server/mymodel.traineddata ocrd-tesserocr-recognize mymodel.traineddata @@ -135,29 +138,30 @@ ocrd-tesserocr-recognize -P model mymodel ### Models and Docker If you are using OCR-D with Docker, we recommend keeping all downloaded resources **persistently** -in a host directory, independent of both, -- the OCR-D Docker container(s) internal storage (which is transient, i.e. 
any change over the image - gets lost with each new `docker run`), and -- the data directory (which may be on a different filesystem). +in a host directory, independent of both: +- the Docker container's internal storage (which is transient, i.e. any change over the image + gets lost with each new `docker run`), +- the host's data directory (which may be on a different filesystem). That resource directory needs to be mounted into a specific path in the container, as does the data directory: -- `/usr/local/share/ocrd-resources`: resource files (to be mounted as a **named volume**), -- `/data`: input/output files (to be mounted any way you like, probably a **bind mount**). +- `/models`: resource files (to be mounted as a **named volume**, e.g. `-v ocrd-models:/models`), +- `/data`: input/output files (to be mounted any way you like, probably a **bind mount**, e.g. `-v $PWD:/data`), +- `/tmp`: temporary files (ideally as **tmpfs**, e.g. `--tmpfs /tmp`) -Initially, (if you use a named volume, not a bind mount) the host resource directory will contain only -those resources that have been pre-installed into the processors' module directories. Each time you run +Initially, (if you use a named volume, not a bind mount,) the host resource directory will contain only +those resources that have been **pre-installed** into the processors' module directories. Each time you run the Docker container, the Resource Manager and the processors will access that directory from the inside -to resolve resources, so you can download additional models into that location using `ocrd resmgr`, and -later use them in workflows. +to resolve resources, so you can **download additional** models into that location using `ocrd resmgr`, and +later **use them** in workflows. 
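Since `ocrd-models` is an ordinary Docker named volume, it can be managed with the standard Docker CLI (using the same volume name as throughout this guide):

```sh
# create the volume explicitly (docker run would also create it on first use)
docker volume create ocrd-models
# show its metadata, including the mountpoint on the host filesystem
docker volume inspect ocrd-models
# remove it again — this deletes all downloaded models!
docker volume rm ocrd-models
```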
The following will assume (without loss of generality) that your host-side data path is under `./data`, and the host-side volume is called `ocrd-models`: -To download models to `ocrd-models` in the host FS and `/usr/local/share/ocrd-resources` in the container FS: +To download models to `ocrd-models` in the host FS and `/models` in the container FS: ```sh docker run --user $(id -u) \ - --volume ocrd-models:/usr/local/share/ocrd-resources \ + --volume ocrd-models:/models \ ocrd/all \ ocrd resmgr download ocrd-tesserocr-recognize eng.traineddata\; \ ocrd resmgr download ocrd-calamari-recognize default\; \ @@ -167,9 +171,10 @@ ocrd resmgr download ocrd-calamari-recognize default\; \ To run processors, then as usual do: ```sh -docker run --user $(id -u) --workdir /data \ +docker run --user $(id -u) \ + --tmpfs /tmp \ --volume $PWD/data:/data \ - --volume ocrd-models:/usr/local/share/ocrd-resources \ + --volume ocrd-models:/models \ ocrd/all ocrd-tesserocr-recognize -I IN -O OUT -P model eng ``` @@ -183,17 +188,18 @@ resources and lists URL and description if a database entry exists. ## User database -Whenever the OCR-D/core resource manager encounters an unknown resource in the filesystem or when you install -a resource with `ocrd resmgr download`, it will create a new stub entry in the user database, which is found at -`$HOME/.config/ocrd/resources.yml` and created if it doesn't exist. +Whenever the OCR-D/core resource manager encounters an unknown resource in the filesystem, or when you install +a resource with `ocrd resmgr download`, it will add a new stub entry in the user database, which is found at +`$XDG_CONFIG_HOME/ocrd/resources.yml` (where `$XDG_CONFIG_HOME` defaults to `$HOME/.config` if unset) and +gets created if it does not exist. This allows you to use the OCR-D/core resource manager mechanics, including lookup of known resources by name or URL, without relying (only) on the database maintained by the OCR-D/core developers. 
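For illustration, a stub entry in that user database might look roughly like this — the field names are inferred from the bundled resource list, and the URL and size are made up; check the `resources.yml` generated on your own system for the authoritative layout:

```yaml
# hypothetical stub entry in $HOME/.config/ocrd/resources.yml
ocrd-tesserocr-recognize:
  - name: mymodel.traineddata
    url: https://my-server/mymodel.traineddata
    description: user-installed resource
    size: 12345678
```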
-**NOTE:** If you produced or found resources that are interesting for the wider -OCR(-D) community, please tell us in the [OCR-D gitter -chat](https://gitter.im/OCR-D/Lobby) so we can add it to the database. +> **NOTE:** If you produced or found resources that are interesting for the wider +> OCR(-D) community, please tell us in the [OCR-D gitter chat](https://gitter.im/OCR-D/Lobby) +> or open an issue in the respective Github repository, so we can add it to the database. ## Where is the data @@ -203,8 +209,8 @@ In order of preference, a resource `` for a processor `ocrd-foo` is search * `$PWD/` * `$XDG_DATA_HOME/ocrd-resources/ocrd-foo/` -* `/usr/local/share/ocrd-resources/ocrd-foo/` -* `$VIRTUAL_ENV/lib/python3.6/site-packages/ocrd-foo/` or `$VIRTUAL_ENV/share/ocrd-foo/` +* `/usr/local/share/ocrd-resources/ocrd-foo/` +* `$VIRTUAL_ENV/lib/python3.6/site-packages/ocrd-foo/` or `$VIRTUAL_ENV/share/ocrd-foo/` (or whatever the processor's internal module location is) (where `$XDG_DATA_HOME` defaults to `$HOME/.local/share` if unset). @@ -213,6 +219,9 @@ We recommend using the `$XDG_DATA_HOME` location, which is also the default. But you can override the location to store data with the `--location` option, which can be `cwd`, `data`, `system` and `module` resp. +In Docker though, `$XDG_CONFIG_HOME=$XDG_DATA_HOME/ocrd-resources=/usr/local/share/ocrd-resources` +gets symlinked to `/models` for easier volume handling (and persistency). + ```sh # will download to $PWD/latest_net_G.pth ocrd resmgr download --location cwd ocrd-anybaseocr-dewarp latest_net_G.pth diff --git a/site/en/user_guide.md b/site/en/user_guide.md index f491788e3..303200c03 100644 --- a/site/en/user_guide.md +++ b/site/en/user_guide.md @@ -17,47 +17,53 @@ started with OCR-D after the installation as detailed in the very next two parag obligatory for both Docker and Non-Docker users! Furthermore, Docker commands have a [different syntax than native calls](#translating-native-commands-to-docker-calls). 
-This guide always states native calls first and then provides the respective command for Docker users. +This guide always states native calls first but follows up with the respective command for Docker users. ## Preparations -### Docker installation: +### Docker installation: Run container If you are using the Installation via Docker, we recommend running an interactive shell session in the container: ```sh -docker run --user $(id -u) --volume $PWD:/data --volume ocrd-models:/usr/local/share/ocrd-resources -it ocrd/all bash +docker run --user $(id -u) --tmpfs /tmp --volume $PWD:/data --volume ocrd-models:/models -it ocrd/all bash ``` -After spinning up the container, you can use the installation and call the processors the same way as in the native installation. - -Alternatively, you can [translate each command to a docker call](/en/user_guide#translating-native-commands-to-docker-calls). +After spinning up the container, you can use the internal installation and call the processors +the same way as in the native installation. Alternatively, you can +[translate each command to a docker call](/en/user_guide#translating-native-commands-to-docker-calls) separately. ### Native installation: Activate virtual environment -If you are using a native installation, you should activate the -virtualenv before starting to work with the OCR-D-software. This has either been installed automatically if you installed the -software via ocrd_all, or you should have [installed it yourself](https://packaging.python.org/tutorials/installing-packages/#creating-virtual-environments) before -installing the OCR-D-software individually. Note that you need to specify the path to your virtualenv. If you are simply using the `venv` is created -on-demand by `ocrd_all`, it is contained in your `ocrd_all` directory +If you are using a native installation, you must activate the venv before you can start +working with the OCR-D software. 
It has either been created automatically if you +[installed the software via ocrd_all](setup), or you should have manually +[installed it yourself](https://packaging.python.org/tutorials/installing-packages/#creating-virtual-environments) +before installing the OCR-D software. + +To activate, you need to specify the path to your venv. In the automatic `ocrd_all` case, +it has simply been created under `venv` in your `ocrd_all` directory: ```sh -source ~/venv/bin/activate +# example with manually created venv: +$ source ~/venv/bin/activate -# e.g. for your `ocrd_all` venv -habocr@ocrtest:~$ source ocrd_all/venv/bin/activate -(venv) habocr@ocrtest:~$ +# example for automatically created venv: +$ source ocrd_all/venv/bin/activate + +# when the shell loads the venv, the prompt will change: +(venv) $ ``` -Once you have activated the virtualenv, you should see `(venv)` prepended to -your shell prompt. +Once you have activated the venv in your shell, you should see its name prepended to +the command prompt. -When you are done with your OCR-D-work, you can use `deactivate` to deactivate -your venv. +When you are done with your OCR-D work, you can use `deactivate` to deactivate +the venv (or just terminate the shell). ### Preparing a workspace @@ -67,7 +73,7 @@ i.e. special directories which contain the images to be processed and their corr METS file. Any files generated while processing these images with the OCR-D software will also be stored in this directory. -How you prepare a workspace depends on whether or not you already have a METS file +How you prepare a workspace **depends** on whether or not you **already** have a METS file with the paths (or URLs) to the images you want to process. For usage within OCR-D your METS file should look similar to [this example](example_mets). 
@@ -78,7 +84,7 @@ and load the pictures to be processed with the following command: ```sh ocrd workspace [-d path/to/workspace] clone URL_OF_METS -## alternatively, using docker: +## alternatively, using Docker: docker run --rm -u $(id -u) -v $PWD:/data -- ocrd/all:maximum ocrd workspace clone [-d path/to/your/workspace] URL_OF_METS ``` @@ -97,22 +103,22 @@ List all existing groups: ```sh ocrd workspace [-d /path/to/your/workspace] list-group -## alternatively, using docker: +## alternatively, using Docker: docker run --rm -u $(id -u) -v $PWD:/data -- ocrd/all:maximum ocrd workspace [-d path/to/your/workspace] list-group ``` -This will provide you with the names of all the different file groups in your METS, e.g. THUMBNAILS, -PRESENTATION, MAX. +This will provide you with the names of all the different file groups in your METS, e.g. `THUMBNAILS`, +`PRESENTATION`, `MAX`. -Download all files of one group: +Download all files of one file group: ```sh ocrd workspace [-d path/to/your/workspace] find --file-grp [selected file group] --download -## alternatively, using docker: +## alternatively, using Docker: docker run --rm -u $(id -u) -v $PWD:/data -- ocrd/all:maximum ocrd workspace [-d path/to/your/workspace] find --file-grp [selected file group] --download ``` -This will download all images in the specified file group and save them in a folder named accordingly +This will download all images in the specified file group and save them in a directory named accordingly in your workspace. You are now ready to start processing your images with OCR-D. #### Non-existing METS @@ -123,7 +129,7 @@ workspace: ```sh ocrd workspace [-d path/to/your/workspace] init -## alternatively, using docker: +## alternatively, using Docker: docker run --rm -u $(id -u) -v $PWD:/data -- ocrd/all:maximum ocrd workspace [-d path/to/your/workspace] init ``` @@ -132,46 +138,50 @@ the directory argument** if you want to use the current working directory as target. 
For repeated use, we recommend a `cd path/to/your/workspace` once, so in subsequent operations, the argument can be omitted.) -This will create a file `mets.xml` within the target directory. +This will **create** a file `mets.xml` within the target directory. Then you can set a unique `mods:identifier` … ```sh ocrd workspace [-d path/to/your/workspace] set-id 'unique ID' -## alternatively, using docker: +## alternatively, using Docker: docker run --rm -u $(id -u) -v $PWD:/data -- ocrd/all:maximum ocrd workspace [-d path/to/your/workspace] set-id 'unique ID' ``` -… and copy the directory containing the images to be processed into the workspace directory: +… and copy or symlink the directory containing the images to be processed into the workspace directory: ```sh cp -r path/to/your/images [path/to/your/workspace/]. +ln -s path/to/your/images [path/to/your/workspace/]. ``` -Now you can add images to the empty METS created above by adding -references for their path names. +Now you can add those images to the empty METS created above, +by **adding references** for their path names. You can do this in a number of ways. -You can do this in a number of ways. 
Either with the following simple command:

```sh
-ocrd workspace [-d path/to/your/workspace] add -g {ID of the physical page (must start with a letter)} -G {name of image fileGrp} -i {ID of the image file (must start with a letter)} -m image/{MIME format of that image} {path/to/that/image/file/in/workspace}
-## alternatively, using docker:
-docker run --rm -u $(id -u) -v $PWD:/data -- ocrd/all:maximum ocrd workspace [-d path/to/your/workspace] add -g {ID of the physical page (must start with a letter)} -G {name of image fileGrp} -i {ID of the image file (must start with a letter)} -m image/{MIME format of that image} {path/to/that/image/file/in/workspace}
+ocrd workspace [-d path/to/your/workspace] add -g {ID of the physical page} -G {name of image fileGrp} -i {ID of the image file} -m image/{MIME format of that image} {path/to/that/image/file/in/workspace}
+## alternatively, using Docker:
+docker run --rm -u $(id -u) -v $PWD:/data -- ocrd/all:maximum ocrd workspace [-d path/to/your/workspace] add -g {ID of the physical page} -G {name of image fileGrp} -i {ID of the image file} -m image/{MIME format of that image} {path/to/that/image/file/in/workspace}
```

+> **Note**: Identifiers in XML must always [start with a letter](https://www.w3.org/TR/REC-xml/#NT-Names).
+
For example, your simple commands could look like this:

```sh
ocrd workspace add -g P_00001 -G OCR-D-IMG -i OCR-D-IMG_00001 -m image/tiff images/00001.tif
-ocrd workspace add -g P_00002 -G OCR-D-IMG -i OCR-D-IMG_00001 -m image/tiff images/00002.tif
+ocrd workspace add -g P_00002 -G OCR-D-IMG -i OCR-D-IMG_00002 -m image/tiff images/00002.tif
...
-## alternatively, using docker:
+## alternatively, using Docker:
docker run --rm -u $(id -u) -v $PWD:/data -- ocrd/all:maximum ocrd workspace add -g P_00001 -G OCR-D-IMG -i OCR-D-IMG_00001 -m image/tiff images/00001.tif
docker run --rm -u $(id -u) -v $PWD:/data -- ocrd/all:maximum ocrd workspace add -g P_00002 -G OCR-D-IMG -i OCR-D-IMG_00002 -m image/tiff images/00002.tif
+...
``` Or, if you have lots of images to be added to the METS, you can do this automatically with a `for` loop: -> **Note:** For this method, all images must have the same format (tiff, jpeg, ...) +> **Note**: For this method, all images must have the same format (tiff, jpeg, ...) ```sh FILEGRP="OCR-D-IMG" # name of fileGrp to use @@ -182,7 +192,7 @@ for path in images/*$EXT; do base=`basename $path $EXT`; ## using local ocrd CLI: ocrd workspace add -G $FILEGRP -i ${FILEGRP}_${base} -g P_$base -m $MEDIATYPE $path - ## alternatively, using docker: + ## alternatively, using Docker: docker run --rm -u $(id -u) -v $PWD:/data -- ocrd/all:maximum ocrd workspace add -G $FILEGRP -i ${FILEGRP}_${base} -g P_$base -m $MEDIATYPE $path done ``` @@ -191,7 +201,7 @@ For example, your `for` loop could look like this: ```sh for path in images/*.tif; do base=`basename $path .tif`; ocrd workspace add -G OCR-D-IMG -i OCR-D-IMG_$base -g P_$base -m image/tiff $path; done -## alternatively, using docker: +## alternatively, using Docker: for path in images/*.tif; do base=`basename $path .tif`; docker run --rm -u $(id -u) -v $PWD:/data -- ocrd/all:maximum ocrd workspace add -G OCR-D-IMG -i OCR-D-IMG_$base -g P_$base -m image/tiff $path; done ``` @@ -199,7 +209,7 @@ The log information should inform you about every image which was added to the M In the end, your `mets.xml` should look like this [example METS](example_mets). You are now ready to start processing your images with OCR-D. -Finally, the shell script `ocrd-import` from [workflow-configuration](#workflow-configuration) +Finally, the shell script `ocrd-import` from [workflow-configuration](https://github.com/bertsky/workflow-configuration) is a tool which does all of the above (and can also convert arbitrary image formats and extract from PDFs) automatically. 
For usage options, see:

@@ -216,7 +226,7 @@ while ignoring other files, and finally write everything to `path/to/your/images

```sh
ocrd-import --nonnum-ids --ignore --render 300 path/to/your/images
-## alternatively using docker
+## alternatively, using Docker:
docker run --rm -u $(id -u) -v $PWD:/data -- ocrd/all:maximum ocrd-import -P -i -r 300 path/to/your/images
```

@@ -231,32 +241,33 @@ a fileGrp `OCR-D-IMG` referencing your local image files.

## Using the OCR-D-processors

-### OCR-D-Syntax
+### OCR-D Syntax

-There are several ways for invoking the OCR-D-processors. However, all of those
-ways make use of the following syntax:
+There are several ways of invoking the OCR-D processors. Still, all of them
+make use of the following syntax:

```sh
--I Input-Group # folder of the files to be processed
--O Output-Group # folder for the output of your processor
--P parameter # indication of parameters for a particular processor
+-I Input-Group # fileGrp of the files to be processed
+-O Output-Group # fileGrp for the resulting files
+-P parameter value # (direct assignment of parameters for a particular processor)
+-p parameter-file # (file-based assignment of parameters for a particular processor)
+-g page-range # (range of physical pages to be processed)
```

-**Note:** The `-P` option accepts a parameter name and a parameter value. When we write `-P parameter`, we mean that `parameter` consists of
-`parameter name` and `parameter value`.
-For some processors parameters are purely optional, other processors as e.g. `ocrd-tesserocr-recognize` won't work without one or several parameters.
+> **Note**: For some processors, all parameters are optional, while other processors such as
+> `ocrd-tesserocr-recognize` will not work without some parameter specifications.

### Calling a single processor

If you just want to call a single processor, e.g.
for testing purposes, you can go into your workspace and use the following command:

```sh
-ocrd-{processor needed} -I {Input-Group} -O {Output-Group} [-p {parameter file}] [-P {parameter} {value}]
-## alternatively, using docker:
-docker run --rm -u $(id -u) -v $PWD:/data -- ocrd/all:maximum ocrd-{processor needed} -I {Input-Group} -O {Output-Group} [-p {parameter file}] [-P {parameter} {value}]
+ocrd-{processor needed} -I {Input-Group} -O {Output-Group} [-p {parameter-file}] [-P {parameter} {value}]
+## alternatively, using Docker:
+docker run --rm -u $(id -u) -v $PWD:/data -- ocrd/all:maximum ocrd-{processor needed} -I {Input-Group} -O {Output-Group} [-p {parameter-file}] [-P {parameter} {value}]
```
For example, your processor call command could look like this:
```sh
ocrd-olena-binarize -I OCR-D-IMG -O OCR-D-BIN -P impl sauvola
-## alternatively, using docker:
+## alternatively, using Docker:
docker run --rm -u $(id -u) -v $PWD:/data -- ocrd/all:maximum ocrd-olena-binarize -I OCR-D-IMG -O OCR-D-BIN -P impl sauvola
```

@@ -265,22 +276,22 @@ binarize them and write the results in fileGrp `Output-Group` in your workspace
(i.e. both as files on the filesystem and referenced in the `mets.xml`).
It will also add information about this processing step in the METS metadata.

-> **Note:** For processors using multiple input- or output fileGrps you have to use a comma-separated list.
+> **Note**: For processors using multiple input- or output fileGrps you have to use a comma-separated list.
E.g.:

```sh
ocrd-cor-asv-ann-align -I OCR-D-OCR1,OCR-D-OCR2,OCR-D-OCR3 -O OCR-D-OCR4
-## alternatively, using docker:
+## alternatively, using Docker:
docker run --rm -u $(id -u) -v $PWD:/data -- ocrd/all:maximum ocrd-cor-asv-ann-align -I OCR-D-OCR1,OCR-D-OCR2,OCR-D-OCR3 -O OCR-D-OCR4
```

-> **Note:** If multiple parameter key-value pairs are necessary, each of them has to be preceded by `-P` as in
+> **Note**: If multiple parameter key-value pairs are necessary, each of them has to be preceded by `-P` as in

```sh
... -P param1 value1 -P param2 value2 -P param3 value3
```

-> **Note:** If a value consists of several words with whitespaces, they have to be enclosed in quotation marks
+> **Note**: If a value consists of several words separated by whitespace, it has to be enclosed in quotation marks
> (to prevent the shell from splitting them up) as in
```sh
-P param "value value"
```

@@ -297,7 +308,7 @@ If you quickly want to specify a particular workflow on the CLI, you can use

ocrd process \
  '{processor needed without prefix 'ocrd-'} -I {Input-Group} -O {Output-Group}' \
  '{processor needed without prefix 'ocrd-'} -I {Input-Group} -O {Output-Group} -P {parameter} {value}'
-## alternatively, using docker:
+## alternatively, using Docker:
docker run --rm -u $(id -u) -v $PWD:/data -- ocrd/all:maximum ocrd process \
  '{processor needed without prefix 'ocrd-'} -I {Input-Group} -O {Output-Group}' \
  '{processor needed without prefix 'ocrd-'} -I {Input-Group} -O {Output-Group} -P {parameter} {value}'
```

@@ -311,7 +322,7 @@ ocrd process \
 'tesserocr-segment-region -I OCR-D-SEG-PAGE -O OCR-D-SEG-BLOCK' \
 'tesserocr-segment-line -I OCR-D-SEG-BLOCK -O OCR-D-SEG-LINE' \
 'tesserocr-recognize -I OCR-D-SEG-LINE -O OCR-D-OCR-TESSEROCR -P model Fraktur'
-## alternatively, using docker:
+## alternatively, using Docker:
docker run --rm -u $(id -u) -v $PWD:/data -- ocrd/all:maximum ocrd process \
 'cis-ocropy-binarize -I OCR-D-IMG -O OCR-D-SEG-PAGE' \
 'tesserocr-segment-region -I OCR-D-SEG-PAGE -O
OCR-D-SEG-BLOCK' \
 'tesserocr-segment-line -I OCR-D-SEG-BLOCK -O OCR-D-SEG-LINE' \
 'tesserocr-recognize -I OCR-D-SEG-LINE -O OCR-D-OCR-TESSEROCR -P model Fraktur'
@@ -331,30 +342,31 @@ by the second processor (e.g. segmented) etc.

So In the end your workspace should contain a directory (and fileGrp) with (intermediate)
processing results for each output fileGrp specified in the workflow.

-> **Note:** In contrast to calling a single processor, for `ocrd process` you leave
+> **Note**: In contrast to calling a single processor, for `ocrd process` you leave
out the prefix `ocrd-` before the name of a particular processor.

For information on the available processors see [section at the end](#get-more-information-about-processors).

-#### workflow-configuration
+#### ocrd-make

-`ocrd-make` is another tool for specifying OCR-D workflows and running them. It combines GNU `parallel` with GNU `make`
-as workflow engine, treating document processing like software builds (including incremental and parallel computation).
-Configurations are just makefiles, targets are workspaces and their file groups.
+`ocrd-make` from [workflow-configuration](https://github.com/bertsky/workflow-configuration)
+is another tool for specifying OCR-D workflows and running them. It combines GNU `parallel` with GNU `make`
+as workflow engine, treating document processing like software builds (including incremental and parallel
+computation). Configurations are just makefiles; workspaces and their file groups are just targets.

-It is included in [ocrd_all](https://github.com/OCR-D/ocrd_all), therefore you most likely already installed it along
-with the other OCR-D processors.
+It is included in [ocrd_all](https://github.com/OCR-D/ocrd_all), therefore you most likely already
+[installed it](setup) along with the other OCR-D processors.

-The `workflow-configuration` directory already contains several example workflows, which were tested against the
-Ground Truth provided by OCR-D. For CER results of those workflows in our tests see [the table on GitHub](https://github.com/bertsky/workflow-configuration#usage).
+> **Note**: The `workflow-configuration` distribution contains several example workflows, which were tested
+> against the Ground Truth provided by OCR-D. For CER results of those workflows in our tests see
+> [the table on GitHub](https://github.com/bertsky/workflow-configuration#usage). However, most workflows
+> are configured for GT data, i.e. they expect preprocessed images which were already segmented
+> at least down to line level. If you want to run them on raw images, you have to add some preprocessing
+> and segmentation steps first, otherwise they will fail.

-> **Note:** Most workflows are configured for GT data, i.e. they expect preprocessed images which were already segmented
-> at least down to line level. If you want to run them on raw images, you have to add some preprocessing and segmentation steps first.
-> Otherwise they will fail.
-
-In order to run a workflow, change into your data directory (that contains the workspaces) and call the desired configuration file
-on your workspace(s):
+In order to run a workflow, change into your data directory (which contains the workspaces) and call
+the desired configuration file on your workspace(s):

```sh
ocrd-make -f {name_of_your_workflow.mk} [/path/to/your/workspace1] [/path/to/your/workspace2] ...

@@ -393,10 +405,10 @@ nano {name_of_your_new_workflow_configuration.mk}

You can write new rules by using file groups as prerequisites/targets in the normal GNU make syntax.
The first target defined must be the default goal that builds the very last file group for that configuration.
-Alternatively a variable `.DEFAULT_GOAL` pointing to that target can be set anywhere in the makefile.
+Alternatively, a variable `.DEFAULT_GOAL` pointing to that target can be set anywhere in the makefile.

-> **Note:** Also see the [extensive Readme of workflow-configuration](https://bertsky.github.io/workflow-configuration)
-> on how to adjust the preconfigured workflows to your needs.
+> **Note**: Also see the [extensive Readme of workflow-configuration](https://bertsky.github.io/workflow-configuration) +> on how to write workflows or adjust the preconfigured workflows to your needs. #### Translating native commands to Docker calls The command calls presented above are easy to translate for use in our @@ -414,15 +426,15 @@ ocrd-tesserocr-segment -I OCR-D-IMG -O OCR-D-SEG … to run it with the [`ocrd/all:maximum`] Docker container … ```sh -docker run -u $(id -u) -v $PWD:/data -v ocrd-models:/usr/local/share/ocrd-resources -- ocrd/all:maximum ocrd-tesserocr-segment -I OCR-D-IMG -O OCR-D-SEG - \_________/ \___________/ \_____________________________________________/ \________________/ \______________________________________________/ - (1) (2) (3) (4) (5) +docker run -u $(id -u) -v $PWD:/data -v ocrd-models:/models -- ocrd/all:maximum ocrd-tesserocr-segment -I OCR-D-IMG -O OCR-D-SEG + \_________/ \___________/ \_____________________/ \________________/ \______________________________________________/ + (1) (2) (3) (4) (5) ``` * (1) tells Docker to run the container as the calling user (who should have write access to the CWD) instead of root * (2) tells Docker to bind-mount the current working directory (CWD) under `/data` in the container -* (3) tells Docker to mount `/usr/local/share/ocrd-resources` in the container (i.e. the location for all models) under the **named volume** `ocrd-models` +* (3) tells Docker to mount `/models` in the container (i.e. the location for all models) under the **named volume** `ocrd-models` * (4) tells Docker which image to spawn a container for * (5) is the unchanged call to the processor @@ -434,6 +446,9 @@ docker run -u $(id -u) -v $PWD:/data -v ocrd-models:/usr/local/share/ocrd-resour > **Note**: It can also be useful to have Docker automatically delete the container after termination > by adding the `--rm` option. 
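Since the `$(id -u)` and `$PWD` parts of these calls are substituted by the calling shell before Docker is even invoked, you can preview the fully expanded command line without running anything. A minimal sketch (no Docker or OCR-D required; the image and processor names are just the examples from above), simply prefixing the call with `echo`:

```sh
# Print the command exactly as Docker would receive it, with $(id -u)
# and $PWD already substituted by the shell:
echo docker run -u $(id -u) -v $PWD:/data -- ocrd/all:maximum ocrd-tesserocr-segment -I OCR-D-IMG -O OCR-D-SEG
```

This is a quick way to check quoting and variable expansion before issuing the real call.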
+> **Note**: It can also be useful to have Docker map `/tmp` in the container to faster storage,
+> which can be done via `--tmpfs /tmp` (for a RAM disk) or something like `-v /nvram:/tmp`.
+
### Specifying new OCR-D workflows

When you want to specify a new workflow adapted to the features of particular
@@ -449,7 +464,7 @@ and/or processors. For an overview on the existing processors, their tasks and f
To get all available processors you might use the autocomplete in your preferred console.

> **Note**: If you installed OCR-D via Docker, make sure you run the interactive bash shell
-> on the ocrd/all Docker image as described in the section [Preparations](#docker-installation).
+> on the ocrd/all Docker image as described in the section [Preparations](#docker-installation-run-container).
> If you installed OCR-D natively, activate the virtual environment first as described in the section
> [Preparations](#native-installation-activate-virtual-environment).

@@ -458,9 +473,9 @@ Type `ocrd-` followed by a tab character (for autocompletion proposals) to get a
To get further information about a particular processor, call it with `--help` or `-h`:

```sh
-[name_of_selected_processor] --help
-## alternatively, using docker:
-docker run --rm -u $(id -u) -v $PWD:/data -- ocrd/all:maximum [name_of_selected_processor] --help
+{processor name} --help
+## alternatively, using Docker:
+docker run --rm -u $(id -u) -v $PWD:/data -- ocrd/all:maximum {processor name} --help
```

@@ -468,6 +483,6 @@ docker run --rm -u $(id -u) -v $PWD:/data -- ocrd/all:maximum [name_of_selected_

Several processors rely on models, which usually have to be downloaded beforehand.
An overview on the existing model repositories and short descriptions on the most important models
-can be found [in our models documentation](https://ocr-d.de/en/models).
+can be found [in our Models Guide](https://ocr-d.de/en/models).
We strongly recommend using the [OCR-D resource manager](https://ocr-d.de/en/models) to download the models,
as this makes it easy to both download and use them.

From 3842c6f2d9229969ac09c764f12fe4f81b4eeebe Mon Sep 17 00:00:00 2001
From: Robert Sachunsky
Date: Sat, 24 Jun 2023 00:49:30 +0200
Subject: [PATCH 4/4] model and user guide: improve markdown

---
 site/en/models.md     | 16 ++++++------
 site/en/user_guide.md | 60 +++++++++++++++++++++++++------------------
 2 files changed, 43 insertions(+), 33 deletions(-)

diff --git a/site/en/models.md b/site/en/models.md
index 61a4c4c1e..50c51f734 100644
--- a/site/en/models.md
+++ b/site/en/models.md
@@ -95,14 +95,14 @@ This will look up the resource in the [bundled resource and user databases](#use
unarchive (where applicable) and store it in the [proper
location](#where-is-the-data).

-> **NOTE:** The special name `*` can be used instead of a resource name/url to
+> **Note**: The special name `*` can be used instead of a resource name/URL to
> download *all* known resources for this processor. To download all Tesseract models:

```sh
ocrd resmgr download ocrd-tesserocr-recognize '*'
```

-> **NOTE:** Equally, the special processor `*` can be used instead of a processor and a resource
+> **Note**: Equally, the special processor `*` can be used instead of a processor and a resource
> to download *all* known resources for *all* installed processors:

```sh
@@ -162,10 +162,10 @@ To download models to `ocrd-models` in the host FS and `/models` in the containe
```sh
docker run --user $(id -u) \
--volume ocrd-models:/models \
-ocrd/all \
-ocrd resmgr download ocrd-tesserocr-recognize eng.traineddata\; \
-ocrd resmgr download ocrd-calamari-recognize default\; \
-...
+  ocrd/all \
+  ocrd resmgr download ocrd-tesserocr-recognize eng.traineddata\; \
+  ocrd resmgr download ocrd-calamari-recognize default\; \
+  ...
``` To run processors, then as usual do: @@ -197,7 +197,7 @@ This allows you to use the OCR-D/core resource manager mechanics, including lookup of known resources by name or URL, without relying (only) on the database maintained by the OCR-D/core developers. -> **NOTE:** If you produced or found resources that are interesting for the wider +> **Note**: If you produced or found resources that are interesting for the wider > OCR(-D) community, please tell us in the [OCR-D gitter chat](https://gitter.im/OCR-D/Lobby) > or open an issue in the respective Github repository, so we can add it to the database. @@ -255,7 +255,7 @@ To use a specific model with OCR-D's ocropus wrapper in ocrd-cis-ocropy-recognize -I OCR-D-SEG-LINE -O OCR-D-OCR-OCRO -P model fraktur-jze.pyrnn.gz ``` -**NOTE:** Model must be downloade before with +> **Note**: The model must have been downloaded before with ```sh ocrd resmgr download ocrd-cis-ocropy-recognize fraktur-jze.pyrnn.gz diff --git a/site/en/user_guide.md b/site/en/user_guide.md index 303200c03..21e3fc5f4 100644 --- a/site/en/user_guide.md +++ b/site/en/user_guide.md @@ -10,14 +10,17 @@ title: User Guide for Non-IT Users # User Guide for Non-IT Users -The following guide provides a detailed description on how to use the OCR-D-Software after it has been installed successfully. As explained in the -setup guide, you can either use the [OCR-D-Docker-solution](https://ocr-d.github.io/en/setup#ocrd_all-via-docker), or you can -[install the Software locally](https://ocr-d.github.io/en/setup#ocrd_all-natively). Note that these two options require different prerequisites to get -started with OCR-D after the installation as detailed in the very next two paragraphs. The [third preparatory step](#preparing-a-workspace) is -obligatory for both Docker and Non-Docker users! +The following guide provides a detailed description on how to use the OCR-D software after it has been +[installed](setup) successfully. 
As explained in the [Setup Guide](setup), you can either use the
+[OCR-D Docker solution](https://ocr-d.github.io/en/setup#ocrd_all-via-docker), or you can
+[install the software natively](https://ocr-d.github.io/en/setup#ocrd_all-natively) on your OS.

-Furthermore, Docker commands have a [different syntax than native calls](#translating-native-commands-to-docker-calls).
-This guide always states native calls first but follows up with the respective command for Docker users.
+Depending on which option you prefer, you will need different steps to run OCR-D, as detailed
+in the following two paragraphs. (The [third paragraph](#preparing-a-workspace) is obligatory
+for both Docker and native users.)
+
+Docker commands need [extra syntax over native commands](#translating-native-commands-to-docker-calls).
+This guide always states native calls first, but follows up with the respective command for Docker.

## Preparations

@@ -239,9 +242,9 @@ a fileGrp `OCR-D-IMG` referencing your local image files.
-> when copying and pasting from the sample calls provide on this website.
+> when copying and pasting from the sample calls provided on this website.

-## Using the OCR-D-processors
+## Using the OCR-D processors

-### OCR-D Syntax
+### OCR-D command-line interface syntax

There are several ways for invoking the OCR-D processors. Still, all of them
make use of the following syntax:
@@ -257,12 +260,15 @@ make use of the following syntax:
> **Note**: For some processors, all parameters are optional, while other processors such as
> `ocrd-tesserocr-recognize` will not work without some parameter specifications.

+For information on the available processors and their respective parameters,
+see [getting more information about processors](#get-more-information-about-processors).
+
### Calling a single processor

-If you just want to call a single processor, e.g.
for testing purposes, you can go into your workspace and use the following command:
+If you just want to run a single processor, you can go into your workspace and use the following command:

```sh
-ocrd-{processor needed} -I {Input-Group} -O {Output-Group} [-p {parameter-file}] [-P {parameter} {value}]
+ocrd-{processor name} -I {Input-Group} -O {Output-Group} [-p {parameter-file}] [-P {parameter} {value}]
## alternatively, using Docker:
-docker run --rm -u $(id -u) -v $PWD:/data -- ocrd/all:maximum ocrd-{processor needed} -I {Input-Group} -O {Output-Group} [-p {parameter-file}] [-P {parameter} {value}]
+docker run --rm -u $(id -u) -v $PWD:/data -- ocrd/all:maximum ocrd-{processor name} -I {Input-Group} -O {Output-Group} [-p {parameter-file}] [-P {parameter} {value}]
```
For example, your processor call command could look like this:
```sh
ocrd-olena-binarize -I OCR-D-IMG -O OCR-D-BIN -P impl sauvola
@@ -278,7 +284,7 @@ It will also add information about this processing step in the METS metadata.

> **Note**: For processors using multiple input- or output fileGrps you have to use a comma-separated list.

-E.g.:
+For example:

```sh
ocrd-cor-asv-ann-align -I OCR-D-OCR1,OCR-D-OCR2,OCR-D-OCR3 -O OCR-D-OCR4
@@ -299,10 +305,17 @@ docker run --rm -u $(id -u) -v $PWD:/data -- ocrd/all:maximum ocrd-cor-asv-ann-a

### Calling several processors

+Running several processors one after another on the same data is called a **workflow**.
+For workflow processing, you need a workflow format and a workflow engine.
+
+In the simplest case, you just write a shell script which combines single processor
+calls in a command sequence joined by `&&`. The following paragraphs describe more
+advanced options.
+
#### ocrd process

If you quickly want to specify a particular workflow on the CLI, you can use
-`ocrd process`, which has a similar syntax as calling single processor CLIs.
+`ocrd process`, which has a syntax similar to calling single processor CLIs:

```sh
ocrd process \
@@ -336,17 +349,14 @@ in your workspace (i.e.
both as files on the filesystem and referenced in the `m It will also add information about this processing step in the METS metadata. The processors work on the files sequentially. So at first, all pages will be processed -with the first processor (e.g. binarized), then all pages will be processed +with the first processor (e.g. binarized), then (if successful) all pages will be processed by the second processor (e.g. segmented) etc. -So In the end your workspace should contain a directory (and fileGrp) with (intermediate) +So in the end, your workspace should contain a directory (and fileGrp) with (intermediate) processing results for each output fileGrp specified in the workflow. > **Note**: In contrast to calling a single processor, for `ocrd process` you leave -out the prefix `ocrd-` before the name of a particular processor. - -For information on the available processors see [section at the end](#get-more-information-about-processors). - +> out the prefix `ocrd-` before the name of a particular processor. #### ocrd-make @@ -423,7 +433,7 @@ look like this … ocrd-tesserocr-segment -I OCR-D-IMG -O OCR-D-SEG ``` -… to run it with the [`ocrd/all:maximum`] Docker container … +… to run it with the [`ocrd/all:maximum`](https://hub.docker.com/r/ocrd/all/tags) Docker container … ```sh docker run -u $(id -u) -v $PWD:/data -v ocrd-models:/models -- ocrd/all:maximum ocrd-tesserocr-segment -I OCR-D-IMG -O OCR-D-SEG @@ -459,7 +469,7 @@ and/or processors. For an overview on the existing processors, their tasks and f -### Get more Information about Processors +### Get more information about processors To get all available processors you might use the autocomplete in your preferred console. 
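If your shell does not offer tab completion, you can approximate the same listing by searching `$PATH` yourself. A hedged, portable sketch (it uses the prefix `sh` so that it runs on any system; on a machine with OCR-D installed, substitute `ocrd-` to list all processors):

```sh
# List all executables on $PATH whose name begins with $prefix,
# roughly what TAB completion would offer for that prefix:
prefix="sh"   # substitute "ocrd-" on a system with OCR-D installed
IFS=:
for dir in $PATH; do
  ls "$dir" 2>/dev/null
done | grep "^$prefix" | sort -u
```

In bash, the builtin `compgen -c ocrd-` produces a similar listing more directly.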
@@ -473,9 +483,9 @@ Type `ocrd-` followed by a tab character (for autocompletion proposals) to get a

To get further information about a particular processor, call it with `--help` or `-h`:

```sh
-{processor name} --help
+ocrd-{processor name} --help
## alternatively, using Docker:
-docker run --rm -u $(id -u) -v $PWD:/data -- ocrd/all:maximum {processor name} --help
+docker run --rm -u $(id -u) -v $PWD:/data -- ocrd/all:maximum ocrd-{processor name} --help
```

@@ -483,6 +493,6 @@ docker run --rm -u $(id -u) -v $PWD:/data -- ocrd/all:maximum {processor name} -

Several processors rely on models, which usually have to be downloaded beforehand.
An overview on the existing model repositories and short descriptions on the most important models
-can be found [in our Models Guide](https://ocr-d.de/en/models).
+can be found in our [Models Guide](https://ocr-d.de/en/models).
-We strongly recommend to use the [OCR-D resource manager](https://ocr-d.de/en/models) to download the models,
-as this makes it easy to both download and use them.
+We strongly recommend using the [OCR-D resource manager](https://ocr-d.de/en/models) to download the models,
+as this makes it easy to both download and use them.