
Workflow Guide region segmentation

Konstantin Baierer edited this page Feb 3, 2022 · 14 revisions

In this processing step, an (optimized) document image is taken as input and segmented into its various regions, including columns. Segments are also classified, either coarsely (text, separator, image, table, ...) or in a fine-grained manner (paragraph, marginalia, heading, ...).

Note: The ocrd-tesserocr-segment, ocrd-tesserocr-recognize, ocrd-eynollah-segment, ocrd-sbb-textline-detector and ocrd-cis-ocropy-segment processors segment not only the page into regions, but also the detected text regions into text lines, all in one step. With those (and only those!) processors you therefore don't need a separate line segmentation step and can continue with step 13 - line-level dewarping.

Note: If you use ocrd-tesserocr-segment-region, which uses only bounding boxes instead of polygon coordinates, then you should post-process via ocrd-segment-repair with plausibilize=True to obtain better results without large overlaps. Alternatively, consider using the all-in-one capabilities of ocrd-tesserocr-segment and ocrd-tesserocr-recognize, which can do region segmentation and line segmentation (and optionally also text recognition) in one step by querying Tesseract's internal iterator (accessing the more precise polygon outlines instead of just coarse bounding boxes with lots of hard-to-recover overlap). Alternatively, run with shrink_polygons=True (accessing that same iterator to calculate convex hull polygons).
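The two-step route described in this note (bounding-box regions first, then repair) might look like the following minimal sketch. The file group names are taken from the processor table on this page; the `run` helper, the `DRY_RUN` variable and the `segment_and_repair` function are hypothetical conveniences, not part of OCR-D. By default the sketch only prints the commands, so it can be tried without a workspace.

```shell
#!/bin/sh
# Hypothetical helper: with DRY_RUN=1 (the default in this sketch) the command
# is only printed; with DRY_RUN=0 it is executed.
DRY_RUN="${DRY_RUN:-1}"
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

segment_and_repair() {
    # Step 1: region segmentation (bounding boxes only).
    run ocrd-tesserocr-segment-region -I OCR-D-DEWARP-PAGE -O OCR-D-SEG-REG -P find_tables false || return 1
    # Step 2: repair the result to reduce large overlaps between regions.
    run ocrd-segment-repair -I OCR-D-SEG-REG -O OCR-D-SEG-REPAIR -P plausibilize true
}

segment_and_repair
```

Setting `DRY_RUN=0` inside an OCR-D workspace would run both processors in sequence and stop if the first one fails.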

Note: All the ocrd-tesserocr-segment* processors internally delegate to ocrd-tesserocr-recognize, so you can replace calls to these task-specific processors with calls to ocrd-tesserocr-recognize with specific parameters:

| Processor call | ocrd-tesserocr-recognize parameters |
| --- | --- |
| ocrd-tesserocr-segment-region -P overwrite_regions true | ocrd-tesserocr-recognize -P textequiv_level region -P segmentation_level region -P overwrite_segments true |
| ocrd-tesserocr-segment-table -P overwrite_cells true | ocrd-tesserocr-recognize -P textequiv_level cell -P segmentation_level cell -P overwrite_segments true |
| ocrd-tesserocr-segment-line -P overwrite_lines true | ocrd-tesserocr-recognize -P textequiv_level line -P segmentation_level line -P overwrite_segments true |
| ocrd-tesserocr-segment-word -P overwrite_words true | ocrd-tesserocr-recognize -P textequiv_level word -P segmentation_level word -P overwrite_segments true |
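The correspondence listed above can be expressed as a small lookup, which may help when rewriting existing workflows. This is only a sketch: the function name `recognize_params_for` is hypothetical, and the file groups in the example call are assumptions chosen for illustration.

```shell
#!/bin/sh
# Hypothetical helper: print the ocrd-tesserocr-recognize parameters that
# correspond to a task-specific processor, following the mapping above.
recognize_params_for() {
    case "$1" in
        ocrd-tesserocr-segment-region) level=region ;;
        ocrd-tesserocr-segment-table)  level=cell ;;
        ocrd-tesserocr-segment-line)   level=line ;;
        ocrd-tesserocr-segment-word)   level=word ;;
        *) echo "no known equivalent for: $1" >&2; return 1 ;;
    esac
    echo "-P textequiv_level $level -P segmentation_level $level -P overwrite_segments true"
}

# Example: print a replacement call for the line segmenter
# (the file group names are assumptions).
echo "ocrd-tesserocr-recognize -I OCR-D-SEG-REG -O OCR-D-SEG-LINE $(recognize_params_for ocrd-tesserocr-segment-line)"
```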

Note: The three parameters segmentation_level, textequiv_level and model define the behavior of ocrd-tesserocr-recognize:

  • segmentation_level determines the highest level to segment. Use "none" to disable segmentation altogether, i.e. only recognize existing segments.
  • textequiv_level determines the lowest level to segment. Use "none" to segment until the lowest level ("glyph") and disable recognition altogether, only analyse layout.
  • model determines the model to use for text recognition. Use "" or do not set at all to disable recognition, i.e. only analyse layout.

Examples:

  • To segment existing regions into lines (and only lines) without recognizing text: segmentation_level="line", textequiv_level="line", model=""
  • To segment existing regions into lines (and only lines) and recognize text: segmentation_level="line", textequiv_level="line", model="Fraktur"
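On the command line, these two examples could be written roughly as follows. This is a sketch under assumptions: the helper `line_level_call` is hypothetical, as are the file group names OCR-D-SEG-REG, OCR-D-SEG-LINE and OCR-D-OCR. The helper only prints the call, and it leaves the model parameter unset instead of passing an empty string, which the note above describes as equivalent.

```shell
#!/bin/sh
# Hypothetical helper: print an ocrd-tesserocr-recognize call that segments
# existing regions into lines. Usage: line_level_call OUTPUT_GROUP [MODEL]
line_level_call() {
    output_group="$1"
    model="${2:-}"
    cmd="ocrd-tesserocr-recognize -I OCR-D-SEG-REG -O $output_group"
    cmd="$cmd -P segmentation_level line -P textequiv_level line"
    # An unset model disables recognition, leaving layout analysis only.
    if [ -n "$model" ]; then
        cmd="$cmd -P model $model"
    fi
    echo "$cmd"
}

line_level_call OCR-D-SEG-LINE        # segment lines only
line_level_call OCR-D-OCR Fraktur     # segment lines and recognize text
```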

For detailed descriptions of behaviour and options, see tesserocr's README and the output of ocrd-tesserocr-recognize/segment/segment-region/segment-table/segment-line/segment-word --help.

   

Available processors

| Processor | Parameter | Remarks | Call |
| --- | --- | --- | --- |
| ocrd-tesserocr-segment | -P find_tables false -P shrink_polygons true | Recommended. Reuses Tesseract's internal iterators to produce a complete segmentation with tight polygons instead of bounding boxes where possible. | ocrd-tesserocr-segment -I OCR-D-DEWARP-PAGE -O OCR-D-SEG -P find_tables false -P shrink_polygons true |
| ocrd-eynollah-segment | -P models | Models can be found here or downloaded with the OCR-D resource manager. If you didn't download the model with resmgr, you need to pass the absolute path on your hard drive as the parameter value. | ocrd-eynollah-segment -I OCR-D-IMG -O OCR-D-SEG -P models default |
| ocrd-sbb-textline-detector | -P model modelname | Models can be found here or downloaded with the OCR-D resource manager. If you didn't download the model with resmgr, you need to pass the local filesystem path as the parameter value. | ocrd-sbb-textline-detector -I OCR-D-DEWARP-PAGE -O OCR-D-SEG-LINE -P model /path/to/model |
| ocrd-cis-ocropy-segment | -P level-of-operation page | | ocrd-cis-ocropy-segment -I OCR-D-DEWARP-PAGE -O OCR-D-SEG-LINE -P level-of-operation page |
| ocrd-tesserocr-segment-region | -P find_tables false | Recommended | ocrd-tesserocr-segment-region -I OCR-D-DEWARP-PAGE -O OCR-D-SEG-REG -P find_tables false -P shrink_polygons true |
| ocrd-segment-repair | -P plausibilize true | Only to be used after ocrd-tesserocr-segment-region | ocrd-segment-repair -I OCR-D-SEG-REG -O OCR-D-SEG-REPAIR -P plausibilize true |
| ocrd-anybaseocr-block-segmentation | -P block_segmentation_model mrcnn_name -P block_segmentation_weights /path/to/model/block_segmentation_weights.h5 | For available models, take a look at this site or download them via the OCR-D resource manager. If you didn't use resmgr, you need to pass the local filesystem path as the parameter value. | ocrd-anybaseocr-block-segmentation -I OCR-D-DEWARP-PAGE -O OCR-D-SEG-REG -P block_segmentation_model mrcnn_name -P block_segmentation_weights /path/to/model/block_segmentation_weights.h5 |
| ocrd-pc-segmentation | | | ocrd-pc-segmentation -I OCR-D-DEWARP-PAGE -O OCR-D-SEG-REG |
| ocrd-detectron2-segment | | Any model for Detectron2 forks trained on document layout analysis datasets can be integrated; instructions and examples can be found here. | |

Notes on parameter usage

E.g.

  • which parameters do you use with what values?
  • which parameters are insufficiently documented?
  • which aspects of a processor should be parameterizable but are not?

Notes on document-specific usage

E.g. which processors worked best with what material? -- feel free to post sample images here, too.

ocrd-tesserocr-segment-region tends to produce floating_regions on non-standard layouts such as lists, e.g. those found in newspapers. It also struggles with multi-column texts like http://tudigit.ulb.tu-darmstadt.de/show/Gue-11660-24

ocrd-sbb-textline-detector performs no segmentation into headings, paragraphs and regions, but is quite good at finding text regions.

ocrd-tesserocr-segment-region with -P find_tables true subsequently needs a separate segmentation step for the table regions using ocrd-tesserocr-segment-table (see https://github.com/OCR-D/ocrd_tesserocr/issues/134 and https://github.com/OCR-D/ocrd_all/issues/190 for details) or ocrd-cis-ocropy-segment with -P level-of-operation table.
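The sequence for table regions described above could be sketched as follows. The helper `table_followup` and the output file group OCR-D-SEG-TAB are hypothetical names used only for illustration; the sketch prints the calls instead of running them.

```shell
#!/bin/sh
# Hypothetical helper: print the follow-up call for table regions.
# Usage: table_followup tesserocr|ocropy
table_followup() {
    case "$1" in
        tesserocr) echo "ocrd-tesserocr-segment-table -I OCR-D-SEG-REG -O OCR-D-SEG-TAB" ;;
        ocropy)    echo "ocrd-cis-ocropy-segment -I OCR-D-SEG-REG -O OCR-D-SEG-TAB -P level-of-operation table" ;;
        *)         echo "usage: table_followup tesserocr|ocropy" >&2; return 1 ;;
    esac
}

# Region segmentation with table detection enabled ...
echo "ocrd-tesserocr-segment-region -I OCR-D-DEWARP-PAGE -O OCR-D-SEG-REG -P find_tables true"
# ... followed by one of the two table segmentation options.
table_followup tesserocr
```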
