workflows.md, Step 7 #268

jbarth-ubhd · 2021-11-04T10:29:18Z

Examples:
* To segment existing regions into lines (and only lines) only: 
    `segmentation_level="line"`, `textequiv_level="line"`, `model=""`
* To segment existing regions into lines (and only lines) and recognize text:
    `segmentation_level="line"`, `textequiv_level="line"`, `model="Fraktur"`
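
For reference, the two parameter sets above could be passed on the command line roughly as follows. This is only a sketch: it assumes the processor in question is `ocrd-tesserocr-recognize` and that the usual OCR-D options `-I`, `-O` and `-P` are used; the file group names and the helper `build_command` are hypothetical and not taken from workflows.md.

```python
import shlex

def build_command(params, input_grp="OCR-D-SEG-REGION", output_grp="OCR-D-SEG-LINE"):
    """Build an argv list for the processor (file group names are placeholders)."""
    argv = ["ocrd-tesserocr-recognize", "-I", input_grp, "-O", output_grp]
    for key, value in params.items():
        # each -P overrides a single parameter as KEY VALUE
        argv += ["-P", key, value]
    return argv

# regions -> lines, no text recognition (empty model)
segment_only = build_command(
    {"segmentation_level": "line", "textequiv_level": "line", "model": ""}
)
# regions -> lines, plus text recognition with the Fraktur model
segment_and_recognize = build_command(
    {"segmentation_level": "line", "textequiv_level": "line", "model": "Fraktur"}
)

print(shlex.join(segment_only))
print(shlex.join(segment_and_recognize))
```

The commands are only printed here, not executed, so the snippet runs without an OCR-D installation.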

I'm missing the word region in the parameters (regions→lines)


jbarth-ubhd · 2021-11-04T10:38:20Z

BTW only ocrd-tesserocr-segment and ocrd-tesserocr-segment-region are recommended within step 7 ... really? I do remember that ocrd-pc-segmentation's performance was the worst and ocrd-eynollah-segment the best (but slow)

— what about giving grades & est. processing times & memory requirements to processors?

jbarth-ubhd · 2021-11-04T11:02:03Z

PS: ocrd-tesserocr-segment* (recommended) are not in the »Best results for selected pages« workflow. (see below)

jbarth-ubhd · 2021-11-04T11:05:52Z

I would remove the sentence »Alternatively, consider using the all-in-one capabilities of ocrd-tesserocr-segment and ocrd-tesserocr-recognize, which can do region segmentation and line segmentation (and optionally also text recognition) in one step by querying Tesseract’s internal iterator (accessing the more precise polygon outlines instead of just coarse bounding boxes with lots of hard-to-recover overlap).«

— ocrd-tesserocr-segment and ocrd-tesserocr-recognize are mentioned in the note above.

jbarth-ubhd · 2021-11-04T11:11:43Z

Best results for selected pages — workflow

* cis-ocropy-binarize is _not_ recommended(?)
* skimage-binarize is _not_ recommended(?)
* tesserocr-deskew is _not_ recommended(?)
* cis-ocropy-segment is _not_ recommended(?)

bertsky · 2021-11-04T12:02:39Z

> I'm missing the word region in the parameters (regions→lines)

Why? You'd only need that for region segmentation (page→regions). The two paragraphs above the one you quoted clearly explain that.

> BTW only ocrd-tesserocr-segment and ocrd-tesserocr-segment-region are recommended within step 7 ... really? I do remember that ocrd-pc-segmentation's performance was the worst and ocrd-eynollah-segment the best (but slow)

I agree – this information does not reflect the new or changed processors from the last 2 years. (I believe ocrd-tesserocr-segment-region started out as the only recommendation, then ocrd-tesserocr-segment was added when it became available.) But I would not recommend the former anymore, and rather recommend ocrd-eynollah-segment and ocrd-cis-ocropy-segment now.

See also #172

> — what about giving grades & est. processing times & memory requirements to processors?

Grades are too simplistic for the diversity of materials (from simple single-column books to multi-column ornamented/illustrated pages and title pages) and problems (region types, region shape complexity, region recursion, reading order, line segmentation in warped/straight imaging, in dense/floating typesetting, in tables).

Processing times and memory requirements, too, may depend on the image resolution and content. But indeed, we should try to provide some guesstimate or experience.

See also OCR-D/ocrd_all#112 and OCR-D/assets#75 (and OCR-D/core#607)

> I would remove the sentence »Alternatively, consider using the all-in-one capabilities of ocrd-tesserocr-segment and ocrd-tesserocr-recognize, which can do region segmentation and line segmentation (and optionally also text recognition) in one step by querying Tesseract’s internal iterator (accessing the more precise polygon outlines instead of just coarse bounding boxes with lots of hard-to-recover overlap).«
>
> — ocrd-tesserocr-segment and ocrd-tesserocr-recognize are mentioned in the note above.

That sentence is part of the paragraph which explains the need for postprocessing when not using all-in-one segmentation or shrink_polygons with Tesseract – so it is necessary there. (No one without minute knowledge of Tesseract internals would understand that dependency.)

> Best results for selected pages — workflow
>
> * cis-ocropy-binarize is _not_ recommended(?)
> * skimage-binarize is _not_ recommended(?)
> * tesserocr-deskew is _not_ recommended(?)
> * cis-ocropy-segment is _not_ recommended(?)

See #172.

jbarth-ubhd · 2021-11-05T10:29:51Z

> Why? You'd only need that for region segmentation (page→regions). The two paragraphs above the one you quoted clearly explain that.

> * `segmentation_level` determines the *highest level* to segment.
>    Use `"none"` to disable segmentation altogether, i.e. only recognize existing segments.
> * `textequiv_level` determines the *lowest level* to segment.
>    Use `"none"` to segment until the lowest level (`"glyph"`) and disable recognition altogether, only analyse layout.

highest level = something like region and lowest level = something like glyph?

and to segment = to be segmented or to be the result of segmentation?

and none to segment ... disable recognition altogether — recognition of layout or recognition of text? And why only analyse layout — this step is about Region segmentation

Sorry, I'm confused.

bertsky · 2021-11-05T10:51:47Z

> highest level = something like region and lowest level = something like glyph?

yes

> and to segment = to be segmented or to be the result of segmentation?

the latter

> and none to segment ... disable recognition altogether — recognition of layout or recognition of text?

in this paragraph (as in all of our documentation), recognition contrasts with segmentation (and preprocessing and postprocessing), so the latter

> And why only analyse layout — this step is about Region segmentation

because this paragraph describes a multi-step processor that can include (text) recognition
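
To make the semantics discussed above concrete, here is a minimal Python sketch of how the two level parameters could be read. It is not the ocrd_tesserocr implementation: the function name `plan_levels` is hypothetical, the hierarchy is simplified to region, line, word and glyph, and the rule that an empty `model` disables recognition is inferred from the first example quoted at the top of this issue.

```python
# Simplified segment hierarchy, coarsest first (assumption: other levels omitted).
LEVELS = ["region", "line", "word", "glyph"]

def plan_levels(segmentation_level, textequiv_level, model=""):
    """Return (levels produced by segmentation, whether text is recognized)."""
    if segmentation_level == "none":
        # no segmentation at all: only existing segments are used
        produced = []
    else:
        # highest (coarsest) level that segmentation will produce
        top = LEVELS.index(segmentation_level)
        # "none" means: segment all the way down to glyphs
        lowest = "glyph" if textequiv_level == "none" else textequiv_level
        bottom = LEVELS.index(lowest)
        produced = LEVELS[top:bottom + 1]
    # text recognition needs both a target level and a model
    recognize = textequiv_level != "none" and model != ""
    return produced, recognize

print(plan_levels("line", "line", ""))         # (['line'], False)
print(plan_levels("line", "line", "Fraktur"))  # (['line'], True)
print(plan_levels("region", "none"))           # (['region', 'line', 'word', 'glyph'], False)
print(plan_levels("none", "line", "Fraktur"))  # ([], True)
```

Read this way, "region" only has to appear in `segmentation_level` when regions themselves are to be produced (page→regions); for regions→lines, the highest level produced is already "line".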
