workflows.md, Step 7 #268

Open
jbarth-ubhd opened this issue Nov 4, 2021 · 7 comments

jbarth-ubhd commented Nov 4, 2021

> Examples:
> * To segment existing regions into lines (and only lines) only:
>     `segmentation_level="line"`, `textequiv_level="line"`, `model=""`
> * To segment existing regions into lines (and only lines) and recognize text:
>     `segmentation_level="line"`, `textequiv_level="line"`, `model="Fraktur"`

I'm missing the word region in the parameters (regions→lines)

jbarth-ubhd commented Nov 4, 2021

BTW only ocrd-tesserocr-segment and ocrd-tesserocr-segment-region are recommended within step 7 ... really? I do remember that ocrd-pc-segmentation's performance was the worst and ocrd-eynollah-segment the best (but slow)

— what about giving grades & est. processing times & memory requirements to processors?

jbarth-ubhd commented Nov 4, 2021

PS: ocrd-tesserocr-segment* (recommended) are not in the »Best results for selected pages« workflow. (see below)

jbarth-ubhd commented

I would remove the sentence »Alternatively, consider using the all-in-one capabilities of ocrd-tesserocr-segment and ocrd-tesserocr-recognize, which can do region segmentation and line segmentation (and optionally also text recognition) in one step by querying Tesseract’s internal iterator (accessing the more precise polygon outlines instead of just coarse bounding boxes with lots of hard-to-recover overlap).«

ocrd-tesserocr-segment and ocrd-tesserocr-recognize are mentioned in the note above.

jbarth-ubhd commented Nov 4, 2021

Best results for selected pages — workflow

  • cis-ocropy-binarize is not recommended(?)
  • skimage-binarize is not recommended(?)
  • tesserocr-deskew is not recommended(?)
  • cis-ocropy-segment is not recommended(?)

bertsky commented Nov 4, 2021

> I'm missing the word region in the parameters (regions→lines)

Why? You'd only need that for region segmentation (page→regions). The two paragraphs above the one you quoted clearly explain that.

> BTW only ocrd-tesserocr-segment and ocrd-tesserocr-segment-region are recommended within step 7 ... really? I do remember that ocrd-pc-segmentation's performance was the worst and ocrd-eynollah-segment the best (but slow)

I agree: this information does not reflect the new or changed processors from the last 2 years. (I believe ocrd-tesserocr-segment-region started out as the only recommendation, then ocrd-tesserocr-segment was added when it became available.) But I would not recommend the former anymore, and would rather recommend ocrd-eynollah-segment and ocrd-cis-ocropy-segment now.

See also #172

> what about giving grades & est. processing times & memory requirements to processors?

Grades are too simplistic for the diversity of materials (from simple single-column books to multi-column ornamented/illustrated pages and title pages) and problems (region types, region shape complexity, region recursion, reading order, line segmentation in warped/straight imaging, in dense/floating typesetting, in tables).

Processing times and memory requirements, too, may depend on the image resolution and content. But indeed, we should try to provide some guesstimate or experience.

See also OCR-D/ocrd_all#112 and OCR-D/assets#75 (and OCR-D/core#607)

> I would remove the sentence »Alternatively, consider using the all-in-one capabilities of ocrd-tesserocr-segment and ocrd-tesserocr-recognize, which can do region segmentation and line segmentation (and optionally also text recognition) in one step by querying Tesseract’s internal iterator (accessing the more precise polygon outlines instead of just coarse bounding boxes with lots of hard-to-recover overlap).«
>
> ocrd-tesserocr-segment and ocrd-tesserocr-recognize are mentioned in the note above.

That sentence is part of the paragraph which explains the need for postprocessing when not using all-in-one segmentation or shrink_polygons with Tesseract – so it is necessary there. (No one without minute knowledge of Tesseract internals would understand that dependency.)

> Best results for selected pages: workflow
>
> * cis-ocropy-binarize is _not_ recommended(?)
> * skimage-binarize is _not_ recommended(?)
> * tesserocr-deskew is _not_ recommended(?)
> * cis-ocropy-segment is _not_ recommended(?)

#172

jbarth-ubhd commented

> Why? You'd only need that for region segmentation (page→regions). The two paragraphs above the one you quoted clearly explain that.

> * `segmentation_level` determines the *highest level* to segment.
>    Use `"none"` to disable segmentation altogether, i.e. only recognize existing segments.
> * `textequiv_level` determines the *lowest level* to segment.
>    Use `"none"` to segment until the lowest level (`"glyph"`) and disable recognition altogether, only analyse layout.

highest level = something like region and lowest level = something like glyph?

and to segment = to be segmented or to be the result of segmentation?

and `"none"` to segment ... "disable recognition altogether": recognition of layout or recognition of text? And why "only analyse layout"? This step is about region segmentation.

Sorry, I'm confused.

bertsky commented Nov 5, 2021

> highest level = something like region and lowest level = something like glyph?

yes

> and to segment = to be segmented or to be the result of segmentation?

the latter

> and `"none"` to segment ... "disable recognition altogether": recognition of layout or recognition of text?

in this paragraph (as in all of our documentation), recognition contrasts with segmentation (and preprocessing and postprocessing), so the latter

> And why "only analyse layout"? This step is about region segmentation.

because this paragraph describes a multi-step processor that can include (text) recognition
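
As a compact summary of the parameter semantics clarified in this thread, here is a minimal Python sketch. It is only an illustration of how the two level parameters and `model` appear to interact according to the quoted documentation and the answers above; it is not the ocrd-tesserocr implementation. The names `LEVELS` and `plan_levels` are hypothetical, and the hierarchy is assumed to be region, line, word, glyph (any other levels the processor may support are left out).

```python
# Illustrative sketch only: NOT the ocrd-tesserocr implementation.
# LEVELS and plan_levels() are hypothetical names introduced for this example.

LEVELS = ["region", "line", "word", "glyph"]  # assumed order, highest to lowest


def plan_levels(segmentation_level, textequiv_level, model):
    """Return (levels_to_segment, recognize_text) for one parameter combination."""
    # textequiv_level="none": segment down to the lowest level, no recognition.
    lowest = "glyph" if textequiv_level == "none" else textequiv_level
    if segmentation_level == "none":
        # Segmentation disabled: only existing segments are used.
        to_segment = []
    else:
        start = LEVELS.index(segmentation_level)
        stop = LEVELS.index(lowest)
        # The slice is empty when segmentation_level lies below the lowest level.
        to_segment = LEVELS[start:stop + 1]
    # Text recognition needs both a target level and a non-empty model.
    recognize = textequiv_level != "none" and model != ""
    return to_segment, recognize


print(plan_levels("line", "line", ""))         # existing regions -> lines only
print(plan_levels("line", "line", "Fraktur"))  # lines, plus text recognition
print(plan_levels("region", "none", ""))       # layout analysis down to glyphs
print(plan_levels("none", "line", "Fraktur"))  # recognize existing lines only
```

Read this way, the word "region" only needs to appear in `segmentation_level` when regions themselves are to be produced (page→regions); for regions→lines, the result of segmentation is lines, so both parameters are `"line"`.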
