hocr syntax errs #4300

bruzzler5 · 2024-08-10T16:51:50Z

Current Behavior

I used tesseract 5.4.1 in WSL/Win10 and tesseract 5.0.1 in gImageReader/Win10 on different image files (a Fraktur newspaper and a Latin-script LibreOffice document with 2 columns, all in German) and had both tesseract versions create OCR PDF and hOCR output. The OCR PDF was fine: it was searchable and displayed in PDF viewers without errors.

After repeatedly failing to create a searchable OCR PDF with hocr-pdf from hocr-tools, I checked the syntax of the generated hOCR files with hocr-check and hocr-spec. Numerous syntax errors were reported, which would explain the failure of hocr-pdf. The resulting PDF showed only the image and contained no text layer (pdftotext produced empty files). The PDF viewer also warned that the PDF structure was corrupted (missing streams or streams ending prematurely).

Expected Behavior

Syntactically correct hOCR files that allow hocr-pdf to create searchable OCR PDFs.

Suggested Fix

no idea

tesseract -v

tesseract -v (instance in WSL)
tesseract 5.4.1
leptonica-1.82.0
libgif 5.1.9 : libjpeg 8d (libjpeg-turbo 2.1.1) : libpng 1.6.37 : libtiff 4.3.0 : zlib 1.2.11 : libwebp 1.2.2 : libopenjp2 2.4.0
Found AVX
Found SSE4.1
Found OpenMP 201511
Found libarchive 3.6.0 zlib/1.2.11 liblzma/5.2.5 bz2lib/1.0.8 liblz4/1.9.3 libzstd/1.4.8
Found libcurl/7.81.0 OpenSSL/3.0.2 zlib/1.2.11 brotli/1.0.9 zstd/1.4.8 libidn2/2.3.2 libpsl/0.21.0 (+libidn2/2.3.2) libssh/0.9.6/openssl/zlib nghttp2/1.43.0 librtmp/2.3 OpenLDAP/2.5.18

Operating System

Windows 10

Other Operating System

see above: 2 different tesseract versions on the same PC:
tesseract 5.4.1 in WSL/Win10 and tesseract 5.0.1 in GImagereader/Win10

uname -a

No response

Compiler

No response

CPU

i5-3570 @3.4 GHz - 8 GB Ram

Virtualization / Containers

WSL/Win10

Other Information

The hocr-check results lead me to suspect that the hOCR output from tesseract has syntax errors, which might be responsible for the failure of hocr-pdf.

Sample hOCR files can be provided.

stweil · 2024-08-10T18:20:43Z

I produced a hOCR file from hocr-tools/test/testdata/alice_1.png and checked it with hocr-check. No errors were reported.

Please provide more details on how the reported issue can be reproduced, ideally the image and the command line that produced the hOCR file with syntax errors.

stweil · 2024-08-11T07:38:06Z

While hocr-check does not complain, hocr-spec reports several errors which occur rather often. Most of them are already discussed in issue #3303: x_size, x_ascender and x_descender are not part of the standard, but extensions introduced by Tesseract. In addition, Tesseract does not add all supported and used capabilities.
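For readers who want to see the capabilities gap in a given file, here is a minimal sketch, not part of hocr-tools or Tesseract. It compares the names declared in the `ocr-capabilities` meta tag with what the document actually uses. The class name `CapabilityScan`, the helper `undeclared_capabilities`, and the sample markup are hypothetical. Treating a `lang` or `dir` attribute on an OCR element as use of `ocrp_lang` or `ocrp_dir` is an assumption made for illustration.

```python
from html.parser import HTMLParser


class CapabilityScan(HTMLParser):
    """Collect declared hOCR capabilities and the ones the markup appears to use."""

    def __init__(self):
        super().__init__()
        self.declared = set()
        self.used = set()

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        # The declaration lives in <meta name="ocr-capabilities" content="...">.
        if tag == "meta" and attrs.get("name") == "ocr-capabilities":
            self.declared.update((attrs.get("content") or "").split())
        # Every ocr_* / ocrx_* class on an element counts as a used capability.
        ocr_classes = [
            c for c in (attrs.get("class") or "").split()
            if c.startswith(("ocr_", "ocrx_"))
        ]
        self.used.update(ocr_classes)
        # Assumption: lang= / dir= on an OCR element implies ocrp_lang / ocrp_dir.
        if ocr_classes:
            if attrs.get("lang"):
                self.used.add("ocrp_lang")
            if attrs.get("dir"):
                self.used.add("ocrp_dir")


def undeclared_capabilities(hocr_text):
    """Return capabilities that are used in the markup but not declared."""
    scan = CapabilityScan()
    scan.feed(hocr_text)
    return scan.used - scan.declared


# Hypothetical sample, not real Tesseract output.
SAMPLE = """<html><head>
<meta name='ocr-capabilities' content='ocr_page ocr_carea ocr_par ocr_line ocrx_word'/>
</head><body>
<div class='ocr_page' title='bbox 0 0 100 100'>
<p class='ocr_par' lang='deu' dir='ltr'>
<span class='ocr_line'><span class='ocrx_word'>Hallo</span></span>
</p></div></body></html>"""

print(sorted(undeclared_capabilities(SAMPLE)))
```

Declared but unused names (here `ocr_carea`) are deliberately not reported, since declaring a capability without using it on a particular page is harmless.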

stweil · 2024-08-12T07:44:50Z

Meanwhile the capabilities issue was fixed in PR #4301.

The remaining errors which hocr-spec reports for x_size, x_ascender and x_descender are not hOCR syntax errors, because the standard allows such implementation-specific parameters. The complaints therefore stem from an incomplete implementation in hocr-spec, which should not report them.
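To find out which properties in a file fall into this implementation-specific group, the `title` attribute of an hOCR element can be split into its properties. The following is a rough sketch under two assumptions: that extension properties can be recognized by an `x_` prefix, as with the three names discussed here, and that no value contains a semicolon. The function names and the sample string are hypothetical, and the property names are spelled as in this thread.

```python
def parse_hocr_title(title):
    """Split an hOCR title attribute into {property_name: raw_value}.

    Simplification: assumes values contain no ';' (quoted file names could).
    """
    props = {}
    for chunk in title.split(";"):
        chunk = chunk.strip()
        if not chunk:
            continue
        # The first token is the property name, the rest is its value.
        name, _, value = chunk.partition(" ")
        props[name] = value.strip()
    return props


def extension_properties(props):
    """Return only the properties carrying the x_ prefix (assumed extensions)."""
    return {k: v for k, v in props.items() if k.startswith("x_")}


# Hypothetical sample line; values are made up for illustration.
sample_title = "bbox 36 92 580 122; baseline 0 -6; x_size 30; x_descender 6; x_ascender 6"

props = parse_hocr_title(sample_title)
print(sorted(extension_properties(props)))
```

A validator built along these lines could list such properties as informational notes rather than errors, which matches the reading of the standard given above.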

bruzzler5 changed the title ~~hocr syntac errs~~ hocr syntax errs Aug 10, 2024

This was referenced Aug 11, 2024

Set hOCR capabilities ocrp_dir and ocrp_lang unconditionally #4301

Merged

hOCR renderer writes "x_size" (instead of "x_fsize") property to ocr_line/ocr_header/... #3303

Open
