very slow search+replace #680

Open
bruzzler5 opened this issue Aug 10, 2024 · 1 comment

Comments

@bruzzler5

I OCR German Fraktur newspapers. A PDF of 1-2 GB contains approx. 1000 pages. OCR runs overnight (i5-3570), which is OK. But the search+replace of 3 specific Fraktur chars (fraktur-s, fraktur-hyphen and fraktur-long hyphen) with the corresponding normal chars takes forever: after 4 hours I aborted the program.

It seems that the algorithm used isn't efficient enough for long files and could/should be improved. BTW, if I use a bypass (exporting the hOCR, replacing the 3 chars e.g. with a Perl script, and re-importing the corrected hOCR), the whole thing is done in minutes.

BTW: any chance of a PDF export option in which the image format remains unchanged? Any format change results in much larger files.

@bruzzler5
Author

I did some tests to work around the problem. It works like this: first export the hOCR file, then run a Perl script that does the search&replace in the hOCR file, and then re-import the resulting hOCR file. These steps needed only a very small fraction of the time taken by the existing search+replace routine, AND the exported PDF contained the replaced chars/strings!
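For illustration, here is a minimal sketch of the replace step in Python (the actual script was Perl and is not shown here). It streams the exported hOCR file line by line and applies all replacements in a single pass. The three code points in `REPLACEMENTS` are assumptions, since the characters are only named informally above, and `replace_chars` and `convert_file` are hypothetical helper names. It also assumes these characters occur only in the recognized text, not in the hOCR markup itself.

```python
# Assumed mapping: the exact Fraktur code points are guesses for illustration.
REPLACEMENTS = {
    "\u017f": "s",  # LATIN SMALL LETTER LONG S
    "\u2e17": "-",  # DOUBLE OBLIQUE HYPHEN
    "\u2014": "-",  # EM DASH, standing in for the "long hyphen"
}
TABLE = str.maketrans(REPLACEMENTS)


def replace_chars(src, dst, table=TABLE):
    """Copy text from src to dst line by line, replacing mapped characters.

    Returns the number of characters that were replaced.
    """
    replaced = 0
    for line in src:
        # Count occurrences of every mapped character before translating.
        replaced += sum(line.count(ch) for ch in REPLACEMENTS)
        dst.write(line.translate(table))
    return replaced


def convert_file(in_path, out_path):
    """Apply replace_chars to an exported hOCR file (assumed to be UTF-8)."""
    with open(in_path, encoding="utf-8") as src, \
         open(out_path, "w", encoding="utf-8") as dst:
        return replace_chars(src, dst)
```

Each line is read once and each character is looked up once, so the running time grows linearly with file size, which may explain why this route finishes in minutes even for about 1000 pages.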

I didn't check which algorithm is implemented in gImageReader. But, at least for huge numbers of images, IMHO it would be worth trying to implement an approach like the one described above.

BTW, this workaround shows that gImageReader is already able (obviously thanks to the PoDoFo library) to create readable PDFs from an hOCR file and images, a task at which many programs fail, e.g. hocr-tools and many others. I'd suggest promoting this feature after thorough testing.
