-
This is an issue with the underlying OCR library not supporting this particular file. Regarding the fillable form: PDF is a pretty wild format, and some hidden elements in that document might look like forms to the library.
Please open the logs and set the filter to debug. There should be a related line that says "Calling OCRmyPDF with ..." or similar. Please post that.
This is also part of #246. I'm figuring out a better workflow to support more file types.
Edit: If you want to help me figure out #246 a little more, set
Would you rather have paperless:
-
Hi Jonas,
Thank you for providing paperless-ng! It is an awesome project. However, the consumer of my paperless-ng installation (1.0, Docker, Debian 10) fails to OCR a scanned PDF.
System:
Operating System: Debian GNU/Linux 10 (buster)
Kernel: Linux 4.19.0-13-amd64
Architecture: x86-64
Docker:
Docker version 20.10.2, build 2291f61
Docker-compose:
docker-compose version 1.27.4, build 40524192
Steps to reproduce the problem:
Expected behavior:
Successful OCR of the document
Error log of failed task:
`: Traceback (most recent call last):
  File "/usr/src/paperless/src/paperless_tesseract/parsers.py", line 176, in parse
    ocrmypdf.ocr(**ocr_args)
  File "/usr/local/lib/python3.7/site-packages/ocrmypdf/api.py", line 326, in ocr
    return run_pipeline(options=options, plugin_manager=plugin_manager, api=True)
  File "/usr/local/lib/python3.7/site-packages/ocrmypdf/_sync.py", line 368, in run_pipeline
    validate_pdfinfo_options(context)
  File "/usr/local/lib/python3.7/site-packages/ocrmypdf/_pipeline.py", line 193, in validate_pdfinfo_options
    raise InputFileError()
ocrmypdf.exceptions.InputFileError
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/usr/src/paperless/src/documents/consumer.py", line 179, in try_consume_file
    document_parser.parse(self.path, mime_type, self.filename)
  File "/usr/src/paperless/src/paperless_tesseract/parsers.py", line 193, in parse
    raise ParseError(e)
documents.parsers.ParseError
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/django_q/cluster.py", line 436, in worker
    res = f(*task["args"], **task["kwargs"])
  File "/usr/src/paperless/src/documents/tasks.py", line 73, in consume_file
    override_tag_ids=override_tag_ids)
  File "/usr/src/paperless/src/documents/consumer.py", line 196, in try_consume_file
    raise ConsumerError(e)
documents.consumer.ConsumerError`
Error log of webserver when running docker-compose up:
webserver_1 | ERROR 2021-01-28 22:58:34,107 _pipeline This PDF has a user fillable form. --redo-ocr is not currently possible on such files.
webserver_1 | ERROR 2021-01-28 22:58:34,114 loggers Error while consuming document Scan_1_28012021_003254.pdf:
webserver_1 | 22:58:34 [Q] ERROR Failed [Scan_1_28012021_003254.pdf] - : Traceback (most recent call last):
webserver_1 |   File "/usr/src/paperless/src/paperless_tesseract/parsers.py", line 176, in parse
webserver_1 |     ocrmypdf.ocr(**ocr_args)
webserver_1 |   File "/usr/local/lib/python3.7/site-packages/ocrmypdf/api.py", line 326, in ocr
webserver_1 |     return run_pipeline(options=options, plugin_manager=plugin_manager, api=True)
webserver_1 |   File "/usr/local/lib/python3.7/site-packages/ocrmypdf/_sync.py", line 368, in run_pipeline
webserver_1 |     validate_pdfinfo_options(context)
webserver_1 |   File "/usr/local/lib/python3.7/site-packages/ocrmypdf/_pipeline.py", line 193, in validate_pdfinfo_options
webserver_1 |     raise InputFileError()
webserver_1 | ocrmypdf.exceptions.InputFileError
webserver_1 |
webserver_1 | During handling of the above exception, another exception occurred:
webserver_1 |
webserver_1 | Traceback (most recent call last):
webserver_1 |   File "/usr/src/paperless/src/documents/consumer.py", line 179, in try_consume_file
webserver_1 |     document_parser.parse(self.path, mime_type, self.filename)
webserver_1 |   File "/usr/src/paperless/src/paperless_tesseract/parsers.py", line 193, in parse
webserver_1 |     raise ParseError(e)
webserver_1 | documents.parsers.ParseError
webserver_1 |
webserver_1 | During handling of the above exception, another exception occurred:
webserver_1 |
webserver_1 | Traceback (most recent call last):
webserver_1 |   File "/usr/local/lib/python3.7/site-packages/django_q/cluster.py", line 436, in worker
webserver_1 |     res = f(*task["args"], **task["kwargs"])
webserver_1 |   File "/usr/src/paperless/src/documents/tasks.py", line 73, in consume_file
webserver_1 |     override_tag_ids=override_tag_ids)
webserver_1 |   File "/usr/src/paperless/src/documents/consumer.py", line 196, in try_consume_file
webserver_1 |     raise ConsumerError(e)
webserver_1 | documents.consumer.ConsumerError
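The one informative message in this output is easy to miss among the three stacked tracebacks. Below is a rough sketch of a filter that keeps only the logger records. It assumes the `service | LEVEL date time module message` layout seen in the first two lines of the output; the regex and the name `error_records` are illustrative and not part of paperless-ng or docker-compose.

```python
import re
from typing import Iterable, List, Tuple

# Assumed layout, inferred from the webserver output in this post:
#   <service> | <LEVEL> <YYYY-MM-DD HH:MM:SS,mmm> <module> <message>
RECORD = re.compile(
    r"^\S+\s*\|\s*"
    r"(?P<level>[A-Z]+) "
    r"\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3} "
    r"(?P<module>\S+) "
    r"(?P<message>.*)$"
)


def error_records(lines: Iterable[str], level: str = "ERROR") -> List[Tuple[str, str]]:
    """Return (module, message) for every logger record at the given level."""
    found = []
    for line in lines:
        match = RECORD.match(line.rstrip("\n"))
        if match and match.group("level") == level:
            found.append((match.group("module"), match.group("message")))
    return found


# Three lines copied from the output in this post; the traceback line is skipped.
SAMPLE = [
    "webserver_1 | ERROR 2021-01-28 22:58:34,107 _pipeline This PDF has a user "
    "fillable form. --redo-ocr is not currently possible on such files.",
    "webserver_1 | ERROR 2021-01-28 22:58:34,114 loggers Error while consuming "
    "document Scan_1_28012021_003254.pdf:",
    'webserver_1 |   File "/usr/src/paperless/src/documents/consumer.py", '
    "line 196, in try_consume_file",
]

for module, message in error_records(SAMPLE):
    print(f"{module}: {message}")
```

Lines that do not follow the assumed layout, such as the traceback frames and the `[Q]` task line, are dropped on purpose, so the filter is only useful for spotting the logger messages, not for reading the full failure.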
I noticed the line:
ERROR 2021-01-28 22:58:34,107 _pipeline This PDF has a user fillable form. --redo-ocr is not currently possible on such files.
The PDF definitely has no user-fillable form; it is a non-OCR scan of a printed document.
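A PDF declares interactive forms through an `/AcroForm` entry in its document catalog, and a file can carry such an entry without showing any visible fields. As a rough check, assuming that this entry is what triggers the message, the raw bytes can be searched for the name. This is only a heuristic sketch using the standard library, and the function names are made up for illustration. A hit suggests the file declares a form dictionary, but a miss is not conclusive, because the catalog may sit inside a compressed object stream.

```python
from pathlib import Path

# Name that marks an interactive form dictionary in the PDF catalog.
ACROFORM_KEY = b"/AcroForm"


def declares_acroform(data: bytes) -> bool:
    """Return True if the raw PDF bytes contain the /AcroForm name.

    Heuristic only: a catalog stored in a compressed object stream
    will not be found by a plain byte search.
    """
    return ACROFORM_KEY in data


def file_declares_acroform(path: str) -> bool:
    """Read the file at `path` and apply the byte-level check."""
    return declares_acroform(Path(path).read_bytes())
```

Calling `file_declares_acroform("Scan_1_28012021_003254.pdf")` on the scanned file would show whether the scanner software wrote a form dictionary into it, even though no form is visible.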
Steps I tried:
Do you have any ideas on this?
Edit: Manually adding the PDF via the uploader throws the same error.