This repository has been archived by the owner on Apr 15, 2024. It is now read-only.
I have an issue with a specific kind of PDF that returns no data: the LTPage returned by device.get_result() has an empty _objs list.
I have been using the code below for quite a while, but for this kind of PDF the result is empty. The PDF doesn't seem to be image-only, since I can use Acrobat Reader to convert it to text.
I was wondering whether there is something I can change in laparams or the PDFResourceManager.
Please let me know your email and I can send you the PDF; I don't want to post it here.
from pdfminer.layout import LAParams, LTTextBoxHorizontal
from pdfminer.converter import PDFPageAggregator
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.pdfpage import PDFPage

pdf_name = 'MyPdf.pdf'
pdf_str = ''

# Create the resource manager.
rsrcmgr = PDFResourceManager()
# Set parameters for layout analysis.
laparams = LAParams(detect_vertical=True, all_texts=True)
# Create a PDF page aggregator object.
device = PDFPageAggregator(rsrcmgr, laparams=laparams)
interpreter = PDFPageInterpreter(rsrcmgr, device)

# Open the file in a with-block so it is closed when processing ends.
with open(pdf_name, 'rb') as document:
    for page in PDFPage.get_pages(document):
        interpreter.process_page(page)
        # Receive the LTPage object for the page.
        layout = device.get_result()
        # Append the lowercased text of every top-level horizontal text box.
        pdf_str += ''.join(
            element.get_text().lower()
            for element in layout
            if isinstance(element, LTTextBoxHorizontal)
        )

print(pdf_str)
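One way to narrow this down, offered only as a rough sketch: tally every object type found anywhere under the LTPage, not just at the top level. The helper name count_layout_types is hypothetical (it is not part of pdfminer), and it assumes only that layout containers are iterable while leaf objects are not.

```python
from collections import Counter


def count_layout_types(node, counts=None):
    """Recursively tally the class name of every object under a layout node.

    Hypothetical diagnostic helper, not part of pdfminer. It relies only on
    containers being iterable and leaf objects not being iterable.
    """
    if counts is None:
        counts = Counter()
    counts[type(node).__name__] += 1
    # Strings are iterable but must not be descended into.
    if not isinstance(node, (str, bytes)):
        try:
            children = iter(node)
        except TypeError:
            children = ()  # Leaf object: nothing to descend into.
        for child in children:
            count_layout_types(child, counts)
    return counts
```

Calling print(count_layout_types(layout)) right after device.get_result() in the script above would show whether the page really holds nothing besides the LTPage itself, or whether text is nested inside another container (for example a figure) that the top-level isinstance filter never reaches.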
Sorry, a few more points: I tried different combinations of LAParams values, and none of them returned anything.
The values I now get in device.result, which is my LTPage object: