Extract readable PDFs style information #197

Open
grossir opened this issue Jan 30, 2025 · 1 comment

grossir commented Jan 30, 2025

We already use pdfplumber to extract the text of readable PDFs. Using page.extract_words in conjunction with page.extract_text can sometimes recover style information as well. This may improve readability and help with issues like freelawproject/eyecite#198, where the style tags <i> and <em> are important.

page_text = page.extract_text(
    layout=True, keep_blank_chars=True, y_tolerance=5, y_density=25
)

For example, to get the italicized words on the second page:

Image

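import requests
import pdfplumber
from io import BytesIO

# Download the opinion and open it from memory
url = "https://storage.courtlistener.com/pdf/2025/01/30/georgia_insurers_insolvency_pool_v._logisticare_solutions_llc.pdf"
r = requests.get(url)
r.raise_for_status()
pdf = pdfplumber.open(BytesIO(r.content))
# pages is zero-indexed, so pages[1] is the second page;
# extra_attrs adds each word's font name to the returned dicts
words = pdf.pages[1].extract_words(extra_attrs=["fontname"])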

In [33]: [i['text'] for i in pdf.pages[1].extract_words(extra_attrs=["fontname"]) if i['fontname'] == 'TKMUCK+EquityARegular,Italic']
Out[33]: ['Wade', 'v.', 'Allstate', 'Fire', '&', 'Cas.', 'Co.']

Then, the italicized words can be mapped back onto the extracted main text. We would probably need to prebuild a list of courts where this is possible, and filter which styles we want to preserve.
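
A minimal sketch of that mapping step, not a proposed implementation. It works on plain word dicts shaped like the ones extract_words returns (with "text" and "fontname" keys), so it runs without a PDF. The helper names is_italic, italic_runs and tag_italics are hypothetical, and the "font name contains Italic or Oblique" check is an assumption based on the single font name seen above; other courts' PDFs may name their fonts differently.

```python
import re
from itertools import groupby


def is_italic(fontname: str) -> bool:
    # Heuristic assumption: the font name advertises its style, as in
    # 'TKMUCK+EquityARegular,Italic'. Not every PDF does this.
    name = fontname.lower()
    return "italic" in name or "oblique" in name


def italic_runs(words: list[dict]) -> list[list[str]]:
    """Group consecutive italic words (in extraction order) into runs."""
    runs = []
    for italic, group in groupby(words, key=lambda w: is_italic(w["fontname"])):
        if italic:
            runs.append([w["text"] for w in group])
    return runs


def tag_italics(page_text: str, words: list[dict]) -> str:
    """Wrap each italic run found in page_text in <em> tags.

    Assumes the words come in the same reading order as page_text.
    Runs are searched left to right from a moving cursor, so a run is
    never matched inside text that was already tagged.
    """
    out = []
    cursor = 0
    for run in italic_runs(words):
        # Layout-preserving extraction may put several spaces or a line
        # break between words, so allow any whitespace between tokens.
        pattern = re.compile(r"\s+".join(re.escape(token) for token in run))
        match = pattern.search(page_text, cursor)
        if match is None:
            continue  # run not found in the text; leave the text untouched
        out.append(page_text[cursor:match.start()])
        out.append(f"<em>{match.group(0)}</em>")
        cursor = match.end()
    out.append(page_text[cursor:])
    return "".join(out)
```

With pdfplumber, words would come from page.extract_words(extra_attrs=["fontname"]) and page_text from the page.extract_text call shown earlier. The text search is a shortcut: a short run such as "v." can match an earlier, non-italic occurrence of the same string, so a more careful version would probably align on word coordinates instead.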


grossir commented Jan 31, 2025

Another example where extracting styles (and linking the reference citations) would greatly improve readability:

Image
Image
