Extract readable PDFs style information #197

grossir · 2025-01-30T17:51:44Z

We already use pdfplumber to extract a readable PDF's text. Combining page.extract_words with page.extract_text can sometimes recover style information as well. This may improve readability and help with issues like freelawproject/eyecite#198, where style tags such as <i> and <em> are important.

doctor/doctor/lib/text_extraction.py

Lines 67 to 69 in 9e0e76f

    page_text = page.extract_text(
        layout=True, keep_blank_chars=True, y_tolerance=5, y_density=25
    )

For example, to get the italicized words on the second page:

import requests
import pdfplumber
from io import BytesIO

url = "https://storage.courtlistener.com/pdf/2025/01/30/georgia_insurers_insolvency_pool_v._logisticare_solutions_llc.pdf"
r = requests.get(url)
pdf = pdfplumber.open(BytesIO(r.content))
words = pdf.pages[1].extract_words(extra_attrs=["fontname"])

In [33]: [w['text'] for w in words if w['fontname'] == 'TKMUCK+EquityARegular,Italic']
Out[33]: ['Wade', 'v.', 'Allstate', 'Fire', '&', 'Cas.', 'Co.']

Then, the italicized words can be mapped back onto the main extracted text. We would probably need to prebuild a list of courts where this is possible, and filter which styles we want to preserve.
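As a rough sketch of that mapping step, the snippet below wraps consecutive italic words in <em> tags. It works on plain word dicts shaped like the output of extract_words(extra_attrs=["fontname"]), so it runs without a PDF. The helper names is_italic and tag_italics are hypothetical, the sample data is hand-written (the non-italic font name is made up), and detecting italics by looking for "italic" or "oblique" in the font name is an assumption that would need checking per court.

```python
from itertools import groupby


def is_italic(fontname: str) -> bool:
    """Heuristic (assumption): the style is encoded in the font name,
    as in 'TKMUCK+EquityARegular,Italic'. Not every PDF does this."""
    name = fontname.lower()
    return "italic" in name or "oblique" in name


def tag_italics(words: list[dict]) -> str:
    """Join words in the order given, wrapping each italic run in <em>."""
    parts = []
    # groupby collapses neighbouring words that share the same italic flag
    for italic, run in groupby(words, key=lambda w: is_italic(w["fontname"])):
        text = " ".join(w["text"] for w in run)
        parts.append(f"<em>{text}</em>" if italic else text)
    return " ".join(parts)


# Hand-written sample; 'XXXXXX+EquityARegular' is a hypothetical font name.
sample = [
    {"text": "See", "fontname": "XXXXXX+EquityARegular"},
    {"text": "Wade", "fontname": "TKMUCK+EquityARegular,Italic"},
    {"text": "v.", "fontname": "TKMUCK+EquityARegular,Italic"},
    {"text": "Allstate", "fontname": "TKMUCK+EquityARegular,Italic"},
    {"text": "for", "fontname": "XXXXXX+EquityARegular"},
    {"text": "more", "fontname": "XXXXXX+EquityARegular"},
]

print(tag_italics(sample))
```

This joins words with single spaces, so it does not reproduce the layout=True spacing used in text_extraction.py. Aligning the tags with the layout text would likely need the position keys (such as x0 and top) that extract_words also returns for each word.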


grossir · 2025-01-31T16:44:56Z

Another example where extracting styles (and linking the reference citations) would greatly improve readability

grossir added this to Case Law Sprint Jan 30, 2025
