We already use pdfplumber to extract a readable PDF's text. If we use `page.extract_words` in conjunction with `page.extract_text`, it will sometimes be able to extract style information. This may improve readability and help on issues like freelawproject/eyecite#198 where style tags `<i>` and `<em>` are important.

doctor/doctor/lib/text_extraction.py
Lines 67 to 69 in 9e0e76f
For example, to get the italics in the second page
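
A minimal sketch of that idea, assuming italic fonts carry "Italic" or "Oblique" in their font name. That is a heuristic and will miss fonts that abbreviate the style, so treat it as a starting point. `is_italic`, `italic_words`, and `italic_words_on_page` are hypothetical helper names, not part of doctor or pdfplumber; `extra_attrs=["fontname"]` is the `extract_words` option that attaches the font name to each word.

```python
def is_italic(fontname):
    """Heuristic (assumption): the style is encoded in the font name."""
    name = fontname.lower()
    return "italic" in name or "oblique" in name


def italic_words(words):
    """Return the text of every word dict whose font name looks italic."""
    return [w["text"] for w in words if is_italic(w.get("fontname", ""))]


def italic_words_on_page(pdf_path, page_index=1):
    """Collect italic words from one page (index 1 is the second page)."""
    import pdfplumber  # imported lazily so the helpers above have no dependency

    with pdfplumber.open(pdf_path) as pdf:
        page = pdf.pages[page_index]
        # extra_attrs adds "fontname" to each word dict and splits words
        # wherever the font changes.
        words = page.extract_words(extra_attrs=["fontname"])
    return italic_words(words)
```

Calling `italic_words_on_page("opinion.pdf")` on a hypothetical file would return the italic words of its second page, provided the PDF embeds font names that follow this naming pattern.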
Then, the italicized words can be matched back to their positions in the extracted main text. We would probably need to prebuild a list of courts where this is possible, and filter which styles we want to preserve.
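
One way the matching step could look, as a rough sketch and not doctor's actual implementation: walk the words in reading order, find each one in the extracted text, and wrap consecutive italic words in `<em>`. `tag_italics` is a hypothetical name, the font-name check is the same heuristic assumption as above, and the in-order `str.find` lookup assumes that `extract_words` and `extract_text` yield words in the same order.

```python
def tag_italics(text, words):
    """Wrap runs of italic words in <em>, locating each word in order."""
    out = []
    cursor = 0      # position in `text` consumed so far
    in_em = False   # whether an <em> run is currently open
    for w in words:
        idx = text.find(w["text"], cursor)
        if idx == -1:
            continue  # word not present in the main text; skip it
        italic = "italic" in w.get("fontname", "").lower()
        if in_em and not italic:
            out.append("</em>")
            in_em = False
        out.append(text[cursor:idx])  # whitespace between words
        if italic and not in_em:
            out.append("<em>")
            in_em = True
        out.append(w["text"])
        cursor = idx + len(w["text"])
    if in_em:
        out.append("</em>")
    out.append(text[cursor:])  # any trailing text after the last word
    return "".join(out)


words = [
    {"text": "See", "fontname": "Times-Roman"},
    {"text": "Roe", "fontname": "Times-Italic"},
    {"text": "v.", "fontname": "Times-Italic"},
    {"text": "Wade", "fontname": "Times-Italic"},
    {"text": ",", "fontname": "Times-Roman"},
    {"text": "410", "fontname": "Times-Roman"},
]
tagged = tag_italics("See Roe v. Wade, 410", words)
# tagged == "See <em>Roe v. Wade</em>, 410"
```

A per-court allowlist and a filter on which styles to keep, as suggested above, would sit in front of this step; they are left out here because their shape depends on data this issue does not specify.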