Commit

Merge pull request #411 from jhale1805/non_agpl_epub_extractor
Remove EbookLib dependency
Dean Malmgren authored Mar 10, 2022
2 parents 902028f + e81913b commit ac583a1
Showing 4 changed files with 45 additions and 34 deletions.
3 changes: 3 additions & 0 deletions .gitignore
@@ -25,6 +25,9 @@ var/
 pip-log.txt
 pip-delete-this-directory.txt
 
+# Virtual environments
+**/venv*
+
 # Unit test / coverage reports
 htmlcov/
 .tox/
1 change: 0 additions & 1 deletion requirements/python
@@ -5,7 +5,6 @@ argcomplete~=1.10.0
 beautifulsoup4~=4.8.0
 chardet==3.*
 docx2txt~=0.8
-EbookLib==0.*
 extract-msg<=0.29.* #Last with python2 support
 pdfminer.six==20191110 #Last with python2 support
 python-pptx~=0.6.18
19 changes: 0 additions & 19 deletions tests/epub/raw_text.txt
@@ -1,11 +1,7 @@
-
 Epub testing
 With subtitle...
-
 Introduction
 Welcome here! All the text have ben generate with the Samuel L lorem ipsum.
-
-
 We happy?
 Well, the way they make shows is, they make one show. That show's called a pilot. Then they show that show to the people who make shows, and on the strength of that one show they decide if they're going to make more shows. Some pilots get picked and become television programs. Some don't, become nothing. She starred in one of the ones that became nothing.
 We happy?
@@ -16,7 +12,6 @@ No man, I don't eat pork
 Well, the way they make shows is, they make one show. That show's called a pilot. Then they show that show to the people who make shows, and on the strength of that one show they decide if they're going to make more shows. Some pilots get picked and become television programs. Some don't, become nothing. She starred in one of the ones that became nothing.
 Is she dead, yes or no?
 The path of the righteous man is beset on all sides by the iniquities of the selfish and the tyranny of evil men. Blessed is he who, in the name of charity and good will, shepherds the weak through the valley of darkness, for he is truly his brother's keeper and the finder of lost children. And I will strike down upon thee with great vengeance and furious anger those who would attempt to poison and destroy My brothers. And you will know My name is the Lord when I lay My vengeance upon thee.
-
 We happy?
 Well, the way they make shows is, they make one show. That show's called a pilot. Then they show that show to the people who make shows, and on the strength of that one show they decide if they're going to make more shows. Some pilots get picked and become television programs. Some don't, become nothing. She starred in one of the ones that became nothing.
 We happy?
@@ -27,7 +22,6 @@ No man, I don't eat pork
 Well, the way they make shows is, they make one show. That show's called a pilot. Then they show that show to the people who make shows, and on the strength of that one show they decide if they're going to make more shows. Some pilots get picked and become television programs. Some don't, become nothing. She starred in one of the ones that became nothing.
 Is she dead, yes or no?
 The path of the righteous man is beset on all sides by the iniquities of the selfish and the tyranny of evil men. Blessed is he who, in the name of charity and good will, shepherds the weak through the valley of darkness, for he is truly his brother's keeper and the finder of lost children. And I will strike down upon thee with great vengeance and furious anger those who would attempt to poison and destroy My brothers. And you will know My name is the Lord when I lay My vengeance upon thee.
-
 We happy?
 Well, the way they make shows is, they make one show. That show's called a pilot. Then they show that show to the people who make shows, and on the strength of that one show they decide if they're going to make more shows. Some pilots get picked and become television programs. Some don't, become nothing. She starred in one of the ones that became nothing.
 We happy?
@@ -38,7 +32,6 @@ No man, I don't eat pork
 Well, the way they make shows is, they make one show. That show's called a pilot. Then they show that show to the people who make shows, and on the strength of that one show they decide if they're going to make more shows. Some pilots get picked and become television programs. Some don't, become nothing. She starred in one of the ones that became nothing.
 Is she dead, yes or no?
 The path of the righteous man is beset on all sides by the iniquities of the selfish and the tyranny of evil men. Blessed is he who, in the name of charity and good will, shepherds the weak through the valley of darkness, for he is truly his brother's keeper and the finder of lost children. And I will strike down upon thee with great vengeance and furious anger those who would attempt to poison and destroy My brothers. And you will know My name is the Lord when I lay My vengeance upon thee.
-
 We happy?
 Well, the way they make shows is, they make one show. That show's called a pilot. Then they show that show to the people who make shows, and on the strength of that one show they decide if they're going to make more shows. Some pilots get picked and become television programs. Some don't, become nothing. She starred in one of the ones that became nothing.
 We happy?
@@ -49,18 +42,6 @@ No man, I don't eat pork
 Well, the way they make shows is, they make one show. That show's called a pilot. Then they show that show to the people who make shows, and on the strength of that one show they decide if they're going to make more shows. Some pilots get picked and become television programs. Some don't, become nothing. She starred in one of the ones that became nothing.
 Is she dead, yes or no?
 The path of the righteous man is beset on all sides by the iniquities of the selfish and the tyranny of evil men. Blessed is he who, in the name of charity and good will, shepherds the weak through the valley of darkness, for he is truly his brother's keeper and the finder of lost children. And I will strike down upon thee with great vengeance and furious anger those who would attempt to poison and destroy My brothers. And you will know My name is the Lord when I lay My vengeance upon thee.
-
-We happy?
-Well, the way they make shows is, they make one show. That show's called a pilot. Then they show that show to the people who make shows, and on the strength of that one show they decide if they're going to make more shows. Some pilots get picked and become television programs. Some don't, become nothing. She starred in one of the ones that became nothing.
-We happy?
-The lysine contingency - it's intended to prevent the spread of the animals is case they ever got off the island. Dr. Wu inserted a gene that makes a single faulty enzyme in protein metabolism. The animals can't manufacture the amino acid lysine. Unless they're continually supplied with lysine by us, they'll slip into a coma and die.
-Oh... what I'm gon' do?
-The path of the righteous man is beset on all sides by the iniquities of the selfish and the tyranny of evil men. Blessed is he who, in the name of charity and good will, shepherds the weak through the valley of darkness, for he is truly his brother's keeper and the finder of lost children. And I will strike down upon thee with great vengeance and furious anger those who would attempt to poison and destroy My brothers. And you will know My name is the Lord when I lay My vengeance upon thee.
-No man, I don't eat pork
-Well, the way they make shows is, they make one show. That show's called a pilot. Then they show that show to the people who make shows, and on the strength of that one show they decide if they're going to make more shows. Some pilots get picked and become television programs. Some don't, become nothing. She starred in one of the ones that became nothing.
-Is she dead, yes or no?
-The path of the righteous man is beset on all sides by the iniquities of the selfish and the tyranny of evil men. Blessed is he who, in the name of charity and good will, shepherds the weak through the valley of darkness, for he is truly his brother's keeper and the finder of lost children. And I will strike down upon thee with great vengeance and furious anger those who would attempt to poison and destroy My brothers. And you will know My name is the Lord when I lay My vengeance upon thee.
-
 We happy?
 Well, the way they make shows is, they make one show. That show's called a pilot. Then they show that show to the people who make shows, and on the strength of that one show they decide if they're going to make more shows. Some pilots get picked and become television programs. Some don't, become nothing. She starred in one of the ones that became nothing.
 We happy?
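Note on the fixture change: every hunk in tests/epub/raw_text.txt drops at least one blank line, which is consistent with the rewritten parser stripping each tag's text and skipping it when nothing is left. A minimal sketch of that filtering step in isolation, assuming plain strings in place of BeautifulSoup tags (`join_nonempty` and `fragments` are hypothetical names used only for illustration):

```python
def join_nonempty(texts):
    """Join text fragments one per line, skipping empty or whitespace-only ones."""
    result = ''
    for text in texts:
        # Same guard as the parser: tolerate None, strip surrounding whitespace.
        inner_text = text.strip() if text else ""
        if inner_text:
            result += inner_text + '\n'
    return result


# Hypothetical fragments standing in for the .text of each matched tag.
fragments = ["Epub testing", "With subtitle...", "\n", None, "  Introduction "]
print(join_nonempty(fragments), end="")
```

Without the `if inner_text:` guard, the whitespace-only fragment would contribute an empty line to the output, which is the kind of line the old fixture contained.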
56 changes: 42 additions & 14 deletions textract/parsers/epub_parser.py
@@ -1,25 +1,53 @@
-from ebooklib import epub, ITEM_DOCUMENT
+import zipfile
 from bs4 import BeautifulSoup
 
 from .utils import BaseParser
 
 
 class Parser(BaseParser):
-    """Extract text from epub using python epub library
-    """
+    """Extract text from epub"""
 
     def extract(self, filename, **kwargs):
-        book = epub.read_epub(filename)
+        book = zipfile.ZipFile(filename)
         result = ''
-        for id, _ in book.spine:
-            item = book.get_item_with_id(id)
-            # Don't fail with some AttributeError exception when the item is of NoneType
-            # (i.e. at the last position).
-            if item is None:
+        for text_name in self.__epub_sections(book):
+            if not text_name.endswith("html"):
                 continue
-            soup = BeautifulSoup(item.content, 'lxml')
-            for child in soup.find_all(
-                ['title', 'p', 'div', 'h1', 'h2', 'h3', 'h4']
-            ):
-                result = result + child.text + '\n'
+            soup = BeautifulSoup(book.open(text_name), features='lxml')
+            html_content_tags = ['title', 'p', 'h1', 'h2', 'h3', 'h4']
+            for child in soup.find_all(html_content_tags):
+                inner_text = child.text.strip() if child.text else ""
+                if inner_text:
+                    result += inner_text + '\n'
         return result
+
+    def __epub_sections(self, book):
+        opf_paths = self.__get_opf_paths(book)
+        item_paths = self.__get_item_paths(book, opf_paths)
+        return item_paths
+
+    def __get_opf_paths(self, book):
+        meta_inf = book.open("META-INF/container.xml")
+        meta_soup = BeautifulSoup(meta_inf, features='lxml')
+        return [f["full-path"] for f in meta_soup.rootfiles.find_all("rootfile")]
+
+    def __get_item_paths(self, book, opf_paths):
+        item_paths = []
+        for opf_path in opf_paths:
+            opf_soup = BeautifulSoup(book.open(opf_path), "lxml")
+            epub_items = opf_soup.spine.find_all("itemref")
+            for epub_item in epub_items:
+                item = self.__get_item(opf_soup, epub_item["idref"])
+                item_paths.append(self.__get_full_item_path(book, item["href"]))
+        return item_paths
+
+    def __get_item(self, opf_soup, item_id):
+        for item in opf_soup.manifest.find_all("item"):
+            if item["id"] == item_id:
+                return item
+        return None
+
+    def __get_full_item_path(self, book, partial_path):
+        for filename in book.namelist():
+            if filename.endswith(partial_path):
+                return filename
