Merge pull request #411 from jhale1805/non_agpl_epub_extractor

Remove EbookLib dependency
deanmalmgren · Mar 10, 2022 · ac583a1
2 parents 902028f + e81913b
commit ac583a1

Showing 4 changed files with 45 additions and 34 deletions.
diff --git a/.gitignore b/.gitignore
@@ -25,6 +25,9 @@ var/
 pip-log.txt
 pip-delete-this-directory.txt
 
+# Virtual environments
+**/venv*
+
 # Unit test / coverage reports
 htmlcov/
 .tox/

diff --git a/requirements/python b/requirements/python
@@ -5,7 +5,6 @@ argcomplete~=1.10.0
 beautifulsoup4~=4.8.0
 chardet==3.*
 docx2txt~=0.8
-EbookLib==0.*
 extract-msg<=0.29.* #Last with python2 support
 pdfminer.six==20191110 #Last with python2 support
 python-pptx~=0.6.18

diff --git a/tests/epub/raw_text.txt b/tests/epub/raw_text.txt
@@ -1,11 +1,7 @@
-
 Epub testing
 With subtitle...
-
 Introduction
 Welcome here! All the text have ben generate with the Samuel L lorem ipsum.
-
-
 We happy?
 Well, the way they make shows is, they make one show. That show's called a pilot. Then they show that show to the people who make shows, and on the strength of that one show they decide if they're going to make more shows. Some pilots get picked and become television programs. Some don't, become nothing. She starred in one of the ones that became nothing.
 We happy?
@@ -16,7 +12,6 @@ No man, I don't eat pork
 Well, the way they make shows is, they make one show. That show's called a pilot. Then they show that show to the people who make shows, and on the strength of that one show they decide if they're going to make more shows. Some pilots get picked and become television programs. Some don't, become nothing. She starred in one of the ones that became nothing.
 Is she dead, yes or no?
 The path of the righteous man is beset on all sides by the iniquities of the selfish and the tyranny of evil men. Blessed is he who, in the name of charity and good will, shepherds the weak through the valley of darkness, for he is truly his brother's keeper and the finder of lost children. And I will strike down upon thee with great vengeance and furious anger those who would attempt to poison and destroy My brothers. And you will know My name is the Lord when I lay My vengeance upon thee.
-
 We happy?
 Well, the way they make shows is, they make one show. That show's called a pilot. Then they show that show to the people who make shows, and on the strength of that one show they decide if they're going to make more shows. Some pilots get picked and become television programs. Some don't, become nothing. She starred in one of the ones that became nothing.
 We happy?
@@ -27,7 +22,6 @@ No man, I don't eat pork
 Well, the way they make shows is, they make one show. That show's called a pilot. Then they show that show to the people who make shows, and on the strength of that one show they decide if they're going to make more shows. Some pilots get picked and become television programs. Some don't, become nothing. She starred in one of the ones that became nothing.
 Is she dead, yes or no?
 The path of the righteous man is beset on all sides by the iniquities of the selfish and the tyranny of evil men. Blessed is he who, in the name of charity and good will, shepherds the weak through the valley of darkness, for he is truly his brother's keeper and the finder of lost children. And I will strike down upon thee with great vengeance and furious anger those who would attempt to poison and destroy My brothers. And you will know My name is the Lord when I lay My vengeance upon thee.
-
 We happy?
 Well, the way they make shows is, they make one show. That show's called a pilot. Then they show that show to the people who make shows, and on the strength of that one show they decide if they're going to make more shows. Some pilots get picked and become television programs. Some don't, become nothing. She starred in one of the ones that became nothing.
 We happy?
@@ -38,7 +32,6 @@ No man, I don't eat pork
 Well, the way they make shows is, they make one show. That show's called a pilot. Then they show that show to the people who make shows, and on the strength of that one show they decide if they're going to make more shows. Some pilots get picked and become television programs. Some don't, become nothing. She starred in one of the ones that became nothing.
 Is she dead, yes or no?
 The path of the righteous man is beset on all sides by the iniquities of the selfish and the tyranny of evil men. Blessed is he who, in the name of charity and good will, shepherds the weak through the valley of darkness, for he is truly his brother's keeper and the finder of lost children. And I will strike down upon thee with great vengeance and furious anger those who would attempt to poison and destroy My brothers. And you will know My name is the Lord when I lay My vengeance upon thee.
-
 We happy?
 Well, the way they make shows is, they make one show. That show's called a pilot. Then they show that show to the people who make shows, and on the strength of that one show they decide if they're going to make more shows. Some pilots get picked and become television programs. Some don't, become nothing. She starred in one of the ones that became nothing.
 We happy?
@@ -49,18 +42,6 @@ No man, I don't eat pork
 Well, the way they make shows is, they make one show. That show's called a pilot. Then they show that show to the people who make shows, and on the strength of that one show they decide if they're going to make more shows. Some pilots get picked and become television programs. Some don't, become nothing. She starred in one of the ones that became nothing.
 Is she dead, yes or no?
 The path of the righteous man is beset on all sides by the iniquities of the selfish and the tyranny of evil men. Blessed is he who, in the name of charity and good will, shepherds the weak through the valley of darkness, for he is truly his brother's keeper and the finder of lost children. And I will strike down upon thee with great vengeance and furious anger those who would attempt to poison and destroy My brothers. And you will know My name is the Lord when I lay My vengeance upon thee.
-
-We happy?
-Well, the way they make shows is, they make one show. That show's called a pilot. Then they show that show to the people who make shows, and on the strength of that one show they decide if they're going to make more shows. Some pilots get picked and become television programs. Some don't, become nothing. She starred in one of the ones that became nothing.
-We happy?
-The lysine contingency - it's intended to prevent the spread of the animals is case they ever got off the island. Dr. Wu inserted a gene that makes a single faulty enzyme in protein metabolism. The animals can't manufacture the amino acid lysine. Unless they're continually supplied with lysine by us, they'll slip into a coma and die.
-Oh... what I'm gon' do?
-The path of the righteous man is beset on all sides by the iniquities of the selfish and the tyranny of evil men. Blessed is he who, in the name of charity and good will, shepherds the weak through the valley of darkness, for he is truly his brother's keeper and the finder of lost children. And I will strike down upon thee with great vengeance and furious anger those who would attempt to poison and destroy My brothers. And you will know My name is the Lord when I lay My vengeance upon thee.
-No man, I don't eat pork
-Well, the way they make shows is, they make one show. That show's called a pilot. Then they show that show to the people who make shows, and on the strength of that one show they decide if they're going to make more shows. Some pilots get picked and become television programs. Some don't, become nothing. She starred in one of the ones that became nothing.
-Is she dead, yes or no?
-The path of the righteous man is beset on all sides by the iniquities of the selfish and the tyranny of evil men. Blessed is he who, in the name of charity and good will, shepherds the weak through the valley of darkness, for he is truly his brother's keeper and the finder of lost children. And I will strike down upon thee with great vengeance and furious anger those who would attempt to poison and destroy My brothers. And you will know My name is the Lord when I lay My vengeance upon thee.
-
 We happy?
 Well, the way they make shows is, they make one show. That show's called a pilot. Then they show that show to the people who make shows, and on the strength of that one show they decide if they're going to make more shows. Some pilots get picked and become television programs. Some don't, become nothing. She starred in one of the ones that became nothing.
 We happy?

diff --git a/textract/parsers/epub_parser.py b/textract/parsers/epub_parser.py
@@ -1,25 +1,53 @@
-from ebooklib import epub, ITEM_DOCUMENT
+import zipfile
 from bs4 import BeautifulSoup
 
 from .utils import BaseParser
 
 
 class Parser(BaseParser):
-    """Extract text from epub using python epub library
-    """
+    """Extract text from epub"""
 
     def extract(self, filename, **kwargs):
-        book = epub.read_epub(filename)
+        book = zipfile.ZipFile(filename)
         result = ''
-        for id, _ in book.spine:
-            item = book.get_item_with_id(id)
-            # Don't fail with some AttributeError exception when the item is of NoneType
-            # (i.e. at the last position).
-            if item is None:
+        for text_name in self.__epub_sections(book):
+            if not text_name.endswith("html"):
                 continue
-            soup = BeautifulSoup(item.content, 'lxml')
-            for child in soup.find_all(
-                ['title', 'p', 'div', 'h1', 'h2', 'h3', 'h4']
-            ):
-                result = result + child.text + '\n'
+            soup = BeautifulSoup(book.open(text_name), features='lxml')
+            html_content_tags = ['title', 'p', 'h1', 'h2', 'h3', 'h4']
+            for child in soup.find_all(html_content_tags):
+                inner_text = child.text.strip() if child.text else ""
+                if inner_text:
+                    result += inner_text + '\n'
         return result
+
+    def __epub_sections(self, book):
+        opf_paths = self.__get_opf_paths(book)
+        item_paths = self.__get_item_paths(book, opf_paths)
+        return item_paths
+
+    def __get_opf_paths(self, book):
+        meta_inf = book.open("META-INF/container.xml")
+        meta_soup = BeautifulSoup(meta_inf, features='lxml')
+        return [f["full-path"] for f in meta_soup.rootfiles.find_all("rootfile")]
+
+    def __get_item_paths(self, book, opf_paths):
+        item_paths = []
+        for opf_path in opf_paths:
+            opf_soup = BeautifulSoup(book.open(opf_path), "lxml")
+            epub_items = opf_soup.spine.find_all("itemref")
+            for epub_item in epub_items:
+                item = self.__get_item(opf_soup, epub_item["idref"])
+                item_paths.append(self.__get_full_item_path(book, item["href"]))
+        return item_paths
+
+    def __get_item(self, opf_soup, item_id):
+        for item in opf_soup.manifest.find_all("item"):
+            if item["id"] == item_id:
+                return item
+        return None
+
+    def __get_full_item_path(self, book, partial_path):
+        for filename in book.namelist():
+            if filename.endswith(partial_path):
+                return filename
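
Review note: the new parser follows the EPUB lookup chain `META-INF/container.xml` -> OPF package file -> `<spine>` itemrefs -> `<manifest>` hrefs. Two details in the diff may be worth a follow-up: `__get_item` can return `None`, after which `item["href"]` would raise a `TypeError`, and `__get_full_item_path` matches with `endswith`, which could pick the wrong file when two archive members share a suffix. The sketch below is one possible alternative, not the code merged here. It assumes the standard container and OPF XML namespaces, swaps BeautifulSoup for the standard-library `xml.etree.ElementTree`, and uses a hypothetical function name `spine_paths`. It resolves each href relative to the OPF file's directory and skips spine entries that have no manifest item.

```python
import posixpath
import zipfile
import xml.etree.ElementTree as ET

# Assumed namespaces: the standard OCF container and OPF package namespaces.
CONTAINER_NS = {"c": "urn:oasis:names:tc:opendocument:xmlns:container"}
OPF_NS = {"opf": "http://www.idpf.org/2007/opf"}


def spine_paths(book):
    """Return archive paths of spine documents in reading order.

    `book` is an open zipfile.ZipFile. The name `spine_paths` is
    hypothetical and does not exist in textract.
    """
    container = ET.parse(book.open("META-INF/container.xml"))
    paths = []
    for rootfile in container.findall(".//c:rootfile", CONTAINER_NS):
        opf_path = rootfile.get("full-path")
        opf = ET.parse(book.open(opf_path))
        # Build the id -> href map once instead of scanning the
        # manifest for every itemref.
        hrefs = {
            item.get("id"): item.get("href")
            for item in opf.findall(".//opf:manifest/opf:item", OPF_NS)
        }
        # Manifest hrefs are relative to the directory holding the OPF file.
        base = posixpath.dirname(opf_path)
        for itemref in opf.findall(".//opf:spine/opf:itemref", OPF_NS):
            href = hrefs.get(itemref.get("idref"))
            if href is None:
                continue  # spine entry without a matching manifest item
            paths.append(posixpath.normpath(posixpath.join(base, href)))
    return paths
```

With something like this, `extract` could iterate over `spine_paths(zipfile.ZipFile(filename))` in place of `self.__epub_sections(book)`. Percent-encoded hrefs and fragment identifiers are not handled in this sketch.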