feat(epub): Add EPUB support #123

0xRaduan · 2024-12-18T09:58:57Z

Addresses #88.

Adds a new EPUB converter and a new test.

l-lumin · 2024-12-19T04:00:04Z

src/markitdown/_markitdown.py

+        # Convert content
+        content_md = []
+        h = html2text.HTML2Text()
+        h.body_width = 0  # Don't wrap lines


Hi, could you check whether this can use the existing HtmlConverter?

markitdown/src/markitdown/_markitdown.py

Lines 183 to 223 in cb66b35

class HtmlConverter(DocumentConverter):
    """Anything with content type text/html"""

    def convert(
        self, local_path: str, **kwargs: Any
    ) -> Union[None, DocumentConverterResult]:
        # Bail if not html
        extension = kwargs.get("file_extension", "")
        if extension.lower() not in [".html", ".htm"]:
            return None

        result = None
        with open(local_path, "rt", encoding="utf-8") as fh:
            result = self._convert(fh.read())

        return result

    def _convert(self, html_content: str) -> Union[None, DocumentConverterResult]:
        """Helper function that converts and HTML string."""

        # Parse the string
        soup = BeautifulSoup(html_content, "html.parser")

        # Remove javascript and style blocks
        for script in soup(["script", "style"]):
            script.extract()

        # Print only the main content
        body_elm = soup.find("body")
        webpage_text = ""
        if body_elm:
            webpage_text = _CustomMarkdownify().convert_soup(body_elm)
        else:
            webpage_text = _CustomMarkdownify().convert_soup(soup)

        assert isinstance(webpage_text, str)

        return DocumentConverterResult(
            title=None if soup.title is None else soup.title.string,
            text_content=webpage_text,
        )

gagb · 2024-12-20T22:51:55Z

@0xRaduan love this PR. We already have a dependency for HTML-to-text conversion (markdownify) in the HTML converter. Can you check whether that would be sufficient?
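
To make the suggestion concrete, here is a minimal sketch of how an EPUB converter could delegate HTML handling to an existing converter rather than pulling in html2text. It is not the code from this PR: `epub_to_markdown` and the `html_to_markdown` hook are hypothetical names, and the sketch assumes UTF-8 content documents and unencoded manifest hrefs. It uses only the standard library to walk the EPUB container; the actual HTML-to-Markdown step is whatever callable is passed in.

```python
import posixpath
import zipfile
import xml.etree.ElementTree as ET
from typing import Callable, List

# XML namespaces used by the EPUB container file and the OPF package document.
NS = {
    "c": "urn:oasis:names:tc:opendocument:xmlns:container",
    "opf": "http://www.idpf.org/2007/opf",
}


def epub_to_markdown(epub_file, html_to_markdown: Callable[[str], str]) -> str:
    """Convert each spine document of an EPUB using the supplied HTML converter.

    `html_to_markdown` is a hypothetical hook (an assumption of this sketch):
    any callable that takes an HTML string and returns Markdown text.
    """
    with zipfile.ZipFile(epub_file) as zf:
        # META-INF/container.xml points at the OPF package document.
        container = ET.fromstring(zf.read("META-INF/container.xml"))
        opf_path = container.find(".//c:rootfile", NS).attrib["full-path"]
        opf_dir = posixpath.dirname(opf_path)
        package = ET.fromstring(zf.read(opf_path))

        # Map manifest ids to hrefs, then follow the spine for reading order.
        hrefs = {
            item.attrib["id"]: item.attrib["href"]
            for item in package.findall("opf:manifest/opf:item", NS)
        }
        parts: List[str] = []
        for ref in package.findall("opf:spine/opf:itemref", NS):
            href = hrefs.get(ref.attrib["idref"])
            if href is None:
                continue  # Spine entry without a manifest item: skip it.
            html = zf.read(posixpath.join(opf_dir, href)).decode("utf-8")
            parts.append(html_to_markdown(html).strip())
    return "\n\n".join(parts)
```

Judging by the HtmlConverter snippet quoted above, the hook could plausibly be something like `lambda html: HtmlConverter()._convert(html).text_content`, which would reuse markdownify and avoid the extra dependency. That wiring is untested here and would need a guard for the `None` return case.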

0xRaduan and others added 3 commits December 18, 2024 10:58

feat(epub): Add EPUB support

cd6058e

add new dependencies

98f1cdb

Merge branch 'main' into add-epub-support

33104e8

l-lumin reviewed Dec 19, 2024

gagb added the "awaiting op response" label (the PR is awaiting response/edits from the original poster) Dec 20, 2024
