feat(epub): Add EPUB support #123
base: main
# Convert content
content_md = []
h = html2text.HTML2Text()
h.body_width = 0  # Don't wrap lines
Hi, could you check whether this can use the existing HtmlConverter?
markitdown/src/markitdown/_markitdown.py
Lines 183 to 223 in cb66b35
class HtmlConverter(DocumentConverter):
    """Anything with content type text/html"""

    def convert(
        self, local_path: str, **kwargs: Any
    ) -> Union[None, DocumentConverterResult]:
        # Bail if not html
        extension = kwargs.get("file_extension", "")
        if extension.lower() not in [".html", ".htm"]:
            return None

        result = None
        with open(local_path, "rt", encoding="utf-8") as fh:
            result = self._convert(fh.read())

        return result

    def _convert(self, html_content: str) -> Union[None, DocumentConverterResult]:
        """Helper function that converts and HTML string."""
        # Parse the string
        soup = BeautifulSoup(html_content, "html.parser")

        # Remove javascript and style blocks
        for script in soup(["script", "style"]):
            script.extract()

        # Print only the main content
        body_elm = soup.find("body")
        webpage_text = ""
        if body_elm:
            webpage_text = _CustomMarkdownify().convert_soup(body_elm)
        else:
            webpage_text = _CustomMarkdownify().convert_soup(soup)

        assert isinstance(webpage_text, str)

        return DocumentConverterResult(
            title=None if soup.title is None else soup.title.string,
            text_content=webpage_text,
        )
+1
@0xRaduan love this PR. We already have a dependency for HTML to text (markdownify) in the HTML converter. Can you check if that would be sufficient?
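To make the suggestion concrete, here is a rough sketch of what delegating to an existing HTML converter might look like. It is not the code in this PR: `epub_to_markdown` and its `html_to_markdown` parameter are hypothetical names, and the sketch assumes the EPUB's chapters are UTF-8 (X)HTML files that can be read in archive order. A real converter would follow the spine in the OPF package document instead.

```python
import io
import zipfile
from typing import Callable, List


def epub_to_markdown(epub_bytes: bytes, html_to_markdown: Callable[[str], str]) -> str:
    """Convert every (X)HTML document inside an EPUB archive to Markdown.

    `html_to_markdown` is a stand-in (hypothetical) for whatever HTML
    converter the project already ships, so no second HTML-to-text
    dependency is needed.
    """
    parts: List[str] = []
    with zipfile.ZipFile(io.BytesIO(epub_bytes)) as archive:
        # Assumption: archive order approximates reading order.
        for name in archive.namelist():
            if name.lower().endswith((".xhtml", ".html", ".htm")):
                html = archive.read(name).decode("utf-8")
                parts.append(html_to_markdown(html).strip())
    # Join non-empty chapters with a blank line between them.
    return "\n\n".join(part for part in parts if part)
```

In this codebase the callable could plausibly be a thin wrapper around `HtmlConverter()._convert(html)` that returns the result's `text_content`, keeping in mind that `_convert` is annotated as possibly returning `None`.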
Addresses #88.
Adds new converter + new test.