feat(epub): Add EPUB support #123

Open: 3 commits to be merged into main

0xRaduan

Addresses #88.

Adds a new EPUB converter and a test for it.

# Convert content
content_md = []
h = html2text.HTML2Text()
h.body_width = 0 # Don't wrap lines

Hi, could you check whether this can use the existing HtmlConverter?

class HtmlConverter(DocumentConverter):
    """Anything with content type text/html"""

    def convert(
        self, local_path: str, **kwargs: Any
    ) -> Union[None, DocumentConverterResult]:
        # Bail if not html
        extension = kwargs.get("file_extension", "")
        if extension.lower() not in [".html", ".htm"]:
            return None

        result = None
        with open(local_path, "rt", encoding="utf-8") as fh:
            result = self._convert(fh.read())

        return result

    def _convert(self, html_content: str) -> Union[None, DocumentConverterResult]:
        """Helper function that converts and HTML string."""
        # Parse the string
        soup = BeautifulSoup(html_content, "html.parser")

        # Remove javascript and style blocks
        for script in soup(["script", "style"]):
            script.extract()

        # Print only the main content
        body_elm = soup.find("body")
        webpage_text = ""
        if body_elm:
            webpage_text = _CustomMarkdownify().convert_soup(body_elm)
        else:
            webpage_text = _CustomMarkdownify().convert_soup(soup)

        assert isinstance(webpage_text, str)

        return DocumentConverterResult(
            title=None if soup.title is None else soup.title.string,
            text_content=webpage_text,
        )
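
To make the suggestion concrete, here is a rough sketch of how an EPUB converter might delegate to an existing HTML-to-Markdown step instead of configuring html2text itself. It is not the code from this PR: the function name `epub_to_markdown` and the `html_to_markdown` parameter are hypothetical, and it assumes the chapters can be read in archive order, whereas a real converter should follow the spine declared in the EPUB's OPF package document.

```python
import zipfile
from typing import Callable, List


def epub_to_markdown(epub_file, html_to_markdown: Callable[[str], str]) -> str:
    """Convert every (X)HTML document inside an EPUB archive to Markdown.

    `epub_file` is a path or a binary file-like object. `html_to_markdown`
    is a hypothetical hook for whatever HTML conversion the project already
    has, so no second HTML-to-text dependency is needed.
    """
    parts: List[str] = []
    with zipfile.ZipFile(epub_file) as archive:
        # Assumption: archive order approximates reading order.
        for name in archive.namelist():
            if name.lower().endswith((".xhtml", ".html", ".htm")):
                html = archive.read(name).decode("utf-8")
                parts.append(html_to_markdown(html).strip())
    # Join the non-empty chapters with a blank line between them.
    return "\n\n".join(part for part in parts if part)
```

With the class quoted above, the hook could plausibly be something like `lambda html: HtmlConverter()._convert(html).text_content`, assuming the result object exposes `text_content` as an attribute.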

+1

gagb commented Dec 20, 2024

@0xRaduan love this PR. We already have a dependency for HTML-to-text conversion (markdownify) in the HTML converter. Can you check whether that would be sufficient?

gagb added the "awaiting op response" label (the PR is awaiting response/edits from the original poster) on Dec 20, 2024