bug/br tag tail text loss #3899

Open
K-Oxon opened this issue Feb 3, 2025 · 0 comments
Labels
bug Something isn't working

K-Oxon commented Feb 3, 2025

Describe the bug
The HtmlTable.from_html_text() method drops text content that follows <br/> tags when normalizing HTML tables. This causes loss of important content in table cells that contain line breaks.

To Reproduce

from unstructured.common.html_table import HtmlTable
html_text = """
<table>
<tr>
<td>This is 1st line.<br/>2nd line.<br/>3rd line.</td>
</tr>
</table>
"""
table = HtmlTable.from_html_text(html_text)
print(table.html)

Output:

<table><tr><td>This is 1st line.<br/><br/></td></tr></table>

Expected Output:

<table><tr><td>This is 1st line.<br/>2nd line.<br/>3rd line.</td></tr></table>

Expected behavior

The text content following <br/> tags should be preserved during HTML normalization. Currently, the tail text of <br/> elements is being removed, which results in loss of content.
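
To make "tail text" concrete, here is a minimal sketch of where the cell's content sits in an element tree. It assumes the standard library's `xml.etree.ElementTree` as a stand-in for the HTML parser used by `unstructured`; both expose the same `.text` / `.tail` attributes, but this is not the library's own code.

```python
import xml.etree.ElementTree as ET

# The cell from the reproduction above; it is well-formed XML, so the
# stdlib parser can read it (assumption: stand-in for the real HTML parser).
cell = ET.fromstring("<td>This is 1st line.<br/>2nd line.<br/>3rd line.</td>")

# Text before the first child element lives on the parent's .text
print(repr(cell.text))

# Text after each <br/> lives on that <br/> element's .tail,
# so clearing every tail discards "2nd line." and "3rd line."
for br in cell:
    print(br.tag, repr(br.tail))
```

Only the first line belongs to the `<td>` itself; the other two lines exist solely as tails of the `<br/>` elements, which is why dropping all tails loses them.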

Screenshots
No screenshots.

Environment Info

  • unstructured version: 0.16.17
  • Python version: 3.11
  • OS: macOS

Additional context
The issue could likely be resolved by modifying the from_html_text() method to preserve the tail text of <br/> tags while normalizing whitespace, and to keep removing tails for all other elements:

class HtmlTable:
    ...
    @classmethod
    def from_html_text(cls, html_text: str) -> "HtmlTable":
            ...
            # -- normalize br tag tail text
            if e.tag == "br":
                if e.tail:
                    e.tail = " ".join(e.tail.split())
            else:
                # -- remove tails for non-br elements
                if e.tail:
                    e.tail = None
            ...
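
The idea in the snippet above can be tried in isolation with a self-contained sketch. `normalize_table` is a hypothetical helper written for illustration, not part of `unstructured`, and it uses the standard library's `xml.etree.ElementTree` instead of the library's actual parser, so it only handles well-formed input.

```python
import xml.etree.ElementTree as ET


def normalize_table(html_text: str) -> str:
    """Hypothetical sketch: compact a table but keep text that follows <br/>."""
    table = ET.fromstring(html_text.strip())
    for e in table.iter():
        # Collapse runs of whitespace inside element text; drop it if empty.
        if e.text:
            e.text = " ".join(e.text.split()) or None
        if e.tag == "br" and e.tail:
            # Tail of a <br/> is real cell content: normalize, don't delete.
            e.tail = " ".join(e.tail.split()) or None
        else:
            # Tails of other elements are only newline/indent whitespace here.
            e.tail = None
    return ET.tostring(table, encoding="unicode")


html_text = """
<table>
<tr>
<td>This is 1st line.<br/>2nd line.<br/>3rd line.</td>
</tr>
</table>
"""
print(normalize_table(html_text))
```

With this input both "2nd line." and "3rd line." survive normalization. Note that ElementTree serializes empty elements as `<br />` (with a space), so the printed string differs cosmetically from the expected output shown earlier.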
K-Oxon added the bug label Feb 3, 2025