Describe the bug
The HtmlTable.from_html_text() method drops text content that follows <br/> tags when normalizing HTML tables. This causes loss of important content in table cells that contain line breaks.
To Reproduce
from unstructured.common.html_table import HtmlTable

html_text = """<table><tr><td>This is 1st line.<br/>2nd line.<br/>3rd line.</td></tr></table>"""
table = HtmlTable.from_html_text(html_text)
print(table.html)
Output:
<table><tr><td>This is 1st line.<br/><br/></td></tr></table>
Expected Output:
<table><tr><td>This is 1st line.<br/>2nd line.<br/>3rd line.</td></tr></table>
Expected behavior
The text content following <br/> tags should be preserved during HTML normalization. Currently, the tail text of <br/> elements is being removed, which results in loss of content.
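To illustrate what "tail text" means here, the following is a minimal sketch using the standard library's `xml.etree.ElementTree`, which models element text the same way lxml does. It is not the code inside `unstructured`; it only shows where the dropped content lives in the parsed tree. The names `td` and `tails` are chosen for illustration.

```python
import xml.etree.ElementTree as ET

html_text = "<table><tr><td>This is 1st line.<br/>2nd line.<br/>3rd line.</td></tr></table>"

# The fragment is well-formed XML, so ElementTree can parse it directly.
root = ET.fromstring(html_text)
td = root.find("./tr/td")

# Text before the first child element is stored on the parent as .text.
print(td.text)  # This is 1st line.

# Text after each <br/> is stored on the <br/> element itself as .tail,
# so clearing .tail on <br/> discards the 2nd and 3rd lines.
tails = [br.tail for br in td.findall("br")]
print(tails)  # ['2nd line.', '3rd line.']
```

Because the second and third lines exist only as `.tail` values on the `<br/>` elements, any normalization step that clears or ignores tails would produce the output reported above.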
Screenshots
No screenshots.
Environment Info
unstructured version: 0.16.17
Python version: 3.11
OS: macOS
Additional context
A possible fix is to modify the from_html_text() method so that it preserves the tail text of <br/> tags while normalizing whitespace.
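A rough sketch of that idea, written against the standard library's `xml.etree.ElementTree` rather than the library's actual code: the helper names `collapse_whitespace` and `normalize_table` are hypothetical, and the real `from_html_text()` may be structured quite differently. The sketch assumes the normalization strips attributes and inter-element whitespace, and it keeps the tail only for `<br/>` elements.

```python
import xml.etree.ElementTree as ET
from typing import Optional


def collapse_whitespace(text: Optional[str]) -> Optional[str]:
    """Collapse runs of whitespace; return None for empty or missing text."""
    if text is None:
        return None
    return " ".join(text.split()) or None


def normalize_table(html_text: str) -> str:
    """Hypothetical normalizer: drop attributes, tidy whitespace, keep <br/> tails."""
    root = ET.fromstring(html_text)
    for element in root.iter():
        element.attrib.clear()
        element.text = collapse_whitespace(element.text)
        if element.tag == "br":
            # The tail of <br/> is real cell content, so normalize it instead of dropping it.
            element.tail = collapse_whitespace(element.tail)
        else:
            # Tails of structural elements (<tr>, <td>, ...) are treated as formatting whitespace.
            element.tail = None
    return ET.tostring(root, encoding="unicode")


result = normalize_table(
    "<table><tr><td>This is 1st line.<br/>2nd line.<br/>3rd line.</td></tr></table>"
)
print(result)
```

Note that this simplified version also discards the tail of any other inline element inside a cell, and collapsing whitespace trims leading and trailing spaces, so a real fix would need to decide how to treat those cases.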