html being decoded and triggering as side effect. #25

mbret · 2024-12-02T21:02:25Z

I have an epub with this content:

<section class="sect2" title="Figures">
					<h3 class="title" id="_figures">Figures</h3>
					<p>Coming up for a quick breath of fresh air before descending into another
						accessibility attribute pain point, HTML5 introduces the handy new <code
							class="literal">figure</code> element for encapsulating content
						associated with an image, table, or code example. Grouping related content
						elements together, as is becoming an old theme now, makes it simpler for a
						reader to navigate and understand your content:</p>
					<pre class="screen">&lt;figure&gt;
    &lt;img src="images/blob.jpeg" alt="the blob"/&gt;
    &lt;figcaption&gt;
        Figure 3.7 &amp;#x2014; The blob is digesting Steve McQueen in this
        unreleased ending to the classic movie.
    &lt;/figcaption&gt;
&lt;/figure&gt;</pre>
					<p>Unfortunately, there is little support for these two new elements at this
						time, so they get treated as no better than <code class="literal">div</code>
						elements. That said, it’s still preferable to future-proof your data and do
						the right thing, as support will catch up, especially since the only other
						alternative is semantically meaningless <code class="literal">div</code>
						elements.</p>
				</section>

The problematic is the figure section. When the method decodeEntities is called with the text node, the el.innerHTML = str instruction will actually load and run the html code that was in the figure. Unintentionally trigger http requests and other wort things.

Is it an actual bug or this side effect is expected and we should not work with live document ?

In the meantime I was trying to find a safer alternative and here what I came up with:

decodeEntities(dom: Document, str: string) {
    try {
      const parser = new DOMParser()
      const doc = parser.parseFromString(
        `<!DOCTYPE html><text>${str}</text>`,
        "text/html",
      )
      return doc.querySelector("text")?.textContent || str
    } catch (err) {
      // TODO fall back to simpler decode?
      // e.g. regex match for stuff like &#160; and &nbsp;
      return str
    }
  }

It does not crash and seems to be returning the correct node and CFI. I am not super confident regarding performance of using DOMParser vs the same document where we create an new temporary element

The text was updated successfully, but these errors were encountered:

mbret · 2024-12-02T21:18:07Z

Okay so another alternative I just though which could be faster is to create a temporary dom like this:

const newDom = document.implementation.createHTMLDocument()
      const el = newDom.createElement(`textarea`)
      el.innerHTML = str
      return el.value || ``

Since it's not the current dom, it does not trigger html resources. However I believe we could initialize it on the constructor and destroy it once the CFI is resolved, that could increase performances.

Additionally, why is this decoding needed to begin with ? Why isn't the string just enough ?

Juul · 2024-12-03T11:21:22Z

The problematic is the figure section. When the method decodeEntities is called with the text node, the el.innerHTML = str instruction will actually load and run the html code that was in the figure. Unintentionally trigger http requests and other wort things.

Wow nice find!

Is it an actual bug or this side effect is expected and we should not work with live document ?

Yeah this is definitely a bug.

I don't have time to look at this right now but I'll try to squeeze it in some time this week. Do ping me again if I forget to get back to you. Feel free to make a pull request with your proposed solution and thanks for reporting this!

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

html being decoded and triggering as side effect. #25

html being decoded and triggering as side effect. #25

mbret commented Dec 2, 2024

mbret commented Dec 2, 2024 •

edited

Loading

Juul commented Dec 3, 2024

html being decoded and triggering as side effect. #25

html being decoded and triggering as side effect. #25

Comments

mbret commented Dec 2, 2024

mbret commented Dec 2, 2024 • edited Loading

Juul commented Dec 3, 2024

mbret commented Dec 2, 2024 •

edited

Loading