Fails to render a dynamic xml badge for webpages with a lowercased <!doctype> #10827

WesselKroos · 2025-01-17T16:36:13Z

Are you experiencing an issue with...

shields.io

🐞 Description

When an html document starts with a lowercased <!doctype> tag the following error is thrown:
Not well-formed XML starting with "<!"

🔗 Link to the badge

Failing badges on domains with lowercased doctypes

Badge url: https://img.shields.io/badge/dynamic/xml?url=https%3A%2F%2Fwww.google.com%2F&query=%2F%2Fdiv%5B%40id%3D%22gws-output-pages-elements-homepage_additional_languages__als%22%5D%2Fdiv
Url: https://www.google.com

Badge:
https://img.shields.io/badge/dynamic/xml?url=https%3A%2F%2Faddons.opera.com%2Fen%2Fextensions%2Fdetails%2Fyoutube-ambilight%2F&query=%2F%2Fdt%5Bcontains(text()%2C%27Downloads%27)%5D%2Ffollowing-sibling%3A%3Add%5B1%5D
Url: https://addons.opera.com/en/extensions/details/youtube-ambilight/

Working badge on a domain with uppercased doctype

Badge: https://img.shields.io/badge/dynamic/xml?url=https%3A%2F%2Fwww.bbc.com%2F&query=%2F%2Fdiv%5B%40data-testid%3D%22social-follow-us-container%22%5D%2Fh2
Url: https://www.bbc.com/

💡 Possible Solution

Transform <!doctype> to <!DOCTYPE> before parsing the document with xmldom?

The text was updated successfully, but these errors were encountered:

chris48s · 2025-01-17T19:43:40Z

As a first step, have you considered raising an issue upstream on https://github.com/xmldom/xmldom and suggest doctype should be case insensitive when mimeType=text/html ?

xmldom's goal is to be a ponyfill for window.DOMParser If I parse a document like this using DOMParser in a browser, I get back an HTMLDocument instance.

WesselKroos · 2025-01-17T22:17:55Z

@chris48s Thanks for the context. I didn't know for sure if it was just a xml or also a html parser.
This seems to be a regression that was introduced in @xmldom/[email protected]. I've now also created a bugreport in that repository: xmldom/xmldom#817

karfau · 2025-01-19T08:23:39Z

@chris48s can you hep me understand, is this code always parsing "dynamic xml" with the mime type text/html?

The reason is that the HTML spec only allows <!DOCTYPE html> (case insensitive) or <!DOCTYPE html SYSTEM 'about:legacy-compat'> or <!DOCTYPE html SYSTEM "about:legacy-compat"> (case insensitive outside of the single or double quotes).

I assume it is very likely there are also XHTML pages out there with more advanced doctypes.
So if xmldom would fix this by only allowing what is in the specs, those pages could no longer be parsed using the html mime type.

If that's the case I will need to provide a solution that sticks to the "lax parsing habits" of the html mime type.

chris48s · 2025-01-19T09:22:28Z

We look at the Content-Type header of the requested document.

If it includes text/html then we call parseFromString() with mimeType text/html. Otherwise, we assume text/xml.

shields/services/dynamic/dynamic-xml.service.js

Lines 138 to 144 in 152b8e9

    
           let contentType = 'text/xml' 
        
           if ( 
        
             res.headers['content-type'] && 
        
             res.headers['content-type'].includes('text/html') 
        
           ) { 
        
             contentType = 'text/html' 
        
           }

I would expect most of the time a proper XHTML document would be served with the content type header application/xhtml+xml, but obviously the responses we're going to receive will vary.

If part of the fix here is for us to be a bit more sophisticated about parsing the header and passing mime type information to parseFromString(), let me know.

Just as a bit of context: The primary purpose of this integration is to read data from XML documents, although we do try to play nice with HTML as far as possible.

karfau · 2025-01-19T11:17:02Z

That's great news. I will anyways support other doctypes in html, just to be sure.

BTW: xmldom exports isValidMimeType, which could be helpful in this case:
If it returns true you can use the value from the header as is. (It also returns false for undefined).
I can create a PR if you want.

chris48s · 2025-01-19T12:21:37Z

If you're up for submitting a PR, that would be great. A couple of notes:

A Content-Type header may contain character set information as well as just a mime type. For example, we should be able to correctly deal with a header like:

Content-Type: text/html; charset=utf-8

This is why we currently use .includes('text/html')

If the content type header is missing or invalid, lets fall back to assuming text/xml rather than throw an error/refuse to parse. The premise of this badge is that we're expecting an XML document.

jNullj · 2025-01-24T21:07:01Z

Solved at #10842 that bumbed xmldom to 0.9.7 which introduced a fix.
a working example can be seen at #10845

WesselKroos added the question Support questions, usage questions, unconfirmed bugs, discussions, ideas label Jan 17, 2025

karfau mentioned this issue Jan 19, 2025

pass matching mime type to xmldom; test [dynamicxml] #10830

Merged

chris48s mentioned this issue Jan 24, 2025

add tests for dynamic xml with lowercase doctype #10845

Merged

jNullj closed this as completed Jan 24, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fails to render a dynamic xml badge for webpages with a lowercased <!doctype> #10827

Fails to render a dynamic xml badge for webpages with a lowercased <!doctype> #10827

WesselKroos commented Jan 17, 2025

chris48s commented Jan 17, 2025

WesselKroos commented Jan 17, 2025

karfau commented Jan 19, 2025

chris48s commented Jan 19, 2025 •

edited

Loading

karfau commented Jan 19, 2025 •

edited

Loading

chris48s commented Jan 19, 2025

jNullj commented Jan 24, 2025

Fails to render a dynamic xml badge for webpages with a lowercased <!doctype> #10827

Fails to render a dynamic xml badge for webpages with a lowercased <!doctype> #10827

Comments

WesselKroos commented Jan 17, 2025

Are you experiencing an issue with...

🐞 Description

🔗 Link to the badge

Failing badges on domains with lowercased doctypes

Working badge on a domain with uppercased doctype

💡 Possible Solution

chris48s commented Jan 17, 2025

WesselKroos commented Jan 17, 2025

karfau commented Jan 19, 2025

chris48s commented Jan 19, 2025 • edited Loading

karfau commented Jan 19, 2025 • edited Loading

chris48s commented Jan 19, 2025

jNullj commented Jan 24, 2025

chris48s commented Jan 19, 2025 •

edited

Loading

karfau commented Jan 19, 2025 •

edited

Loading