Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fails to render a dynamic xml badge for webpages with a lowercased <!doctype> #10827

Closed
WesselKroos opened this issue Jan 17, 2025 · 7 comments
Closed
Labels
question Support questions, usage questions, unconfirmed bugs, discussions, ideas

Comments

@WesselKroos
Copy link

Are you experiencing an issue with...

shields.io

🐞 Description

When an html document starts with a lowercased <!doctype> tag the following error is thrown:
Not well-formed XML starting with "<!"

🔗 Link to the badge

Failing badges on domains with lowercased doctypes

Image
Badge url: https://img.shields.io/badge/dynamic/xml?url=https%3A%2F%2Fwww.google.com%2F&query=%2F%2Fdiv%5B%40id%3D%22gws-output-pages-elements-homepage_additional_languages__als%22%5D%2Fdiv
Url: https://www.google.com

Image
Badge:
https://img.shields.io/badge/dynamic/xml?url=https%3A%2F%2Faddons.opera.com%2Fen%2Fextensions%2Fdetails%2Fyoutube-ambilight%2F&query=%2F%2Fdt%5Bcontains(text()%2C%27Downloads%27)%5D%2Ffollowing-sibling%3A%3Add%5B1%5D
Url: https://addons.opera.com/en/extensions/details/youtube-ambilight/

Working badge on a domain with uppercased doctype

Image
Badge: https://img.shields.io/badge/dynamic/xml?url=https%3A%2F%2Fwww.bbc.com%2F&query=%2F%2Fdiv%5B%40data-testid%3D%22social-follow-us-container%22%5D%2Fh2
Url: https://www.bbc.com/

💡 Possible Solution

Transform <!doctype> to <!DOCTYPE> before parsing the document with xmldom?

@WesselKroos WesselKroos added the question Support questions, usage questions, unconfirmed bugs, discussions, ideas label Jan 17, 2025
@chris48s
Copy link
Member

As a first step, have you considered raising an issue upstream on https://github.com/xmldom/xmldom and suggest doctype should be case insensitive when mimeType=text/html ?

xmldom's goal is to be a ponyfill for window.DOMParser If I parse a document like this using DOMParser in a browser, I get back an HTMLDocument instance.

@WesselKroos
Copy link
Author

@chris48s Thanks for the context. I didn't know for sure if it was just a xml or also a html parser.
This seems to be a regression that was introduced in @xmldom/[email protected]. I've now also created a bugreport in that repository: xmldom/xmldom#817

@karfau
Copy link
Contributor

karfau commented Jan 19, 2025

@chris48s can you hep me understand, is this code always parsing "dynamic xml" with the mime type text/html?

The reason is that the HTML spec only allows <!DOCTYPE html> (case insensitive) or <!DOCTYPE html SYSTEM 'about:legacy-compat'> or <!DOCTYPE html SYSTEM "about:legacy-compat"> (case insensitive outside of the single or double quotes).

I assume it is very likely there are also XHTML pages out there with more advanced doctypes.
So if xmldom would fix this by only allowing what is in the specs, those pages could no longer be parsed using the html mime type.

If that's the case I will need to provide a solution that sticks to the "lax parsing habits" of the html mime type.

@chris48s
Copy link
Member

chris48s commented Jan 19, 2025

We look at the Content-Type header of the requested document.

If it includes text/html then we call parseFromString() with mimeType text/html. Otherwise, we assume text/xml.

let contentType = 'text/xml'
if (
res.headers['content-type'] &&
res.headers['content-type'].includes('text/html')
) {
contentType = 'text/html'
}

I would expect most of the time a proper XHTML document would be served with the content type header application/xhtml+xml, but obviously the responses we're going to receive will vary.

If part of the fix here is for us to be a bit more sophisticated about parsing the header and passing mime type information to parseFromString(), let me know.

Just as a bit of context: The primary purpose of this integration is to read data from XML documents, although we do try to play nice with HTML as far as possible.

@karfau
Copy link
Contributor

karfau commented Jan 19, 2025

That's great news. I will anyways support other doctypes in html, just to be sure.

BTW: xmldom exports isValidMimeType, which could be helpful in this case:
If it returns true you can use the value from the header as is. (It also returns false for undefined).
I can create a PR if you want.

@chris48s
Copy link
Member

If you're up for submitting a PR, that would be great. A couple of notes:

  1. A Content-Type header may contain character set information as well as just a mime type. For example, we should be able to correctly deal with a header like:

Content-Type: text/html; charset=utf-8

This is why we currently use .includes('text/html')

  1. If the content type header is missing or invalid, lets fall back to assuming text/xml rather than throw an error/refuse to parse. The premise of this badge is that we're expecting an XML document.

@jNullj
Copy link
Member

jNullj commented Jan 24, 2025

Solved at #10842 that bumbed xmldom to 0.9.7 which introduced a fix.
a working example can be seen at #10845

@jNullj jNullj closed this as completed Jan 24, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
question Support questions, usage questions, unconfirmed bugs, discussions, ideas
Projects
None yet
Development

No branches or pull requests

4 participants