Wrong normalization of message with HTML #125
Comments
@moaminsharifi, the underlying problem here is that I thought we could normalize a translation like
msgid "foo bar"
msgstr "FOO BAR"
by running

We can throw in as much custom logic as we want in the normalization tool: I consider it a bit of free space where we clean up our past problems 🙂 However, it would be nice if we could find a clean way to keep the inline HTML intact in general.

@dyoo, you've run into similar problems with the parsing of HTML comments. Do you have an idea of how we can stop splitting out the inline HTML?
This is related to #97 as well: when we extract messages from a Markdown string, we get multiple messages if the Markdown contains HTML. So "foo <span>bar</span>" becomes

Since HTML is very rare in my experience, it would probably be better to not split it out like that. So maybe we should just let
With the pulldown-cmark upgrade the span tags are detected as The change to
There are two ways this could go:

With the statement above that HTML is rare in these translations, I would go for solution a). Any thoughts on this, @mgeisler?
Thanks for looking at this!
I would also lean towards solution a). The less opinionated we have to be the better in my opinion 😄 The amount of HTML allowed in the Markdown is ultimately a decision for the people who use the tooling here, so I don't think we need to guard against JavaScript tags or similar.
Great! Just for my own curiosity, does an HTML comment like this in its own paragraph show up as

A related question would be what happens to Markdown wrapped in a large
<div class="warning">
Beware of the dog!
</div>
I haven't tested it, but I expect us to extract
An HTML comment in its own paragraph still shows up as HtmlBlock, so there is no change: see this link. But if it is used inline, it is detected as an Inline Block. To give a bigger example:

The pull request has further details and examples on this.
This is from google/comprehensive-rust#1471: running mdbook-i18n-normalize on a translated message that contains HTML gives the wrong output.

I can reproduce this with
This tests normalizing a catalog looking like this:
The test fails with
which tells us that the normalized catalog contains two messages:
In other words, the original message was split into two messages.
In case only the translation contains HTML, the normalization will end up with a different number of messages: 1 for the msgid field, and 2 for the msgstr field. This is seen as an error, so a fallback kicks in: the normalized message is marked fuzzy and we accumulate the "left-over" messages into the final message.

This behavior can be seen in this unit test in normalize.rs:

This is what happened in google/comprehensive-rust#1471 where we transform
into
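
The fallback described above (a mismatch in the number of extracted messages leads to a fuzzy result with the "left-over" messages accumulated into the final message) can be sketched roughly as follows. This is only an illustration and not the actual code from normalize.rs: the names `NormalizedMessage` and `pair_messages` are hypothetical, and both the blank-line separator used when joining left-over pieces and the choice to mark every resulting message fuzzy are assumptions.

```rust
/// Hypothetical stand-in for a normalized catalog entry.
#[derive(Debug, PartialEq)]
struct NormalizedMessage {
    msgid: String,
    msgstr: String,
    fuzzy: bool,
}

/// Pair the pieces extracted from `msgid` with those extracted from `msgstr`.
///
/// If the two sides were split into a different number of pieces, the
/// result is marked fuzzy and all left-over pieces are accumulated into
/// the final message (joined with a blank line, which is an assumption).
fn pair_messages(msgids: &[&str], msgstrs: &[&str]) -> Vec<NormalizedMessage> {
    let fuzzy = msgids.len() != msgstrs.len();
    let count = msgids.len().min(msgstrs.len());
    let mut result = Vec::with_capacity(count);

    for i in 0..count {
        let is_last = i + 1 == count;
        let (msgid, msgstr) = if is_last {
            // The final message absorbs everything that is left on either side.
            (msgids[i..].join("\n\n"), msgstrs[i..].join("\n\n"))
        } else {
            (msgids[i].to_string(), msgstrs[i].to_string())
        };
        result.push(NormalizedMessage { msgid, msgstr, fuzzy });
    }

    result
}
```

With one piece on the msgid side and two pieces on the msgstr side, this sketch returns a single fuzzy message whose msgstr contains both pieces. With equal counts, the pieces are paired one to one and nothing is marked fuzzy.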