-
Notifications
You must be signed in to change notification settings - Fork 147
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Feature request: safe mode #19
Comments
Ack. This is needed. The open question is what exactly it should do. My current blurry idea includes these points:
Opinions? Other ideas? |
It would be nice to be able to whitelist some HTML tags, like Snudown (Reddit’s non-conformant parser) can. But that would be hard. On the other hand, MD4C already must be able to parse out a tag for inline HTML, and in safe mode we can simply disallow anything that isn’t strictly valid. (Checking for basic syntax correctness isn’t that hard. The hard part in parsing HTML is handling invalid markup. But we can simply reject invalid markup.) |
Whitelisting inline (correctly formed) tags is fine. But the raw HTML block with its start and end conditions is quite different story. So to be sure, you propose to treat the raw HTML blocks in the safe mode differently and recognize them only if they are a sequence of complete and valid (whitelisted) tags. I.e. treat them almost the same way as paragraph composed of nothing but raw HTML inlines and the only difference would be the block is not enclosed in I'll think about it. |
@mity I was thinking of ignoring HTML blocks entirely, which MD4C can already do. Much simpler and less prone to security vulnerabilities, both memory safety violations and whitelist validation failures. And HTML blocks are not needed for most applications that want a safe mode. |
Ugh. That means that this:
would be translated to this:
|
Yeah, that's not good. Fortunately we can white list that kind of HTML
block.
…On Jul 14, 2017 7:46 PM, "Martin Mitáš" ***@***.***> wrote:
Ugh. That means that this:
<div>
Paragraph 1.
Paragraph 2.
</div>
would be translated to this:
<p><div></p>
<p>Paragraph 1.</p>
<p>Paragraph 2.</p>
<p></div></p>
—
You are receiving this because you authored the thread.
Reply to this email directly, view it on GitHub
<#19 (comment)>, or mute
the thread
<https://github.com/notifications/unsubscribe-auth/AGGWB4hUkB52AMRzWDxDKuGm8oK8SYeuks5sN_3BgaJpZM4OYdD9>
.
|
I tried to implement something as discussed above. It is very incomplete but I already can see this approach has some big problems. So I consider it to be just a proof of concept which shows its severe limitations and that real solution needs more worked then I thought. (As always ;-) The open problems:
|
I disagree. There will certainly be safe cases that are rejected, but
detecting an HTML tag in "canonical form" (that that serializers emit) is
not any harder than recognizing an XML tag. In fact, one can do both(!)
with regular expressions. The difficulties appear when one needs to handle
HTML that is invalid, or that isn't in canonical form.
That said, a proper HTML parser is necessary if one wants to ensure that
markup is valid, unless one wants to implement most of an XML parser. For
many use-cases, this isn't a problem.
…On Jul 22, 2017 9:54 PM, "Craig Barnes" ***@***.***> wrote:
It seems to me like there are only 2 truly "safe" ways to handle inline
HTML.
1. Disallow HTML tags entirely and escape angle brackets as < and
>.
2. Allow HTML as usual and post-process the output through a proper
HTML sanitizer.
The number of ways to trip up anything short of a full HTML5 parser are
too many.
—
You are receiving this because you authored the thread.
Reply to this email directly, view it on GitHub
<#19 (comment)>, or mute
the thread
<https://github.com/notifications/unsubscribe-auth/AGGWB5tBhB_foMijg643h5Q_8I5M2yltks5sQqftgaJpZM4OYdD9>
.
|
We definitely don't want to allow comments – IE executes scripts in
conditional comments. I think it is best to strip them out.
Processing instructions and CDATA sections should just be unrecognized. I
don't understand why CommonMark recognizes them; they are parse errors in
HTML and thus rather useless. I think that the only HTML that should be
allowed in safe mode should be complete, whitelisted, and well-formed
tags. Anything else should not be allowed.
…On Jul 18, 2017 5:34 PM, "Martin Mitáš" ***@***.***> wrote:
I tried to implement something as discussed above. It is very incomplete
but I already can see this approach has some big problems. So I consider it
to be just a proof of concept which shows its severe limitations and that
real solution needs more worked then I thought. (As always ;-)
The open problems:
-
Raw HTML blocks: Currently not handled at all.
-
filter_url_scheme() callback: Is it the right way? Maybe the app might
want to see whole link destination and not just the scheme. (Consider e.g.
relative links without any scheme. App might still want to forbid them if
they start e.g. with .. or '/'). That way, the interface could be
unified with the attribute callback (see below.) On the flip side, app then
must parse the scheme on its own if that's what it wants to filter.
-
Tag attributes: Should not be there also filter_raw_html_tag_attribute(
)? Consider e.g. all those onclick, onload and similar HTML attributes.
-
Markdown construct versus semantically equivalent raw HTML: Consider
e.g. [innocent text](dangerous_url) versus <a
href="dangerous_url">innocent text</a>. Should not this be somehow
unified in the filtering interface? Maybe, for the filtering purposes, the
Markdown construct should be translated to raw html tag/attribute firthe
purposes of the filtering, so app does not have a chance to screw it up by
forbidding one and allowing the 2nd or vice versa?
-
HTML comments or CDATA;? Do we want to support them as well in the
safe mode? I tend to say no and to keep it relatively simple.
—
You are receiving this because you authored the thread.
Reply to this email directly, view it on GitHub
<#19 (comment)>, or mute
the thread
<https://github.com/notifications/unsubscribe-auth/AGGWB-n2Eel6U_-QMepwm_Ann1iT2HRyks5sPST2gaJpZM4OYdD9>
.
|
Whitelisting specific elements isn't sufficient to stop scripting attacks. Attributes also need to be filtered: <div style="background: red; width: 100%; height: 100%" onmouseover="alert('pwned')"> |
Yeah. I have realized that too (see the comment with open comments). I am in process of thinking how the interface of the filtering should look like. |
Right now, I tend to discard the attempt above and start again. As said, I am mostly wondering how the interface should look like. I am especially interested in a feedback for the following questions:
|
The only cases where you need to know the element name, attribute name and attribute value at the same time are The large majority of attributes suitable for whitelisting are global attributes (applicable/safe for all elements), where the value is implicitly safe (doesn't need to be known by the filter, doesn't depend on the presence of other attributes). I've implemented a minimal HTML5 DOM sanitizer in Lua here. The safeness of the code is dependant on the correctness of the underlying HTML5 parser, but the set of whitelisting rules should be similar to what you might want to use here.
Just that attribute should be omitted. I can't think of any case where removing a non-whitelisted attribute from a whitelisted element would make it unsafe. |
This is a feature request for a “safe mode”: potentially-malicious content (such as certain URL schemes) is disallowed to prevent XSS attacks.
The output from this should be safely insertable into a webpage, without any further escaping or sanitization.
The text was updated successfully, but these errors were encountered: