Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Generate ext-parameter when utf8 and latin1 representations don't match #27

Open
jeremyspiegel opened this issue Sep 20, 2019 · 14 comments

Comments

@jeremyspiegel
Copy link

This library will only generate the extended filename parameter if the filename is not representable in ISO-8859-1 (checking against NON_LATIN1_REGEXP).

However, some clients will not try to interpret the filename as ISO-8859-1. Chromium's Content-Disposition parsing code tries to interpret it as UTF-8, a referrer_charset, and the native OS default charset in turn. This ends up sometimes working for normal downloads in Chromium since it passes a referrer_charset that defaults to windows-1252. But this doesn't work for some other locales or cases where this referrer_charset isn't passed (like downloads in Electron).

It seems like it would be useful to generate the extended parameter when the utf8 and latin1 representations don't match.

@dougwilson
Copy link
Contributor

Ah, I see. So the default module of this module is to follow the specification, as in just because Chrome does not follow the actual specification does not make the bug in this module, as of course there are other clients who do follow the spec.

What if instead of just generating a longer string when not actually required this module adds an option to allow you to always generate both, for example, so you can decide what to do (maybe sniff the user-agent header and only do it then, or unconditionally if that is what you want)?

@dougwilson
Copy link
Contributor

Another potentional option is that this module could kick in the use of extension parameter if the filename falls outside of US-ASCII instead of ISO-8859-1, though in order to know if that would actually be effective, we would need a much better understanding of what exactly these buggy clients are doing with the bytes in the file name, as not all single byte character schemes match the US-ASCII set for the first 127 characters.

@jeremyspiegel
Copy link
Author

I'll create a Chromium issue as well for this. However, the specification advises to use ASCII rather than ISO-8859-1:

Appendix D.  Advice on Generating Content-Disposition Header Fields

   To successfully interoperate with existing and future user agents,
   senders of the Content-Disposition header field are advised to:
   o  Avoid using non-ASCII characters in the filename parameter.
      Although most existing implementations will decode them as
      ISO-8859-1, some will apply heuristics to detect UTF-8, and thus
      might fail on certain names.

This strategy seems to be followed by Django, Flask, and Rails.

@dougwilson
Copy link
Contributor

Right, probably the reason that is added to the appendix is because there are buggy clients out there who are not following the specification, as they are trying to account for buggy servers who are accidentally generated bad header values in the wrong character set. It is definitely a tricky problem, which is why I generally try to follow the actual standards, since they actually spell out the way to do it.

That is why I also provided the second suggestion above, though. The spec itself clearly spells out that the header is ISO-8859-1 character encoding, regardless of buggy client implementations. So this implementation is not buggy here.

I suggested that maybe we can add options so you can make the decisions you like to better control the behavior that would work better for your case (my first comment). Would that work for you?

@jeremyspiegel
Copy link
Author

I appreciate your quick response, and I understand wanting to avoid sending extra data for misbehaving clients, but I think this module would be better if by default it followed the advice of the spec and the behavior of other popular web server frameworks. I understand if you disagree — an option would also be helpful. Thanks!

@dougwilson
Copy link
Contributor

Right. Of course the RFC has specific wording for the spec, like MUST NOT and SHOULD NOT, etc. The appendix is not even a main point of the spec, relegated to a footnote, without even using the standard wording, so it's hard to draw concrete instructions from it.

So to be clear: I'm not saying I am not going to change anything, and your discussion on this is valuable, so please do not feel dismayed.

@jeremyspiegel
Copy link
Author

I might be missing something, but I don't actually see the spec use MUST, SHOULD, MAY, etc. in regards to using ISO-8859-1. It seems to only mention ISO-8859-1 when talking about the "filename*" parameter, and how it allows characters outside the ISO-8859-1 character set. This does imply that the "filename" parameter can be ISO-8859-1, but it doesn't actually come out and say that anywhere else, so it seems a little ambiguous to me. However, I do think the appendix is clear about the recommendation to avoid non-ASCII characters.

Here is the Chromium bug I filed: https://bugs.chromium.org/p/chromium/issues/detail?id=1006345.

Thanks again for considering a change here!

@dougwilson
Copy link
Contributor

I might be missing something, but I don't actually see the spec use MUST, SHOULD, MAY, etc. in regards to using ISO-8859-1. It seems to only mention ISO-8859-1 when talking about the "filename*" parameter, and how it allows characters outside the ISO-8859-1 character set. This does imply that the "filename" parameter can be ISO-8859-1, but it doesn't actually come out and say that anywhere else, so it seems a little ambiguous to me.

This is because since this spec is in relation to a HTTP header, the spec for HTTP itself is what declares header values as ISO-8859-1, so it only passively mentions it since it assumes that as prior knowledge to the specific Content-Disposition header specification. You can see they try to provide the needed link a few times in the spec, notably in https://tools.ietf.org/html/rfc6266#appendix-C

By default, HTTP header field parameters cannot carry characters outside the ISO-8859-1 ([ISO-8859-1]) character encoding (see [RFC2616], Section 2.2).

However, I do think the appendix is clear about the recommendation to avoid non-ASCII characters.

Yes, I do not disagree here. I was just noting it was relegated to an appendix, when it could have been made an actual rule by either not defining the ABNF of the the header as

     filename-parm       = "filename" "=" value
     value         = <value, defined in [RFC2616], Section 3.6>
                   ; token | quoted-string

in which case the value is defined in RFC2616 as

       value                   = token | quoted-string
       quoted-string  = ( <"> *(qdtext | quoted-pair ) <"> )
       qdtext         = <any TEXT except <">>

with TEXT being

The TEXT rule is only used for descriptive field contents and values
that are not intended to be interpreted by the message parser. Words
of *TEXT MAY contain characters from character sets other than ISO-
8859-1 [22] only when encoded according to the rules of RFC 2047
[14].

       TEXT           = <any OCTET except CTLs,
                        but including LWS>

or annotating in the main part of the spec that TEXT "MUST NOT" contain characters above 0x7f (which, also, they could have just not used the quoted-string ABNF from the RFC2616).

@dougwilson
Copy link
Contributor

Looking around it seems like Firefox has a similar bug open: https://bugzilla.mozilla.org/show_bug.cgi?id=588409

@jeremyspiegel
Copy link
Author

From the new spec:

   Historically, HTTP has allowed field content with text in the
   ISO-8859-1 charset [ISO-8859-1], supporting other charsets only
   through use of [RFC2047] encoding.  In practice, most HTTP header
   field values use only a subset of the US-ASCII charset [USASCII].
   Newly defined header fields SHOULD limit their field values to
   US-ASCII octets.  A recipient SHOULD treat other octets in field
   content (obs-text) as opaque data.

I guess Content-Disposition isn't a newly defined field so I'm not sure if this is necessarily relevant, but it's interesting to note.

@dougwilson
Copy link
Contributor

Yea, the CD header existed well before the new specs were written, back in the historical days. The main reason why this module is emitting them was due to users who needed this (back when it was part of expressjs before extracted into it's own module).

Many languages are covered by latin1 and there are clients where when the extended filename parameter is included they fail to parse the header. So back then the compromise was to make it so the full usage of latin1 would result in just the plain filename parameter, allowing things like spanish, french, and more all to generate the header without the extended parameter being there and working for various clients.

There are many more clients out there than web browsers, for example both cURL and wget command line tools parse this header to determine the filename to save to disk.

This is a hard balancing act and so that I don't have to read every client implementation out there, the compromise was fairly straight forward: just follow the spec and since it allows the usage of iso-8859-1 (latin1) in the basic filename parameter it made everyone happy. Apparently for many mant years until now...? Haha

@dougwilson
Copy link
Contributor

dougwilson commented Sep 20, 2019

And, ultimately, nothing in the specs seems to be pointing to Chrome is doing the correct thing. At best, the specs admit that there are buggy clients out there (like Chrome, I guess).

@jeremyspiegel
Copy link
Author

Fair enough, thank you for that explanation.

@jeremyspiegel
Copy link
Author

A Chromium developer has responded to the bug:

I'm skeptical that changing behavior here is worth the investment and dealing with the inevitable breakage, unless we want to reach consensus with other browsers. https://tools.ietf.org/html/rfc7230 pretty clearly bans non-ASCII characters in HTTP headers:

 header-field   = field-name ":" OWS field-value OWS

 field-name     = token
 field-value    = *( field-content / obs-fold )
 field-content  = field-vchar [ 1*( SP / HTAB ) field-vchar ]
 field-vchar    = VCHAR / obs-text

From https://tools.ietf.org/html/rfc5234#appendix-B.1
VCHAR = %x21-7E

So we're dealing with violations of the HTTP spec, and the question is just how to deal with them. (obs-fold is related to line folding, and considered obsolete - regardless, not relevant here).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants