multipart/form-data, what is the default charset for fields?

僤鯓⒐⒋嵵緔 提交于 2019-12-01 03:09:46

The default charset for HTTP 1.1 is ISO-8859-1 (Latin1), I would guess that this also applies here.

3.7.1 Canonicalization and Text Defaults

--snip--

The "charset" parameter is used with some media types to define the character set (section 3.4) of the data. When no explicit charset parameter is provided by the sender, media subtypes of the "text" type are defined to have a default charset value of "ISO-8859-1" when received via HTTP. Data in character sets other than "ISO-8859-1" or its subsets MUST be labeled with an appropriate charset value. See section 3.4.1 for compatibility problems.

This apparently has changed in HTML5 (see http://dev.w3.org/html5/spec-preview/constraints.html#multipart-form-data).

The parts of the generated multipart/form-data resource that correspond to non-file fields must not have a Content-Type header specified.

So where is the character set specified? As far as I can tell from the encoding algorithm, the only place is within a form data set entry named _charset_.

If your form does not have a hidden input named _charset_, what happens? I've tested this in Chrome 28, sending a form encoded in UTF-8 and one in ISO-8859-1 and inspecting the sent headers and payload, and I don't see charset given anywhere (even though the text encoding definitely changes). If I include an empty _charset_ field in the form, Chrome populates that with the correct charset type. I guess any server-side code must look for that _charset_ field to figure it out?

I ran into this problem while writing a Chrome extension that uses XMLHttpRequest.send of a FormData object, which always gets encoded in UTF-8 no matter what the source document encoding is.

Let the request entity body be the result of running the multipart/form-data encoding algorithm with data as form data set and with utf-8 as the explicit character encoding.

Let mime type be the concatenation of "multipart/form-data;", a U+0020 SPACE character, "boundary=", and the multipart/form-data boundary string generated by the multipart/form-data encoding algorithm.

As I found earlier, charset=utf-8 is not specified anywhere in the POST request, unless you include an empty _charset_ field in the form, which in this case will automatically get populated with "utf-8".

This is my understanding of the state of things. I welcome any corrections to my assumptions!

Thanks to the detailed explanation by @owlman.

Just some more info here:

Upload request payload fragment:

------WebKitFormBoundarydZAwJIasnBbGaUqM
Content-Disposition: form-data; name="file"; filename="xxx.txt"
Content-Type: text/plain

If "xxx.txt" has some UNICODE char in it using UTF-8 encoding, Resin(as of 4.0.40) can't decode it correctly, but Jetty(9.x) can.

I think the reason for Resin's behavior is that the Content-type doesn't specify any encoding, so Resin decode file name using "ISO8859-1", which may result in garbled characters.

I did some googling:

https://mail-archives.apache.org/mod_mbox/struts-user/200310.mbox/%3C3FA0395B.1080209@kumachan.net.nz%3E

It seems that Resin's behavior is according to Servlet Spec 2.3

And I can't find any settings from http://www.caucho.com/resin-4.0/reference.xtp which can change this behavior for Resin.

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!