How to support Chinese in http request body?

心在旅途 2021-01-25 06:43
URL = "http://example.com",
Header = [],
Type = "application/json",
Content = "我是中文",

Body = lists:concat(["{\"type\":\"0\",\"result\":[{\"url\":\"

1 answer
  •  梦毁少年i
    2021-01-25 07:04

    Luck of the Encoding

    Take special care to verify that your input is encoded the way you think it is; it may differ from what you expect.

    This answer applies to the Erlang release I'm running, R16B03-1. I'll include enough detail for you to test against your own install and verify.

    If you don't take specific action to change it, a string will be interpreted as follows:

    In the Terminal (OS X 10.9.2)

    TerminalContent = "我是中文",
    TerminalContent = [25105,26159,20013,25991].
    

    In the terminal, the string is interpreted as a list of Unicode code points.

    In a Module

    BytewiseContent = "我是中文",
    BytewiseContent = [230,136,145,230,152,175,228,184,173,230,150,135].
    

    In a module, the default encoding is latin1, and strings containing Unicode characters are interpreted as bytewise lists (of UTF-8 bytes).

    If you use data encoded like BytewiseContent, unicode:characters_to_list/1 treats every byte as a code point of its own. Encoding the result as UTF-8 then double-encodes the Chinese characters, and mojibake such as ææ¯ä reaches the server where you expected 我是中文.
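
    The double encoding can be reproduced in an Erlang shell. The following is a minimal sketch; it uses integer lists instead of string literals so that it does not depend on the source or terminal encoding, and the variable names are illustrative only.

```erlang
%% UTF-8 bytes of the four characters, as a latin1 module would see them.
Bytes = [230,136,145,230,152,175,228,184,173,230,150,135].
%% The same text as a list of Unicode code points.
Chars = [25105,26159,20013,25991].

%% Encoding the code points yields the 12 expected UTF-8 bytes.
Good = unicode:characters_to_binary(Chars).
%% Encoding the byte list treats every byte as a code point of its own,
%% so each byte above 127 is expanded to two bytes.
Bad = unicode:characters_to_binary(Bytes).

Good =:= list_to_binary(Bytes).  %% true
byte_size(Good).                 %% 12
byte_size(Bad).                  %% 24
```

    The 24-byte binary is what a server would receive and render as mojibake.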

    Solution

    1. Specify the encoding for each source file and term file.
    2. If you run an erl command line, ensure it is set up to use Unicode.
    3. If you read data from files, translate the bytes from the bytewise encoding to unicode before processing (this goes for binary data acquired using httpc:request/N as well).
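
    Step 3 can be sketched as follows. This is a minimal illustration and not part of the original example module: the module name decode_sketch and the function utf8_to_chars/1 are hypothetical, and the sketch assumes the incoming bytes are meant to be UTF-8.

```erlang
-module(decode_sketch).
-export([utf8_to_chars/1]).

%% Decode a UTF-8 binary (for example a file's contents or an HTTP
%% response body) into a list of Unicode code points.
%% Returns {ok, Chars} or {error, Reason} instead of crashing on bad input.
utf8_to_chars(Bin) when is_binary(Bin) ->
    case unicode:characters_to_list(Bin, utf8) of
        Chars when is_list(Chars) -> {ok, Chars};
        {error, _Decoded, _Rest} -> {error, invalid_utf8};
        {incomplete, _Decoded, _Rest} -> {error, truncated_utf8}
    end.
```

    For a file, this might be used as {ok, Bin} = file:read_file(Path) followed by decode_sketch:utf8_to_chars(Bin), where Path is whatever file you are reading.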

    If you embed unicode characters in your module, ensure that you indicate as much by commenting within the first two lines of your module:

    %% -*- coding: utf-8 -*-
    

    This will change the way the module interprets the string such that:

    UnicodeContent = "我是中文",
    UnicodeContent = [25105,26159,20013,25991].
    

    Once you have ensured that you are concatenating characters and not bytes, the concatenation is safe. Don't use unicode:characters_to_list/1 to convert your string/list until the whole thing has been built up.

    Example Code

    The following function works as expected when given a Url and Content as a list of Unicode characters:

    http_post_content(Url, Content) ->
        ContentType = "application/json",
        %% Concat the list of (character) lists
        Body = lists:concat(["{\"content\":\"", Content, "\"}"]),
        %% Explicitly encode to UTF8 before sending
        UnicodeBin = unicode:characters_to_binary(Body),
        httpc:request(post,
            {
                Url,
                [],          % HTTP headers
                ContentType, % content-type
                UnicodeBin   % the body as binary (UTF8)
                },
            [],            % HTTP Options
            [{body_format,binary}] % indicate the body is already binary
            ).
    

    To verify the results, I wrote the following HTTP server using Node.js and Express. Its sole purpose is to sanity-check the problem and the solution.

    var express = require('express'),
        bodyParser = require('body-parser'),
        util = require('util');
    
    var app = express();
    
    // Parse JSON request bodies.
    app.use(bodyParser.json());
    
    app.get('/', function(req, res){
      res.send('You probably want to perform an HTTP POST');
    });
    
    app.post('/', function(req, res){
      // Log the parsed body, then echo it back as JSON.
      console.log("body: " + util.inspect(req.body, { depth: 99 }));
      res.json(req.body);
    });
    
    app.listen(3000);
    

    Verifying

    Again in Erlang, the following function checks that the HTTP response contains the echoed JSON and that exactly the same Unicode characters came back.

    verify_response({ok, {{_, 200, _}, _, Response}}, SentContent) ->
        %% use jiffy to decode the JSON response
        {Props} = jiffy:decode(Response),
        %% pull out the "content" property value
        ContentBin = proplists:get_value(<<"content">>, Props),
        %% convert the binary value to unicode characters,
        %% it should equal what we sent.
        case unicode:characters_to_list(ContentBin) of
            SentContent -> ok;
            Other ->
                {error, [
                    {expected, SentContent},
                    {received, Other}
                    ]}
        end;
    verify_response(Unexpected, _) ->
        {error, {http_request_failed, Unexpected}}.
    

    The complete example.erl module is posted in a Gist.

    Once you've got the example module compiled and an echo server running you'll want to run something like this in an Erlang shell:

    inets:start().
    
    Url = example:url().
    
    Content = example:content().
    
    Response = example:http_post_content(Url, Content).
    

    If you've got jiffy set up you can also verify the content made the round trip:

    example:verify_response(Response, Content).
    

    You should now be able to confirm round-trip encoding of any unicode content.

    Translating Between Encodings

    Although I distinguished the encodings above, you will have noticed that TerminalContent, BytewiseContent, and UnicodeContent are all just lists of integers; nothing in the data itself tells you which one you have. Endeavor to code in a manner that lets you be certain what you have in hand.

    The oddball encoding is the bytewise one, which may turn up when working with modules that are not "unicode aware". Erlang's guidance on working with Unicode mentions this near the bottom, under the heading Lists of UTF-8 Bytes. To translate bytewise lists, use:

    %% from http://www.erlang.org/doc/apps/stdlib/unicode_usage.html
    utf8_list_to_string(StrangeList) ->
        unicode:characters_to_list(list_to_binary(StrangeList)).
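
    When you cannot be sure which kind of list you were handed, a heuristic is the best you can do. The following is a sketch under stated assumptions, not part of the original example: the module name encoding_sketch and the function normalize/1 are hypothetical. It assumes that a list whose elements all fit in a byte, and whose bytes form valid UTF-8, is a bytewise list.

```erlang
-module(encoding_sketch).
-export([normalize/1]).

%% Heuristically normalize a string to a list of Unicode code points.
%% If every element fits in a byte and the bytes form valid UTF-8,
%% assume a bytewise list and decode it; otherwise return it unchanged.
normalize(List) when is_list(List) ->
    IsByte = fun(C) -> is_integer(C) andalso C >= 0 andalso C =< 255 end,
    case lists:all(IsByte, List) of
        true ->
            case unicode:characters_to_list(list_to_binary(List), utf8) of
                Chars when is_list(Chars) -> Chars;
                _NotUtf8 -> List
            end;
        false ->
            List
    end.
```

    The guess can be wrong: a genuine latin1 string whose code points happen to form valid UTF-8 (for example [195,169]) would be decoded as well, so tracking the encoding explicitly remains preferable.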
    

    My Setup

    As far as I know, I don't have local settings that modify Erlang's behavior. My Erlang is R16B03-1 built and distributed by Erlang Solutions, my machine runs OS X 10.9.2.
