How can I render UTF-16BE in command line?

Question

I often come across a string representing UTF-16BE, such as \u0444\u0430\u0439\u043b, which would be properly rendered as файл.

I wonder: is there a simple way to "render" a text file in UTF-16BE (or simply an input string in UTF-16BE), such as the one above, using sed or another command-line tool?

Answer 1:

Assuming the text is actually encoded in UTF-16BE (and not, as you show in your question, as an ASCII string containing backslash and 'u' characters), you can use the iconv command.

Assuming your locale is set to handle UTF-8 output:

iconv -f utf-16be -t utf-8 [input-file]
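As a quick sanity check, the following sketch builds the eight bytes of файл in UTF-16BE by hand and pipes them through iconv. The variable name rendered is only illustrative, and the sketch assumes an iconv that accepts the UTF-16BE and UTF-8 encoding names.

```shell
# The four code units U+0444 U+0430 U+0439 U+043B, written big-endian as
# raw bytes. Octal escapes in the printf format string are POSIX.
rendered=$(printf '\004\104\004\060\004\071\004\073' | iconv -f UTF-16BE -t UTF-8)

# Prints the word as UTF-8; it displays correctly if the terminal uses UTF-8.
printf '%s\n' "$rendered"
```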

EDIT:

Based on your comments, what you have is not UTF-16BE at all; it's apparently plain ASCII, encoding Unicode code points using the \u.... syntax. This is not a format that iconv recognizes (as far as I know).
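The difference is easy to see by counting bytes. This is only a sketch with illustrative variable names: the escaped form is 24 ASCII bytes, while the same word in real UTF-16BE is 8 bytes.

```shell
# The escaped form: backslash, 'u', and four hex digits per character.
escaped='\u0444\u0430\u0439\u043b'
# $(( ... )) strips the padding that some wc implementations add.
escaped_len=$(( $(printf '%s' "$escaped" | wc -c) ))

# The same four characters as raw UTF-16BE bytes (two bytes each).
utf16_len=$(( $(printf '\004\104\004\060\004\071\004\073' | wc -c) ))

printf 'escaped: %d bytes, UTF-16BE: %d bytes\n' "$escaped_len" "$utf16_len"
```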

You should edit your question, removing any references to UTF-16BE and explaining more accurately what data you actually have, and what you want to do with it. Where did these strings come from? Are they stored in a text file, or did they come from some other source (say, the output of some program)? Does the input consist entirely of \u...., or is it mixed with other data? And are your locale settings configured to display UTF-8 properly?

If you have a string containing "\u0444\u0430\u0439\u043b" (that's 24 ASCII characters), then the printf command should work -- if you use a sufficiently recent version of printf.

printf is both a shell built-in and an external command, /usr/bin/printf, part of the GNU coreutils package.

The following works on my system:

$ s='\u0444\u0430\u0439\u043b'
$ printf "$s\n"
файл

Or you can use the %b format (this is specific to the printf command; C's printf() function doesn't do this), which interprets backslash escapes in argument strings (normally they're only interpreted in the format string):

$ printf "%b\n" "$s"
файл

On another system, with an older version of bash, the printf builtin doesn't recognize \u escapes -- but /usr/bin/printf does. It appears that the coreutils printf command gained support for \u escapes earlier than bash did.

$ s='\u0444\u0430\u0439\u043b'
$ printf "$s\n"
\u0444\u0430\u0439\u043b
$ printf "%b\n" "$s"
\u0444\u0430\u0439\u043b
$ /usr/bin/printf "$s\n"
файл
$ /usr/bin/printf "%b\n" "$s"
файл

All of this assumes you have the '\u0444\u0430\u0439\u043b' string in a variable. If it's in a file, you could slurp the file contents into a shell variable, probably a line at a time, but it's not the best solution. In that case, this Perl script should do the job; it copies its input to stdout, replacing \u.... sequences with the corresponding Unicode character, encoded in UTF-8; the input can be either one or more files named on the command line, or standard input if it's invoked with no arguments.

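#!/usr/bin/perl

use strict;
use warnings;

use utf8;
# Encode everything written to stdout as UTF-8.
binmode(STDOUT, ":utf8");

# Read each line from the files named on the command line, or from stdin.
while (<>) {
    # Replace every \uXXXX escape with the character for that code point.
    s/\\u([\da-fA-F]{4})/chr(hex($1))/eg;
    print;
}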

Again, please edit your question so it reflects your actual problem and drops any references to UTF-16BE.

Answer 2:

Simply do:

echo -e "\u0444\u0430\u0439\u043b"

Note that you may need to set your LANG environment variable to a UTF-8 locale:

export LANG="en_US.UTF-8"

As pointed out by Keith Thompson, it may be even better to use printf, so you'd have:

printf "\u0444\u0430\u0439\u043b"

For both of the options above, the output is:

файл

Source: https://stackoverflow.com/questions/14078659/how-can-i-render-utf-16be-in-command-line

Tags

Ubuntu

unicode

sed

command-line-interface