Force encode from US-ASCII to UTF-8 (iconv)

岁酱吖の 提交于 2019-11-27 10:33:42

ASCII is a subset of UTF-8, so all ASCII files are already UTF-8 encoded. The bytes in the ASCII file and the bytes that would result from "encoding it to UTF-8" would be exactly the same bytes. There's no difference between them, so there's no need to do anything.

It looks like your problem is that the files are not actually ASCII. You need to determine what encoding they are using, and transcode them properly.

Short Answer

  • file only guesses at the file encoding and may be wrong (especially in cases where special characters only appear late in large files).
  • you can use hexdump to look at bytes of non-7-bit-ascii text and compare against code tables for common encodings (iso-8859-*, utf-8) to decide for yourself what the encoding is.
  • iconv will use whatever input/output encoding you specify regardless of what the contents of the file are. If you specify the wrong input encoding the output will be garbled.
  • even after running iconv, file may not report any change due to the limited way in which file attempts to guess at the encoding. For a specific example, see my long answer.
  • 7-bit ascii (aka us-ascii) is identical at a byte level to utf-8 and the 8-bit ascii extensions (iso-8859-*). So if your file only has 7-bit characters, then you can call it utf-8, iso-8859-* or us-ascii because at a byte level they are all identical. It only makes sense to talk about utf-8 and other encodings (in this context) once your file has characters outside the 7-bit ascii range.

Long Answer

I ran into this today and came across your question. Perhaps I can add a little more information to help other people who run into this issue.

First, the term ASCII is overloaded, and that leads to confusion.

7-bit ASCII only includes 128 characters (00-7F or 0-127 in decimal). 7-bit ASCII is also referred to as US-ASCII.

https://en.wikipedia.org/wiki/ASCII

UTF-8 encoding uses the same encoding as 7-bit ASCII for its first 128 characters. So a text file that only contains characters from that range of the first 128 characters will be identical at a byte level whether encoded with UTF-8 or 7-bit ASCII.

https://en.wikipedia.org/wiki/UTF-8#Codepage_layout

The term extended ascii (or high ascii) refers to eight-bit or larger character encodings that include the standard seven-bit ASCII characters, plus additional characters.

https://en.wikipedia.org/wiki/Extended_ASCII

ISO-8859-1 (aka "ISO Latin 1") is a specific 8-bit ASCII extension standard that covers most characters for Western Europe. There are other ISO standards for Eastern European languages and Cyrillic languages. ISO-8859-1 includes characters like Ö, é, ñ and ß for German and Spanish. "Extension" means that ISO-8859-1 includes the 7-bit ASCII standard and adds characters to it by using the 8th bit. So for the first 128 characters, it is equivalent at a byte level to ASCII and UTF-8 encoded files. However, when you start dealing with characters beyond the first 128, your are no longer UTF-8 equivalent at the byte level, and you must do a conversion if you want your "extended ascii" file to be UTF-8 encoded.

https://en.wikipedia.org/wiki/Extended_ASCII#ISO_8859_and_proprietary_adaptations

One lesson I learned today is that we can't trust file to always give correct interpretation of a file's character encoding.

https://en.wikipedia.org/wiki/File_%28command%29

The command tells only what the file looks like, not what it is (in the case where file looks at the content). It is easy to fool the program by putting a magic number into a file the content of which does not match it. Thus the command is not usable as a security tool other than in specific situations.

file looks for magic numbers in the file that hint at the type, but these can be wrong, no guarantee of correctness. file also tries to guess the character encoding by looking at the bytes in the file. Basically file has a series of tests that helps it guess at the file type and encoding.

My file is a large CSV file. file reports this file as us-ascii encoded, which is WRONG.

$ ls -lh
total 850832
-rw-r--r--  1 mattp  staff   415M Mar 14 16:38 source-file
$ file -b --mime-type source-file
text/plain
$ file -b --mime-encoding source-file
us-ascii

My file has umlauts in it (ie Ö). The first non-7-bit-ascii doesn't show up until over 100k lines into the file. I suspect this is why file doesn't realize the file encoding isn't US-ASCII.

$ pcregrep -no '[^\x00-\x7F]' source-file | head -n1
102321:�

I'm on a mac, so using PCRE's grep. With gnu grep you could use the -P option. Alternatively on a mac, one could install coreutils (via homebrew or other) in order to get gnu grep.

I haven't dug into the source-code of file, and the man page doesn't discuss the text encoding detection in detail, but I am guessing file doesn't look at the whole file before guessing encoding.

Whatever my file's encoding is, these non-7-bit-ASCII characters break stuff. My German CSV file is ;-separated and extracting a single column doesn't work.

$ cut -d";" -f1 source-file > tmp
cut: stdin: Illegal byte sequence
$ wc -l *
 3081673 source-file
  102320 tmp
 3183993 total

Note the cut error and that my "tmp" file has only 102320 lines with the first special character on line 102321.

Let's take a look at how these non-ASCII characters are encoded. I dump the first non-7-bit-ascii into hexdump, do a little formatting, remove the newlines (0a) and take just the first few.

$ pcregrep -o '[^\x00-\x7F]' source-file | head -n1 | hexdump -v -e '1/1 "%02x\n"'
d6
0a

Another way. I know the first non-7-bit-ASCII char is at position 85 on line 102321. I grab that line and tell hexdump to take the two bytes starting at position 85. You can see the special (non-7-bit-ASCII) character represented by a ".", and the next byte is "M"... so this is a single-byte character encoding.

$ tail -n +102321 source-file | head -n1 | hexdump -C -s85 -n2
00000055  d6 4d                                             |.M|
00000057

In both cases, we see the special character is represented by d6. Since this character is an Ö which is a German letter, I am guessing that ISO-8859-1 should include this. Sure enough you can see "d6" is a match (https://en.wikipedia.org/wiki/ISO/IEC_8859-1#Codepage_layout).

Important question... how do I know this character is an Ö without being sure of the file encoding? Answer is context. I opened the file, read the text and then determined what character it is supposed to be. If I open it in vim it displays as an Ö because vim does a better job of guessing the character encoding (in this case) than file does.

So, my file seems to be ISO-8859-1. In theory I should check the rest of the non-7-bit-ASCII characters to make sure ISO-8859-1 is a good fit... There is nothing that forces a program to only use a single encoding when writing a file to disk (other than good manners).

I'll skip the check and move on to conversion step.

$ iconv -f iso-8859-1 -t utf8 source-file > output-file
$ file -b --mime-encoding output-file
us-ascii

Hmm. file still tells me this file is US-ASCII even after conversion. Let's check with hexdump again.

$ tail -n +102321 output-file | head -n1 | hexdump -C -s85 -n2
00000055  c3 96                                             |..|
00000057

Definitely a change. Note that we have two bytes of non-7-bit-ASCII (represented by the "." on the right) and the hex code for the two bytes is now c3 96. If we take a look, seems we have UTF-8 now (c3 96 is the right encoding of Ö in UTF-8) http://www.utf8-chartable.de/

But file still reports our file as us-ascii? Well, I think this goes back to the point about file not looking at the whole file and the fact that the first non-7-bit-ASCII characters don't occur until deep in the file.

I'll use sed to stick a Ö at the beginning of the file and see what happens.

$ sed '1s/^/Ö\'$'\n/' source-file > test-file
$ head -n1 test-file
Ö
$ head -n1 test-file | hexdump -C
00000000  c3 96 0a                                          |...|
00000003

Cool, we have an umlaut. Note the encoding though is c3 96 (utf-8). Hmm.

Checking our other umlauts in the same file again:

$ tail -n +102322 test-file | head -n1 | hexdump -C -s85 -n2
00000055  d6 4d                                             |.M|
00000057

ISO-8859-1. Oops! Just goes to show how easy it is to get the encodings screwed up.

Let's try converting our new test file with the umlaut at the front and see what happens.

$ iconv -f iso-8859-1 -t utf8 test-file > test-file-converted
$ head -n1 test-file-converted | hexdump -C
00000000  c3 83 c2 96 0a                                    |.....|
00000005
$ tail -n +102322 test-file-converted | head -n1 | hexdump -C -s85 -n2
00000055  c3 96                                             |..|
00000057

Oops. That first umlaut that was UTF-8 was interpreted as ISO-8859-1 since that is what we told iconv. The second umlaut is correctly converted from d6 to c3 96.

I'll try again, this time I will use vim to do the Ö insertion instead of sed. vim seemed to detect the encoding better (as "latin1" aka ISO-8859-1) so perhaps it will insert the new Ö with a consistent encoding.

$ vim source-file
$ head -n1 test-file-2
�
$ head -n1 test-file-2 | hexdump -C
00000000  d6 0d 0a                                          |...|
00000003
$ tail -n +102322 test-file-2 | head -n1 | hexdump -C -s85 -n2
00000055  d6 4d                                             |.M|
00000057

Looks good. Looks like ISO-8859-1 for new and old umlauts.

Now the test.

$ file -b --mime-encoding test-file-2
iso-8859-1
$ iconv -f iso-8859-1 -t utf8 test-file-2 > test-file-2-converted
$ file -b --mime-encoding test-file-2-converted
utf-8

Boom! Moral of the story. Don't trust file to always guess your encoding right. Easy to mix encodings within the same file. When in doubt, look at the hex.

A hack (also prone to failure) that would address this specific limitation of file when dealing with large files would be to shorten the file to make sure that special characters appear early in the file so file is more likely to find them.

$ first_special=$(pcregrep -o1 -n '()[^\x00-\x7F]' source-file | head -n1 | cut -d":" -f1)
$ tail -n +$first_special source-file > /tmp/source-file-shorter
$ file -b --mime-encoding /tmp/source-file-shorter
iso-8859-1

Update

Christos Zoulas updated file to make the amount of bytes looked at configurable. One day turn-around on the feature request, awesome!

http://bugs.gw.com/view.php?id=533 https://github.com/file/file/commit/d04de269e0b06ccd0a7d1bf4974fed1d75be7d9e

The feature was released in file version 5.26.

Looking at more of a large file before making a guess about encoding takes time. However it is nice to have the option for specific use-cases where a better guess may outweigh additional time/io.

Use the following option:

−P, −−parameter name=value

    Set various parameter limits.

    Name    Default     Explanation
    bytes   1048576     max number of bytes to read from file

Something like...

file_to_check="myfile"
bytes_to_scan=$(wc -c < $file_to_check)
file -b --mime-encoding -P bytes=$bytes_to_scan $file_to_check

...should do the trick if you want to force file to look at the whole file before making a guess. Of course this only works if you have file 5.26 or newer.

I haven't built/tested the latest releases yet. Most of my machines currently have file 5.04 (2010)... hopefully someday this release will make it down from upstream.

Mathieu

So people say you can't and I understand you may be frustrated when asking a question and getting such an answer.

If you really want it to show in utf-8 instead of us-ascii then you need to do it in 2 steps.

first :

iconv -f us-ascii -t utf-16 yourfile > youfileinutf16.*

second:

iconv -f utf-16le -t utf-8 yourfileinutf16 > yourfileinutf8.*

then if you do a file -i you'll see the new charset is utf-8.

Hope it helps.

sarnold

I think Ned's got the core of the problem -- your files are not actually ASCII. Try

iconv -f ISO-8859-1 -t UTF-8 file.php > file-utf8.php

I'm just guessing that you're actually using iso-8859-1, it is popular with most European languages.

suther

There is no difference between US-ASCII and UTF-8, so no need to reconvert it. But here a little hint, if you have trouble with special-chars while recodeing.

Add //TRANSLIT after the source-charset-Parameter.

Example:

iconv -f ISO-8859-1//TRANSLIT -t UTF-8 filename.sql > utf8-filename.sql

This helps me on strange types of quotes, which are allways broke the charset reencode process.

Here's a script that will find all files matching a pattern you pass it, then converting them from their current file encoding to utf-8. If the encoding is us-ascii, then it will still show as us-ascii, since that is a subset of utf-8.

#!/usr/bin/env bash    
find . -name "${1}" |
    while read line;
    do
        echo "***************************"
        echo "Converting ${line}"

        encoding=$(file -b --mime-encoding ${line}) 
        echo "Found Encoding: ${encoding}"

        iconv -f "${encoding}" -t "utf-8" ${line} -o ${line}.tmp
        mv ${line}.tmp ${line}
    done

You can use file -i file_name to check what exactly your original file format is.

Once you get that, you can do the following:

iconv -f old_format -t utf-8 input_file -o output_file

I accidentally encoded a file in UTF-7 and had a similar issue. When I typed file -i name.file I would get charset=us-ascii. iconv -f us-ascii -t utf-9//translit name.file would not work since I've gathered UTF-7 is a subset of us-ascii, as is UTF-8.

To solve this I entered: iconv -f UTF-7 -t UTF-8//TRANSLIT name.file -o output.file

I'm not sure how to determine the encoding other than what others have suggested here.

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!