When I use iconv to convert from UTF-16 to UTF-8, everything works fine, but the other way around it does not. I have these files:
a-16.strings: Little-endian UTF-16 Unicode text
This may not be an elegant solution, but I found a manual way to ensure correct conversion for my problem, which I believe is similar to the subject of this thread.
The Problem:
I got a text data file from a user and was going to process it on Linux (specifically, Ubuntu) using a shell script (tokenization, splitting, etc.). Let's call the file myfile.txt. The first indication that something was amiss was that the tokenization was not working, so I was not surprised when I ran the file command on myfile.txt and got the following:
$ file myfile.txt
myfile.txt: Little-endian UTF-16 Unicode text, with very long lines, with CRLF line terminators
Had the file been compliant, the output should have looked like this:
$ file myfile.txt
myfile.txt: ASCII text, with very long lines
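The gap between those two file(1) reports is easy to reproduce; a small sketch, assuming GNU iconv and file(1) are available (the demo file names are illustrative, not from the original post):

```shell
#!/bin/sh
# Hedged sketch: reproduce the two file(1) reports above.
# Assumes GNU iconv and file(1); demo file names are illustrative.

# CRLF text re-encoded as UTF-16 (iconv prepends a BOM), like the
# problem file.
printf 'some data\r\n' | iconv -f UTF-8 -t UTF-16 > demo16.txt
file demo16.txt    # should mention UTF-16 Unicode text

# The compliant form: plain ASCII text with Unix line endings.
printf 'some data\n' > demo8.txt
file demo8.txt     # should mention ASCII text
```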
The Solution: To make the data file compliant, below are the three manual steps that I found to work after some trial and error.
Step 1: Convert the file to big-endian in the same encoding via vi (or vim): open it with vi myfile.txt, then in vi do :set fileencoding=UTF-16BE and write out the file. You may have to force the write with :wq!.
Step 2: Reopen the file with vi myfile.txt (it should now be UTF-16BE). In vi, do :set fileencoding=ASCII and write out the file. Again, you may have to force the write with :wq!.
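Steps 1 and 2 can also be driven non-interactively; a hedged sketch, assuming vim is installed (classic vi has no fileencoding option) and using an illustrative stand-in file:

```shell
#!/bin/sh
# Hedged sketch: steps 1 and 2 above, scripted instead of interactive.
# Assumes vim with iconv support; --cmd pins down encoding detection
# so the demo does not depend on local vimrc files.
command -v vim >/dev/null 2>&1 || exit 0   # skip where vim is absent

# Stand-in for the user's file: BOM-prefixed UTF-16, CRLF endings.
printf 'hello\r\n' | iconv -f UTF-8 -t UTF-16 > myfile.txt

# Step 1: re-save as big-endian UTF-16.
vim -es -u NONE --cmd 'set enc=utf-8 fencs=ucs-bom' \
    -c 'set fileencoding=utf-16be' -c 'wq!' myfile.txt

# Step 2: re-save as ASCII.
vim -es -u NONE --cmd 'set enc=utf-8 fencs=ucs-bom' \
    -c 'set fileencoding=ascii' -c 'wq!' myfile.txt
```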
Step 3: Run the dos2unix converter: d2u myfile.txt (on some systems the command is dos2unix). If you now run file myfile.txt, you should see output that is more familiar and reassuring, like:
myfile.txt: ASCII text, with very long lines
That's it. That's what worked for me, and I was then able to run my bash processing script on myfile.txt. I found that I could not skip Step 2; that is, in this case I could not jump directly to Step 3. Hopefully you find this info useful; perhaps someone can automate it, via sed or the like. Cheers.
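In the spirit of that automation, the whole sequence can be collapsed into one pipeline with iconv and tr; a hedged sketch, assuming GNU iconv and using illustrative file names (unlike the vi route, the big-endian intermediate of Step 2 does not seem to be needed when iconv does the conversion):

```shell
#!/bin/sh
# Hedged sketch: the three manual steps collapsed into one pipeline.
# Assumes GNU iconv; file names are illustrative. "-f UTF-16" consumes
# the BOM and picks the byte order (use UTF-16LE for BOM-less files).

# Stand-in for the user's file: BOM-prefixed UTF-16 text, CRLF endings.
printf 'hello world\r\nsecond line\r\n' | iconv -f UTF-8 -t UTF-16 > myfile.txt

# UTF-16 -> ASCII (//TRANSLIT approximates characters that have no
# ASCII counterpart instead of aborting), then strip the CR of each
# CRLF pair, as dos2unix would.
iconv -f UTF-16 -t ASCII//TRANSLIT myfile.txt | tr -d '\r' > myfile.clean.txt
mv myfile.clean.txt myfile.txt
```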