When I use iconv to convert from UTF-16 to UTF-8, everything is fine, but the other way around it does not work. I have these files:
a-16.strings: Little-endian UTF-16 Unicode text
UTF-16LE tells iconv to generate little-endian UTF-16 without a BOM (Byte Order Mark). Apparently it assumes that since you specified LE, the BOM isn't necessary.
UTF-16 tells it to generate UTF-16 text (in the local machine's byte order) with a BOM.
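To see the difference for yourself, you can compare the first bytes each target produces (a quick sketch; the byte order that -t UTF-16 picks, and hence the BOM it writes, depends on your iconv implementation and machine):

printf 'A' | iconv -f UTF-8 -t UTF-16LE | od -An -tx1   # 41 00, no BOM
printf 'A' | iconv -f UTF-8 -t UTF-16 | od -An -tx1     # BOM first, e.g. ff fe 41 00 or fe ff 00 41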
If you're on a little-endian machine, I don't see a way to tell iconv to generate big-endian UTF-16 with a BOM, but I might just be missing something.
I find that the file command doesn't recognize UTF-16 text without a BOM, and your editor might not either. But if you run iconv -f UTF-16LE -t UTF-8 b-16.strings, you should get a valid UTF-8 version of the original file.
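Since iconv writes to standard output, you'd normally redirect the result into a new file; the output name below is just an illustration:

iconv -f UTF-16LE -t UTF-8 b-16.strings > b-8.strings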
Try running od -c on the files to see their actual contents.
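For example, a hypothetical two-character UTF-16LE file containing "Hi" with a BOM dumps roughly like this (377 376 are the octal forms of the BOM bytes 0xFF 0xFE):

$ od -c hi-16le.strings
0000000 377 376   H  \0   i  \0
0000006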
UPDATE:
It looks like you're on a big-endian machine (x86 is little-endian), and you're trying to generate a little-endian UTF-16 file with a BOM. Is that correct? As far as I can tell, iconv won't do that directly. But this should work:
( printf "\xff\xfe" ; iconv -f utf-8 -t utf-16le UTF-8-FILE ) > UTF-16-FILE
The behavior of the printf might depend on your locale settings; I have LANG=en_US.UTF-8. (Can anyone suggest a more elegant solution?)
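The same trick should work for the other byte order, prepending a big-endian BOM to big-endian output (a sketch along the same lines, not from the original answer):

( printf "\xfe\xff" ; iconv -f utf-8 -t utf-16be UTF-8-FILE ) > UTF-16BE-FILE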
Another workaround, if you know the endianness of the output produced by -t utf-16:
iconv -f utf-8 -t utf-16 UTF-8-FILE | dd conv=swab 2>/dev/null
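dd conv=swab swaps each pair of bytes, so the BOM gets swapped along with the data and stays valid for the new byte order. A quick sanity check (my own illustration, not part of the workaround itself):

printf '\xfe\xff\x00\x41' | dd conv=swab 2>/dev/null | od -An -tx1   # expect: ff fe 41 00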
This may not be an elegant solution, but I found a manual way to ensure a correct conversion for my problem, which I believe is similar to the subject of this thread.
The Problem: I got a text datafile from a user and I was going to process it on Linux (specifically, Ubuntu) using a shell script (tokenization, splitting, etc.). Let's call the file myfile.txt. The first indication I got that something was amiss was that the tokenization was not working, so I was not surprised when I ran the file command on myfile.txt and got the following:
$ file myfile.txt
myfile.txt: Little-endian UTF-16 Unicode text, with very long lines, with CRLF line terminators
If the file had been compliant, here is what the file output should have been:
$ file myfile.txt
myfile.txt: ASCII text, with very long lines
The Solution: To make the datafile compliant, below are the three manual steps that I found to work after some trial and error with other approaches.
Step 1: First convert to big-endian at the same encoding via vi (or vim): vi myfile.txt. In vi, do :set fileencoding=UTF-16BE and then write out the file. You may have to force the write with :wq!.
Step 2: Open the file again with vi myfile.txt (it should now be in UTF-16BE). In vi, do :set fileencoding=ASCII and then write out the file. Again, you may have to force the write with :wq!.
Step 3: Run the dos2unix converter: d2u myfile.txt. If you now run file myfile.txt, you should see output that is more familiar and reassuring, like:
myfile.txt: ASCII text, with very long lines
That's it. That's what worked for me, and I was then able to run my processing bash shell script on myfile.txt. I found that I cannot skip Step 2; that is, in this case I cannot skip directly to Step 3. Hopefully you find this info useful; hopefully someone can automate it, perhaps via sed or the like. Cheers.
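For what it's worth, the three manual steps can probably be collapsed into one pipeline with iconv and tr, assuming the input really is BOM-marked UTF-16 with CRLF line endings and contains only ASCII characters (the output filename is just an illustration):

iconv -f UTF-16 -t ASCII myfile.txt | tr -d '\r' > myfile.clean.txt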
I first convert to UTF-16, which will prepend a byte-order mark, if necessary, as Keith Thompson mentions. Then, since UTF-16 doesn't define the endianness, we must use file to determine whether it's UTF-16BE or UTF-16LE. Finally, we can convert to UTF-16LE.
iconv -f utf-8 -t utf-16 UTF-8-FILE > UTF-16-UNKNOWN-ENDIANNESS-FILE
FILE_ENCODING="$( file --brief --mime-encoding UTF-16-UNKNOWN-ENDIANNESS-FILE )"
iconv -f "$FILE_ENCODING" -t UTF-16LE UTF-16-UNKNOWN-ENDIANNESS-FILE > UTF-16-FILE