How to use Ruby's readlines.grep for UTF-16 files?

Given two files created by the following commands:

$ printf "foo\nbar\nbaz\n" | iconv -t UTF-8 > utf-8.txt
$ printf "foo\nbar\nbaz\


                      
2 answers

旧时难觅i
2020-12-21 22:49

Short answer:

You almost have it. You just need to say which characters to replace (I would guess both the invalid and the undefined ones):

$ ruby -e 'puts File.open("utf-16.txt", "r").read.encode("UTF-8", invalid: :replace, undef: :replace, replace: "")'
foo
bar
baz


Also, I don't think you need force_encoding.

If you want to skip the BOM, convert on open, and use readlines, you can use:

$ ruby -e 'puts File.open("utf-16.txt", mode: "rb:BOM|UTF-16LE:UTF-8").readlines.grep(/foo/)'
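
As a self-contained sketch of the same idea, the script below writes its own sample file instead of relying on iconv. The helper names write_utf16le_sample and grep_utf16le are made up for illustration, and the sample bytes are an assumption about what the UTF-16 file contains (a BOM followed by little-endian text).

```ruby
require "tempfile"

# Hypothetical helper: write a BOM-prefixed UTF-16LE sample file.
# Assumption: this mirrors the bytes that iconv produced for utf-16.txt.
def write_utf16le_sample(path)
  File.binwrite(path, "\uFEFFfoo\nbar\nbaz\n".encode("UTF-16LE"))
end

# Hypothetical helper: transcode to UTF-8 while reading, then grep.
# "BOM|" consumes a leading byte order mark; "rb" is required because
# UTF-16LE is not ASCII-compatible.
def grep_utf16le(path, pattern)
  File.open(path, mode: "rb:BOM|UTF-16LE:UTF-8") do |file|
    file.readlines.grep(pattern)
  end
end

Tempfile.create("utf16") do |tmp|
  tmp.close
  write_utf16le_sample(tmp.path)
  p grep_utf16le(tmp.path, /foo/)
end
```

Because the lines arrive as UTF-8 strings, an ordinary regexp literal such as /foo/ can be used without any extra encoding work.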


More details:

The reason why you get invalid characters when you do:

$ ruby -e 'puts File.open("utf-16.txt", "r").read.force_encoding("ISO-8859-1").encode("utf-8", replace: nil)'
ÿþfoo
bar
baz


is that a Unicode file can begin with a byte order mark (BOM), which indicates the byte order and the encoding form. In your case it is the byte sequence FF FE (U+FEFF stored little-endian, meaning UTF-16LE), and those bytes are invalid in UTF-8.
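
To make the BOM check concrete, here is a minimal sketch that looks at the first bytes of some data and reports which encoding the BOM implies. The BOMS table and the sniff_bom helper are hypothetical names introduced for this example, not part of Ruby's standard library.

```ruby
# BOM byte sequences mapped to encoding names. Longer BOMs come first
# because the UTF-32LE BOM starts with the same bytes as the UTF-16LE one.
BOMS = {
  "\x00\x00\xFE\xFF".b => "UTF-32BE",
  "\xFF\xFE\x00\x00".b => "UTF-32LE",
  "\xEF\xBB\xBF".b     => "UTF-8",
  "\xFE\xFF".b         => "UTF-16BE",
  "\xFF\xFE".b         => "UTF-16LE",
}.freeze

# Hypothetical helper: return the encoding name implied by a leading BOM,
# or nil when the data does not start with one.
def sniff_bom(bytes)
  head = bytes.b # compare raw bytes, ignoring the string's encoding tag
  BOMS.each { |bom, name| return name if head.start_with?(bom) }
  nil
end

p sniff_bom("\uFEFFfoo\n".encode("UTF-16LE"))
```

In practice the first few bytes would come from something like File.binread(path, 4); a UTF-16LE file whose first character after the BOM is U+0000 would be misreported as UTF-32LE by this simple check.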

You can verify that by invoking encode without force_encoding:

$ ruby -e 'puts File.open("utf-16.txt", "r").read.encode("utf-8")'
��foo
bar
baz


The question mark in a black diamond is the replacement character (U+FFFD); it stands in for an unknown, unrecognized, or unrepresentable character.
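
The sketch below shows, under the assumption that the file is BOM-prefixed UTF-16LE, what happens when those bytes are treated as UTF-8. The helper name inspect_mislabelled is hypothetical. Note that stripping only the invalid bytes removes the BOM but leaves the NUL byte of every UTF-16 code unit in the string; terminals usually do not display them, yet a pattern such as /foo/ will not match the result.

```ruby
# Hypothetical helper: inspect bytes that are really UTF-16LE but are
# tagged as UTF-8, as happens when the file is read without an encoding.
def inspect_mislabelled(bytes)
  as_utf8 = bytes.dup.force_encoding("UTF-8")
  {
    valid: as_utf8.valid_encoding?,   # false: FF and FE are invalid in UTF-8
    scrubbed: as_utf8.scrub(""),      # drops FF FE, keeps the NUL bytes
    decoded: bytes.dup.force_encoding("UTF-16LE").encode("UTF-8"),
  }
end

sample = "\uFEFFfoo\n".encode("UTF-16LE").b
p inspect_mislabelled(sample)
```

The decoded value still starts with U+FEFF, because a plain encode does not remove the BOM; the "BOM|" prefix in the open mode is what skips it.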

You can check more on BOM here.
                                                                        
                                                        
            
            
              
                
感动是毒
2020-12-21 22:51

While the answer by Viktor is technically correct, recoding the whole file from UTF-16LE into UTF-8 is unnecessary and might hurt performance. All you actually need is to build the regexp in the same encoding as the file:

puts File.open(
  "utf-16.txt", mode: "rb:BOM|UTF-16LE"
).readlines.grep(
  Regexp.new "foo".encode(Encoding::UTF_16LE)
)
#⇒ foo
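
A self-contained variant of the same approach might look like the sketch below. The helper name grep_in_utf16le and the sample file are assumptions made for illustration. Only the matching lines are converted to UTF-8, and only so that they can be compared and printed easily.

```ruby
require "tempfile"

# Hypothetical helper: search a UTF-16LE file without transcoding every line.
# The pattern source is encoded once; the lines stay in UTF-16LE for matching.
def grep_in_utf16le(path, needle)
  pattern = Regexp.new(Regexp.escape(needle).encode(Encoding::UTF_16LE))
  File.open(path, mode: "rb:BOM|UTF-16LE") do |file|
    # Convert only the matching lines, and only for display purposes.
    file.readlines.grep(pattern).map { |line| line.encode("UTF-8") }
  end
end

Tempfile.create("utf16") do |tmp|
  tmp.close
  # Assumed sample content: a BOM followed by little-endian UTF-16 text.
  File.binwrite(tmp.path, "\uFEFFfoo\nbar\nbaz\n".encode("UTF-16LE"))
  p grep_in_utf16le(tmp.path, "foo")
end
```

Regexp.escape is applied before encoding so that a needle containing regexp metacharacters is matched literally; drop it if the needle is meant to be a pattern.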

                                                                        
                                                        
            
            
              
                