Ruby String.encode still gives “invalid byte sequence in UTF-8”

说谎 2020-12-28 10:28

In IRB, I'm trying the following:

1.9.3p194 :001 > foo = "\xBF".encode("utf-8", :invalid => :replace, :undef => :replace)
 => "\xBF"
3 Answers
  • 2020-12-28 11:06

    If you're only working with ASCII characters, you can use:

    >> "Hello \xBF World!".encode('utf-8', 'binary', :invalid => :replace, :undef => :replace)
    => "Hello � World!"
    

    But what happens if we use the same approach with valid UTF-8 characters that are not valid ASCII?

    >> "¡Hace \xBF mucho frío!".encode('utf-8', 'binary', :invalid => :replace, :undef => :replace)
    => "��Hace � mucho fr��o!"
    

    Uh oh! We want frío to keep its accent. Here's an option that keeps the valid UTF-8 characters and drops the invalid bytes:

    >> "¡Hace \xBF mucho frío!".chars.select{|i| i.valid_encoding?}.join
    => "¡Hace  mucho frío!"
    

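    Ruby 2.1 also added a method called scrub that solves this problem. Without an argument it replaces invalid bytes with the replacement character; with an argument it uses the given string instead: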

    >> "¡Hace \xBF mucho frío!".scrub
    => "¡Hace � mucho frío!"
    >> "¡Hace \xBF mucho frío!".scrub('')
    => "¡Hace  mucho frío!"
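
The two approaches above can be folded into one helper. This is only a minimal sketch: the name `clean_utf8` and its default replacement are made up for illustration, and the fallback branch assumes you may still need to run on a Ruby older than 2.1, where `scrub` is not available.

```ruby
# Hypothetical helper (not part of Ruby): replace invalid bytes in a string.
def clean_utf8(str, replacement = "\uFFFD")
  if str.respond_to?(:scrub)
    # Ruby 2.1+: scrub replaces each invalid byte sequence.
    str.scrub(replacement)
  else
    # Older Rubies: walk the characters and swap out the invalid ones.
    str.chars.map { |c| c.valid_encoding? ? c : replacement }.join
  end
end

dirty = "¡Hace \xBF mucho frío!"
puts clean_utf8(dirty)       # invalid byte becomes the replacement character
puts clean_utf8(dirty, "")   # invalid byte is dropped

# scrub also accepts a block, which receives the offending bytes.
# Here they are rendered as hex so they stay visible in logs.
puts dirty.scrub { |bytes| "<#{bytes.unpack('H*').first}>" }
```

The block form is handy when you want to find out which bytes were bad before deciding how to treat them.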
    
  • 2020-12-28 11:22

    I'd guess that "\xBF" is already tagged as UTF-8, so when you call encode, Ruby thinks you're trying to encode a UTF-8 string as UTF-8 and does nothing:

    >> s = "\xBF"
    => "\xBF"
    >> s.encoding
    => #<Encoding:UTF-8>
    

    \xBF isn't valid UTF-8, so this is, of course, nonsense. But encode also has a form that takes the source encoding as a second argument:

    encode(dst_encoding, src_encoding [, options] ) → str

    [...] The second form returns a copy of str transcoded from src_encoding to dst_encoding.

    You can force the issue by telling encode to ignore what the string thinks its encoding is and treat it as binary data:

    >> foo = s.encode('utf-8', 'binary', :invalid => :replace, :undef => :replace)
    => "�"
    

    Where s is the "\xBF" that thinks it is UTF-8 from above.

    You could also use force_encoding on s to force it to be binary and then use the two-argument encode:

    >> s.encoding
    => #<Encoding:UTF-8>
    >> s.force_encoding('binary')
    => "\xBF"
    >> s.encoding
    => #<Encoding:ASCII-8BIT>
    >> foo = s.encode('utf-8', :invalid => :replace, :undef => :replace)
    => "�"
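
As a rough sketch of the idea above: check what the string claims to be, then transcode from binary. The helper name `binary_to_utf8` is invented for this example, and it works on a copy because `force_encoding` changes the receiver in place. The result of the same-encoding `encode` call seems to differ between Ruby versions, so the sketch does not rely on it.

```ruby
# Hypothetical helper: treat the bytes as raw binary and transcode to UTF-8.
# Every byte outside ASCII is undefined in that conversion and gets replaced.
def binary_to_utf8(str)
  str.dup
     .force_encoding(Encoding::BINARY)
     .encode(Encoding::UTF_8, invalid: :replace, undef: :replace)
end

s = "\xBF"
puts s.encoding          # the encoding the string is tagged with
puts s.valid_encoding?   # false: a lone continuation byte is not valid UTF-8

fixed = binary_to_utf8(s)
puts fixed.encoding
puts fixed.valid_encoding?

# The original string is left untouched, including its encoding tag.
puts s.encoding
```

Keep in mind that this route replaces every non-ASCII byte, so it only suits data that is expected to be plain ASCII.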
    
  • 2020-12-28 11:28

    This is fixed if you read the source text file using an explicit encoding:

    File.open( 'thefile.txt', 'r:iso8859-1' )
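
A minimal sketch of that idea, assuming the file really is ISO-8859-1. The helper name `read_latin1_as_utf8` and the sample bytes are made up for illustration. Adding a second encoding to the mode string (`external:internal`) asks Ruby to transcode to UTF-8 while reading, instead of only tagging the string as ISO-8859-1.

```ruby
require 'tempfile'

# Hypothetical helper: read a file known to be ISO-8859-1 and return UTF-8.
def read_latin1_as_utf8(path)
  File.open(path, 'r:iso8859-1:utf-8') { |io| io.read }
end

# Sample data: "frío" in ISO-8859-1, where í is the single byte 0xED.
sample = Tempfile.new('latin1')
sample.binmode
sample.write("fr\xEDo".b)
sample.close

text = read_latin1_as_utf8(sample.path)
puts text.encoding
puts text.valid_encoding?
puts text

sample.unlink
```

If the guess about the source encoding is wrong, the read will still succeed but the characters will come out wrong, so this only helps when you know where the file came from.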
    