How can I globally ignore invalid byte sequences in UTF-8 strings?

傲寒 2021-02-05 08:14

I have a Rails application that has survived migrations since Rails version 1, and I would like to ignore all invalid byte sequences in it, to keep the backwards c

5 answers
  •  时光说笑
    2021-02-05 09:10

    Encoding in Ruby 1.9 and 2.0 seems to be a bit tricky. \xFC is the byte for the special character ü in ISO-8859-1, and FC is also the Unicode code point of ü (U+00FC, decimal 252, written "\u00FC" in Ruby; in UTF-16 it is the code unit 0x00FC). In UTF-8, however, that code point is encoded as the two bytes \xC3\xBC, so a lone \xFC byte is not valid UTF-8. It could be an artifact of the Ruby pack/unpack functions. Packing and unpacking Unicode characters with the U* template string is not problematic:

    >> "- Menü -".unpack('U*').pack("U*")
    => "- Menü -"
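
    To make the difference between the code point and the encoded bytes concrete, here is a minimal sketch. The variable names are only illustrative, and the literal uses the "\u00FC" escape so the result does not depend on the source file encoding:

    ```ruby
    s = "\u00FC"                                    # "ü" as a UTF-8 string

    codepoints   = s.unpack("U*")                   # => [252], the code point U+00FC
    utf8_bytes   = s.bytes.to_a                     # => [195, 188], i.e. \xC3\xBC
    latin1_bytes = s.encode("ISO-8859-1").bytes.to_a # => [252], i.e. \xFC
    ```

    The single byte \xFC only appears once the string is transcoded to ISO-8859-1; the UTF-8 form never contains it.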
    

    You can create the "wrong" string, i.e. a string whose bytes are not valid UTF-8, if you first unpack the UTF-8 characters into code points (U) and then pack them as unsigned 8-bit characters (C):

    >> "- Menü -".unpack('U*').pack("C*")
    => "- Men\xFC -"
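
    One way to confirm that the result is broken when read as UTF-8 is String#valid_encoding?. This is only a sketch; the force_encoding call is an added step, assuming the packed bytes later get tagged as UTF-8 (as would happen with data read into a UTF-8 application):

    ```ruby
    good = "- Men\u00FC -"                      # valid UTF-8
    raw  = good.unpack("U*").pack("C*")         # each code point written as one raw byte

    # Tag the raw bytes as UTF-8 without changing them.
    as_utf8 = raw.dup.force_encoding("UTF-8")

    good_valid   = good.valid_encoding?         # => true
    broken_valid = as_utf8.valid_encoding?      # => false, \xFC cannot start a UTF-8 sequence here
    ```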
    

    Interpreted as UTF-8, this string no longer has a valid encoding. Apparently the conversion can be reversed by applying the operations in the opposite order (a bit like operators in quantum physics):

    >> "- Menü -".unpack('U*').pack("C*").unpack("C*").pack("U*")
    => "- Menü -"
    

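    In this case it is also possible to "fix" the broken string by first interpreting it as ISO-8859-1 and then converting it to UTF-8. This works here because pack("C*") wrote each code point as a single byte and the first 256 Unicode code points coincide with ISO-8859-1, so it only holds for characters contained in that character set: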

    >> "- Men\xFC -".force_encoding("ISO-8859-1").encode("UTF-8")
    => "- Menü -"
    >> "- Men\xFC -".encode("UTF-8", 'ISO-8859-1')
    => "- Menü -"
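
    Building on the conversion above, a rough sketch of how it could be wrapped for reuse. to_valid_utf8 is a hypothetical helper, not part of Ruby or Rails, and ISO-8859-1 as the fallback is an assumption about where the legacy bytes came from. String#scrub is only available from Ruby 2.1 on, so the last line does not apply to 1.9 or 2.0:

    ```ruby
    # Hypothetical helper: keep valid UTF-8 as-is, otherwise reinterpret the
    # raw bytes in a fallback encoding (assumed ISO-8859-1) and transcode.
    def to_valid_utf8(str, fallback = "ISO-8859-1")
      utf8 = str.dup.force_encoding("UTF-8")
      return utf8 if utf8.valid_encoding?
      str.dup.force_encoding(fallback).encode("UTF-8")
    end

    repaired  = to_valid_utf8("- Men\xFC -")     # => "- Menü -"
    untouched = to_valid_utf8("- Men\u00FC -")   # already valid, returned unchanged

    # Ruby 2.1+: drop invalid bytes instead of guessing their encoding.
    dropped = "- Men\xFC -".scrub("")            # => "- Men -"
    ```

    The fallback variant preserves the character but relies on the guess about the original encoding being right; scrub makes no guess and simply discards (or replaces) what it cannot decode.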
    
