I have a Rails application that has survived migrations since Rails version 1, and I would like to ignore all invalid byte sequences in it, to keep backwards compatibility.
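Something like this is the behaviour I am after (just a sketch of the intent; String#scrub needs Ruby 2.1+, and the UTF-16 round trip is what I have seen suggested for older versions, so treat both calls as assumptions on my part):
broken = "- Men\xFC -"     # contains a byte that is invalid in UTF-8
broken.scrub("")           # => "- Men -"   (Ruby 2.1+, drop invalid bytes)
broken.encode("UTF-16", invalid: :replace, replace: "").encode("UTF-8")
                           # => "- Men -"   (workaround for older Rubies)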
Encoding in Ruby 1.9 and 2.0 can be a bit tricky. \xFC is the byte for the special character ü in ISO-8859-1, and 0xFC is also the Unicode code point of ü (U+00FC = \u00FC), so the same value turns up with UTF-8 (as a code point, not as a byte: the UTF-8 byte sequence for ü is \xC3\xBC) and with UTF-16 (where ü is encoded as 0x00FC). Such a stray byte could be an artifact of the Ruby pack/unpack functions. Packing and unpacking Unicode characters with the U* template is not problematic:
>> "- Menü -".unpack('U*').pack("U*")
=> "- Menü -"
You can create the "wrong" string, i.e. a string whose bytes are not valid UTF-8, by first unpacking the UTF-8 characters as code points (U) and then packing those values as unsigned 8-bit characters (C):
>> "- Menü -".unpack('U*').pack("C*")
=> "- Men\xFC -"
Treated as UTF-8, this string no longer has a valid encoding. Apparently the process can be reversed by applying the two steps in the opposite order (a bit like operators in quantum physics):
>> "- Menü -".unpack('U*').pack("C*").unpack("C*").pack("U*")
=> "- Menü -"
In this case it is also possible to "fix" the broken string by first tagging it as ISO-8859-1 and then converting it to UTF-8, but I am not sure whether this only works by accident, because the byte happens to correspond to a character in that character set:
>> "- Men\xFC -".force_encoding("ISO-8859-1").encode("UTF-8")
=> "- Menü -"
>> "- Men\xFC -".encode("UTF-8", 'ISO-8859-1')
=> "- Menü -"