How can I globally ignore invalid byte sequences in UTF-8 strings?

傲寒 2021-02-05 08:14

I have a Rails application that has survived migrations since Rails version 1, and I would like to ignore all invalid byte sequences in it, to keep the backwards c

5 Answers
  • 2021-02-05 08:58

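    In Ruby 2.0 and later you can use the String#b method, which returns a copy of the string with BINARY (ASCII-8BIT) encoding. It is shorthand for dup.force_encoding("BINARY"), so unlike force_encoding it does not modify the receiver.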
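
    As a rough illustration of the difference, the sketch below compares String#b with force_encoding on the sample string from this thread. The variable names are made up for the example.

```ruby
# "\xFC" on its own is not a valid UTF-8 sequence, so this string is broken as UTF-8.
original = "- Men\xFC -".dup

# String#b returns a BINARY (ASCII-8BIT) copy; the receiver keeps its encoding.
binary = original.b
p original.encoding                    # still UTF-8
p original.valid_encoding?             # false
p binary.encoding == Encoding::BINARY  # true
p binary.valid_encoding?               # true: any byte sequence is valid BINARY

# force_encoding relabels the receiver itself and returns that same object.
relabelled = original.dup
same_object = relabelled.force_encoding("BINARY").equal?(relabelled)
p same_object                          # true
```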

  • 2021-02-05 09:00

    If you can configure your database/page/whatever to give you strings in ASCII-8BIT, the approach below will try to recover their real encoding.

    Use Ruby's stdlib encoding guessing library. Pass all your strings through something like this:

    require 'nkf'
    str = "- Men\xFC -"
    str.force_encoding(NKF.guess(str))
    

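    The NKF library will guess the encoding (usually successfully), and force that encoding on the string. Keep in mind that NKF (Network Kanji Filter) was written with Japanese text in mind, so its guesses are most reliable there. If you don't feel like trusting the NKF library totally, build this safeguard around string operations too: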

    begin
      str.split
    rescue ArgumentError
      str.force_encoding('BINARY')
      retry
    end
    

    This will fall back to BINARY if NKF didn't guess correctly. You can turn this into a method wrapper:

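    def str_op(s)
      yield s
    rescue ArgumentError
      # Already BINARY: the error has another cause, so don't retry forever.
      raise if s.encoding == Encoding::BINARY
      s.force_encoding('BINARY')
      retry
    end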
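
    A possible way to call such a wrapper is sketched below. The wrapper is restated so the snippet runs on its own, with a guard against endless retries, and the regex match is just an example of an operation that rejects a broken UTF-8 string.

```ruby
# Wrapper restated from the answer; the `raise if` guard is an addition that
# stops the retry loop when the string is already BINARY.
def str_op(s)
  yield s
rescue ArgumentError
  raise if s.encoding == Encoding::BINARY
  s.force_encoding("BINARY")
  retry
end

line = "- Men\xFC -".dup   # broken when read as UTF-8

# The first attempt fails on the invalid byte; the retry runs on raw bytes.
position = str_op(line) { |s| s =~ /Men/ }
p position        # 2
p line.encoding   # the string has been relabelled as BINARY
```

    Note that the wrapper changes the encoding of the string passed in, so later code sees a BINARY string.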
    
  • 2021-02-05 09:06

    If you just want to operate on the raw bytes, you can try encoding it as ASCII-8BIT/BINARY.

    str.force_encoding("BINARY").split("\n")
    

    This isn't going to get your ü back, though, since your source string in this case is ISO-8859-1 (or something like it):

    "- Men\xFC -".force_encoding("ISO-8859-1").encode("UTF-8")
     => "- Menü -"
    

    If you want to get multibyte characters, you have to know what the source charset is. Once you force_encoding to BINARY, you're going to literally just have the raw bytes, so multibyte characters won't be interpreted accordingly.
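
    To make the "raw bytes" point concrete, here is a small sketch (variable names are illustrative) showing how a valid two-byte character behaves once the string is labelled BINARY:

```ruby
text = "Men\u00FC"   # "Menü" as valid UTF-8
raw  = text.b        # same bytes, labelled BINARY

p text.length        # 4 characters
p text.bytesize      # 5 bytes, because ü takes two bytes in UTF-8
p raw.length         # 5: BINARY counts bytes, not characters
p raw.bytes.last(2)  # [195, 188], i.e. 0xC3 0xBC

# Byte-wise operations can split a multibyte character apart.
reversed = raw.reverse.force_encoding("UTF-8")
p reversed.valid_encoding?   # false
```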

    If the data is from your database, you can change your connection mechanism to use an ASCII-8BIT or BINARY encoding; Ruby should then flag the strings accordingly. Alternatively, you can monkeypatch the database driver to force an encoding on all strings read from it. This is a massive hammer, though, and might be absolutely the wrong thing to do.

    The right answer is going to be to fix your string encodings. This may require a database fix, a database driver connection encoding fix, or some combination thereof. All the bytes are still there, but if you're dealing with a given charset, you should, if at all possible, let Ruby know that you expect your data to be in that encoding. A common mistake is to use the mysql2 driver to connect to a MySQL database that holds latin1 data, but to specify a UTF-8 charset for the connection. This causes Rails to take the latin1 data from the DB and interpret it as UTF-8, rather than interpreting it as latin1, which you can then convert to UTF-8.
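
    One way to apply this at the boundary is sketched below. normalize_row is a hypothetical helper, not part of Rails or any driver, and it assumes you already know the real source charset (ISO-8859-1 in this example).

```ruby
# Hypothetical helper: label every String value with its real charset,
# then transcode it to UTF-8. Non-string values pass through untouched.
def normalize_row(row, source_encoding)
  row.transform_values do |value|
    next value unless value.is_a?(String)
    value.dup.force_encoding(source_encoding).encode("UTF-8")
  end
end

# Simulated row as it might arrive from a driver that hands back raw bytes.
row   = { "id" => 7, "title" => "- Men\xFC -".b }
clean = normalize_row(row, "ISO-8859-1")

p clean["title"]                  # "- Menü -"
p clean["title"].valid_encoding?  # true
```

    The helper copies each string before relabelling it, so the original row is left as it was.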

    If you can elaborate on where the strings are coming from, a more complete answer might be possible. You might also check out this answer for a possible global(-ish) Rails solution to default string encodings.

  • 2021-02-05 09:10

    Encoding in Ruby 1.9 and 2.0 seems to be a bit tricky. \xFC is the byte for the special character ü in ISO-8859-1. FC is also the Unicode code point of ü (U+00FC, written "\u00FC"), but UTF-8 encodes that code point as the two bytes C3 BC, so a lone \xFC is not valid UTF-8. It could be an artifact of the Ruby pack/unpack functions. Packing and unpacking Unicode characters with the U* template string is not problematic:

    >> "- Menü -".unpack('U*').pack("U*")
    => "- Menü -"
    

    You can create the "wrong" string, i.e. a string that has an invalid encoding, if you first unpack Unicode UTF-8 characters (U), and then pack unsigned characters (C):

    >> "- Menü -".unpack('U*').pack("C*")
    => "- Men\xFC -"
    

    This string is no longer valid UTF-8: pack("C*") returns a BINARY string in which ü has become the single byte \xFC. Apparently the conversion process can be reversed by applying the opposite order (a bit like operators in quantum physics):

    >> "- Menü -".unpack('U*').pack("C*").unpack("C*").pack("U*")
    => "- Menü -"
    

    In this case it is also possible to "fix" the broken string by first converting it to ISO-8859-1, and then to UTF-8. This works because ISO-8859-1 maps every byte value to the Unicode code point with the same number, so it only gives the right text when the source really was ISO-8859-1:

    >> "- Men\xFC -".force_encoding("ISO-8859-1").encode("UTF-8")
    => "- Menü -"
    >> "- Men\xFC -".encode("UTF-8", 'ISO-8859-1')
    => "- Menü -"
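
    The round trip above can be checked at the byte level. The sketch below only uses the pack/unpack templates already shown; the variable names are illustrative.

```ruby
text = "- Men\u00FC -"              # "- Menü -" as valid UTF-8

codepoints = text.unpack("U*")      # one integer per character
p codepoints                        # [45, 32, 77, 101, 110, 252, 32, 45]
p text.bytes                        # [45, 32, 77, 101, 110, 195, 188, 32, 45]

# pack("C*") writes each integer as a single byte, so ü becomes a lone 0xFC.
as_bytes = codepoints.pack("C*")
p as_bytes.encoding == Encoding::BINARY                  # true
p as_bytes.dup.force_encoding("UTF-8").valid_encoding?   # false

# Reading the bytes back as integers and packing them as code points
# restores the original text, as does treating the bytes as ISO-8859-1.
roundtrip  = as_bytes.unpack("C*").pack("U*")
via_latin1 = as_bytes.dup.force_encoding("ISO-8859-1").encode("UTF-8")
p roundtrip == text    # true
p via_latin1 == text   # true
```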
    
  • 2021-02-05 09:11

    I don't think you can globally turn off the UTF-8 checking without much difficulty. I would instead focus on fixing up all the strings that enter your application, at the boundary where they come in (e.g. when you query the database or receive HTTP requests).

    Let's suppose the strings coming in have the BINARY (a.k.a. ASCII-8BIT) encoding. This can be simulated like this:

    s = "Men\xFC".force_encoding('BINARY')  # => "Men\xFC"
    

    Then we can convert them to UTF-8 using String#encode and replace any undefined characters with the UTF-8 replacement character:

    s = s.encode("UTF-8", invalid: :replace, undef: :replace)  # => "Men\uFFFD"
    s.valid_encoding?  # => true
    

    Unfortunately, the steps above would end up mangling a lot of UTF-8 codepoints because the bytes in them would not be recognized. If you had a three-byte UTF-8 character like "\uFFFD", it would be interpreted as three separate bytes and each one would get converted to the replacement character. Maybe you could do something like this:

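    def to_utf8(str)
      # Work on a copy so the caller's string is not relabelled.
      str = str.dup.force_encoding("UTF-8")
      return str if str.valid_encoding?
      # Not valid UTF-8: treat it as raw bytes and replace every non-ASCII byte.
      str.force_encoding("BINARY")
      str.encode("UTF-8", invalid: :replace, undef: :replace)
    end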
    

    That's the best I could think of. Unfortunately, I don't know of a great way to tell Ruby to treat the string as UTF-8 and just replace all the invalid bytes.
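
    For what it's worth, Ruby 2.1 and later include String#scrub, which appears to cover this case: it keeps well-formed sequences and replaces only the invalid bytes. The sketch below contrasts it with the BINARY route, using a made-up sample string that mixes a valid ü with a stray byte.

```ruby
# A valid two-byte ü followed by a stray Latin-1 byte.
mixed = "Men\u00FC " + "\xFC"
p mixed.valid_encoding?   # false

# Going through BINARY replaces every non-ASCII byte, valid or not.
via_binary = mixed.b.encode("UTF-8", invalid: :replace, undef: :replace)
p via_binary              # "Men\uFFFD\uFFFD \uFFFD"

# scrub leaves the valid ü alone and replaces only the broken byte.
scrubbed = mixed.scrub
p scrubbed                # "Menü \uFFFD"
p scrubbed.valid_encoding?  # true

# A custom replacement string can be passed instead of U+FFFD.
custom = mixed.scrub("?")
p custom                  # "Menü ?"
```

    scrub returns a new string; scrub! modifies the receiver in place.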
