How can I globally ignore invalid byte sequences in UTF-8 strings?

傲寒 2021-02-05 08:14

I have a Rails application that has survived migrations since Rails version 1, and I would like to ignore all invalid byte sequences in it, to keep the backwards c

5 answers
  •  别那么骄傲
    2021-02-05 09:00

    If you can configure your database, page, or other data source to hand you strings tagged as ASCII-8BIT (binary), the approach below will recover their real encoding.

    Use NKF, the encoding-guessing library in Ruby's standard library, and pass all your strings through something like this:

    require 'nkf'
    str = "- Men\xFC -"                  # the byte 0xFC is not valid UTF-8
    str.force_encoding(NKF.guess(str))  # NKF.guess returns an Encoding object
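
To see what force_encoding does in the snippet above, here is a minimal sketch that does not call NKF. ISO-8859-1 is an assumed stand-in for whatever NKF.guess would return for this string, not its actual output. The point is that force_encoding only relabels the bytes, and a separate encode call is what produces valid UTF-8.

```ruby
# .dup gives a mutable copy, since string literals may be frozen
str = "- Men\xFC -".dup
valid_before = str.valid_encoding?   # false: a lone 0xFC byte is not valid UTF-8

# Assumption: ISO-8859-1 stands in for the result of NKF.guess(str)
guessed = Encoding::ISO_8859_1
str.force_encoding(guessed)          # relabels the bytes, changes none of them
valid_after = str.valid_encoding?    # true: every byte is a valid ISO-8859-1 character

utf8 = str.encode('UTF-8')           # transcodes; 0xFC becomes "ü"
```

If the guess is wrong, the string may still report valid_encoding? as true while showing the wrong characters, so a guess is worth spot-checking on real data.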
    

    NKF guesses the encoding (usually correctly), and force_encoding applies that guess to the string. If you don't want to trust NKF completely, wrap string operations in this safeguard too:

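    begin
      str.split
    rescue ArgumentError
      # Invalid bytes for the guessed encoding: retry once as raw bytes.
      # If the string is already binary, the error is something else.
      raise if str.encoding == Encoding::BINARY
      str.force_encoding('BINARY')
      retry
    end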
    

    This falls back to BINARY if NKF guessed wrong. You can turn it into a method wrapper:

    def str_op(s)
      yield s
    rescue ArgumentError
      # Already binary means the error is unrelated to encoding: re-raise
      raise if s.encoding == Encoding::BINARY
      s.force_encoding('BINARY')
      retry
    end
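
As a usage sketch for the wrapper above: str_op is repeated here so the snippet runs on its own, with a guard that re-raises once the string is already BINARY so an unrelated ArgumentError cannot loop forever. The sample string is made up. The last line shows String#scrub (Ruby 2.1+), which is a different technique from this answer's, for the case where you would rather drop invalid bytes than keep them as binary.

```ruby
# Repeated so the snippet runs on its own
def str_op(s)
  yield s
rescue ArgumentError
  # Already binary: the error is not an encoding problem, so re-raise
  raise if s.encoding == Encoding::BINARY
  s.force_encoding(Encoding::BINARY)
  retry
end

# Hypothetical input: 0xE9 is "é" in ISO-8859-1 but invalid as UTF-8
raw = "caf\xE9 au lait".dup

# The regexp split raises ArgumentError on the invalid UTF-8 string,
# then succeeds on the retry once the string is tagged BINARY
words = str_op(raw) { |s| s.split(/\s+/) }

# Alternative (not from the answer): remove the invalid bytes instead
cleaned = "caf\xE9 au lait".scrub('')
```

Note that str_op mutates its argument: after the call, raw is tagged BINARY. scrub returns a new string and leaves the original untouched.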
    
