How can I globally ignore invalid byte sequences in UTF-8 strings?

傲寒 2021-02-05 08:14

I have a Rails application that has survived migrations since Rails version 1, and I would like to ignore all invalid byte sequences in it, to keep the backwards c

5 answers
  •  离开以前
    2021-02-05 09:06

    If you just want to operate on the raw bytes, you can try encoding it as ASCII-8BIT/BINARY.

    str.force_encoding("BINARY").split("\n")
    

    This isn't going to get your ü back, though, since your source string in this case is ISO-8859-1 (or something like it):

    "- Men\xFC -".force_encoding("ISO-8859-1").encode("UTF-8")
     => "- Menü -"
    

    If you want multibyte characters, you have to know what the source charset is. Once you force_encoding to BINARY, you literally just have the raw bytes, so multibyte sequences are no longer interpreted as single characters.
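
As a rough illustration of the difference between characters and raw bytes, the sketch below relabels a UTF-8 string as BINARY. The sample string and the variable names are made up for the example.

```ruby
# "Menü" written with an escape so the literal is UTF-8: the ü takes two bytes (0xC3 0xBC).
utf8 = "Men\u00FC"

# dup first: force_encoding changes the encoding label of the receiver in place.
raw = utf8.dup.force_encoding("BINARY")

utf8_length = utf8.length             # counts characters
raw_length  = raw.length              # counts bytes, BINARY has no multibyte characters
same_bytes  = utf8.bytes == raw.bytes # the underlying bytes are untouched

puts "characters: #{utf8_length}, bytes: #{raw_length}, same bytes: #{same_bytes}"
```

Only the label changes here; nothing is converted, which is why the ü cannot be recovered without knowing the original charset.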

    If the data is from your database, you can change your connection mechanism to use an ASCII-8BIT or BINARY encoding; Ruby should then flag the strings accordingly. Alternatively, you can monkeypatch the database driver to force an encoding on all strings read from it. This is a massive hammer, though, and might be absolutely the wrong thing to do.
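
To give an idea of what "force encoding on all strings read" amounts to, here is a minimal sketch that uses a plain Hash as a stand-in for a database row. No real driver API is involved; `force_row_encoding` and the sample row are hypothetical, and ISO-8859-1 is only assumed to be the real charset of the data.

```ruby
# Hypothetical helper: returns a copy of the row with every String value
# relabelled as `encoding`. The bytes are not changed, only the label.
def force_row_encoding(row, encoding)
  row.transform_values do |value|
    value.is_a?(String) ? value.dup.force_encoding(encoding) : value
  end
end

# Stand-in for a row handed back by a driver (assumption, not a real driver call).
# .b returns a copy of the literal tagged as BINARY.
row = { "id" => 1, "title" => "- Men\xFC -".b }

relabelled = force_row_encoding(row, "ISO-8859-1")

# Once the label matches the real charset, transcoding to UTF-8 works.
title_utf8 = relabelled["title"].encode("UTF-8")
puts title_utf8
```

Applying something like this blindly to every string is the hammer described above: it is only correct if all the data really is in the assumed charset.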

    The right answer is going to be to fix your string encodings. This may require a database fix, a database driver connection encoding fix, or some combination thereof. All the bytes are still there, but if you're dealing with a given charset, you should, if at all possible, let Ruby know that you expect your data to be in that encoding. A common mistake is to use the mysql2 driver to connect to a MySQL database which has data in latin1 encoding, but to specify a UTF-8 charset for the connection. This causes Rails to take the latin1 data from the DB and interpret it as UTF-8, rather than interpreting it as latin1, which you can then convert to UTF-8.
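
The mislabelling problem can be reproduced in plain Ruby, without a database. The sketch below assumes the bytes really are ISO-8859-1; the variable names are made up. It also shows String#scrub (Ruby 2.1+) as a last resort when the real charset is unknown, which is closer to "ignoring" invalid bytes than to fixing them.

```ruby
# The byte 0xFC is "ü" in ISO-8859-1 (latin1) but is not valid UTF-8 on its own.
latin1_bytes = "- Men\xFC -".b   # .b returns a copy tagged as BINARY

# What a wrongly configured connection effectively does:
# label latin1 bytes as UTF-8 without converting them.
mislabelled = latin1_bytes.dup.force_encoding("UTF-8")
mislabelled_valid = mislabelled.valid_encoding?

# The fix: label the bytes with their real charset, then transcode to UTF-8.
fixed = latin1_bytes.dup.force_encoding("ISO-8859-1").encode("UTF-8")

# Last resort: replace invalid sequences with "?". The ü is lost for good.
scrubbed = mislabelled.scrub("?")

puts "mislabelled valid? #{mislabelled_valid}"
puts "fixed:    #{fixed}"
puts "scrubbed: #{scrubbed}"
```

Scrubbing makes the string safe to pass around as UTF-8, but it discards data, so it is better suited to input whose origin cannot be determined than to a database whose charset can be fixed.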

    If you can elaborate on where the strings are coming from, a more complete answer might be possible. You might also check out this answer for a possible global(-ish) Rails solution to default string encodings.
