How can I globally ignore invalid byte sequences in UTF-8 strings?

傲寒 2021-02-05 08:14

I have a Rails application that has survived migrations since Rails version 1, and I would like to ignore all invalid byte sequences in it, to keep the backwards c

5 answers

时光说笑 (original poster)

2021-02-05 09:10
Encoding in Ruby 1.9 and 2.0 is a bit tricky. \xFC is the byte for the special character ü in ISO-8859-1, and FC is also the Unicode code point of ü (U+00FC, decimal 252, written "\u00FC" in Ruby). In UTF-8, however, that code point is encoded as the two bytes \xC3\xBC, so a lone \xFC is not valid UTF-8. The stray byte could be an artifact of Ruby's pack/unpack functions. Packing and unpacking Unicode characters with the U* template string is not problematic:
```
>> "- Menü -".unpack('U*').pack("U*")
=> "- Menü -"
```
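To make the difference between a code point and its encoded bytes concrete, here is a minimal sketch. The variable names are illustrative only, and it assumes a UTF-8 source file (the default since Ruby 2.0):

```ruby
u_umlaut = "\u00FC"                      # ü as a UTF-8 string

# unpack("U*") yields Unicode code points, not bytes.
code_points = u_umlaut.unpack("U*")      # [252], i.e. 0xFC

# The UTF-8 encoding of that code point takes two bytes.
utf8_bytes = u_umlaut.bytes              # [195, 188], i.e. 0xC3 0xBC

# In ISO-8859-1 the same character is the single byte 0xFC.
latin1_bytes = u_umlaut.encode("ISO-8859-1").bytes  # [252]
```

This is why the value 252 shows up both as a code point and as an ISO-8859-1 byte, but never as a standalone byte in valid UTF-8.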
You can create the "wrong" string, i.e. a string that has an invalid encoding, if you first unpack Unicode UTF-8 characters (U), and then pack unsigned characters (C):
```
>> "- Menü -".unpack('U*').pack("C*")
=> "- Men\xFC -"
```
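The effect can be checked with `String#encoding` and `String#valid_encoding?`. This is a small sketch under the assumption of a UTF-8 source file; the variable names are made up for illustration:

```ruby
original = "- Men\u00FC -"                    # valid UTF-8, ü is U+00FC

# One byte per code point: 252 becomes the single byte 0xFC.
packed = original.unpack("U*").pack("C*")

packed_encoding = packed.encoding             # binary (ASCII-8BIT)
packed_bytes    = packed.bytes                # [45, 32, 77, 101, 110, 252, 32, 45]

# Treating those bytes as UTF-8 exposes the problem.
as_utf8 = packed.dup.force_encoding("UTF-8")
still_valid = as_utf8.valid_encoding?         # false
```

`pack("C*")` returns a binary string, so the invalidity only becomes visible once the bytes are interpreted as UTF-8.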
This string is no longer valid UTF-8. The conversion can be reversed by applying the operations in the opposite order (a bit like operators in quantum physics):
```
>> "- Menü -".unpack('U*').pack("C*").unpack("C*").pack("U*")
=> "- Menü -"
```
In this case it is also possible to "fix" the broken string by first treating it as ISO-8859-1 and then converting it to UTF-8. This works because ISO-8859-1 maps every byte to the code point with the same value, so it only recovers the intended text if the stray bytes really came from that character set:
```
>> "- Men\xFC -".force_encoding("ISO-8859-1").encode("UTF-8")
=> "- Menü -"
>> "- Men\xFC -".encode("UTF-8", 'ISO-8859-1')
=> "- Menü -"
```
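For the original question of ignoring invalid bytes, one possible approach is sketched below. `to_clean_utf8` is a hypothetical helper, not part of Ruby or Rails, and the ISO-8859-1 fallback is an assumption about where the legacy data came from. `String#scrub` requires Ruby 2.1 or later, so it is not available on the 1.9 and 2.0 versions discussed above.

```ruby
# Hypothetical helper: returns a valid UTF-8 copy of +raw+.
# If the bytes are already valid UTF-8 they are kept unchanged;
# otherwise they are reinterpreted in +fallback+ (assumed to be
# ISO-8859-1 here) and transcoded to UTF-8.
def to_clean_utf8(raw, fallback: Encoding::ISO_8859_1)
  candidate = raw.dup.force_encoding(Encoding::UTF_8)
  return candidate if candidate.valid_encoding?

  raw.dup.force_encoding(fallback).encode(Encoding::UTF_8)
end

broken = "- Men\xFC -"

# Recover the character by assuming the bytes were ISO-8859-1.
recovered = to_clean_utf8(broken)

# Alternatively, drop or replace the invalid bytes (Ruby 2.1+).
dropped  = broken.scrub("")     # invalid bytes removed
replaced = broken.scrub("?")    # invalid bytes replaced with "?"
```

Dropping bytes with `scrub` loses information, while the fallback conversion guesses an encoding. Which trade-off is acceptable depends on the data.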