Same string but different byte codes

家住魔仙堡 submitted on 2021-02-08 13:12:41

Question


I have two strings:

a = 'hà nội'
b = 'hà nội'

When I compare them with a == b, it returns false.

I checked the byte codes:

a.bytes = [104, 97, 204, 128, 32, 110, 195, 180, 204, 163, 105]
b.bytes = [104, 195, 160, 32, 110, 225, 187, 153, 105]

What is the cause? How can I fix it so that a == b returns true?


Answer 1:


This is an issue with Unicode equivalence.

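To compare these strings you need to normalize them first, so that both use the same code point (and therefore byte) sequence for accented characters. In the byte dumps above, a stores à as the letter a followed by a combining grave accent (bytes 204, 128), while b stores it as a single precomposed character (bytes 195, 160).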

a.unicode_normalize == b.unicode_normalize
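
As a minimal sketch of the comparison above: the two strings are rebuilt here from the byte lists in the question (using pack and force_encoding, which is my own choice so the example does not depend on how the text was copied), then compared before and after normalization.

```ruby
# Rebuild both strings from the byte dumps in the question.
a = [104, 97, 204, 128, 32, 110, 195, 180, 204, 163, 105].pack('C*').force_encoding('UTF-8')
b = [104, 195, 160, 32, 110, 225, 187, 153, 105].pack('C*').force_encoding('UTF-8')

puts a == b                                      # false: the byte sequences differ
puts a.unicode_normalize == b.unicode_normalize  # true: both become the same NFC string

# Listing the code points shows where the two spellings diverge.
p a.codepoints.map { |c| format('U+%04X', c) }
p b.codepoints.map { |c| format('U+%04X', c) }
```

If the comparison happens in many places, it is probably simpler to normalize once when the strings enter the program (user input, file reads) than at every comparison.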

unicode_normalize(form=:nfc)

Returns a normalized form of str, using Unicode normalizations NFC, NFD, NFKC, or NFKD. The normalization form used is determined by form, which is any of the four values :nfc, :nfd, :nfkc, or :nfkd. The default is :nfc.
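
To illustrate the form argument, here is a small sketch that normalizes string b from the question (rebuilt from its bytes, which is my own setup) to NFC and NFD and compares the results. unicode_normalized? is a related String method that checks whether a string is already in a given form.

```ruby
b = [104, 195, 160, 32, 110, 225, 187, 153, 105].pack('C*').force_encoding('UTF-8')

nfc = b.unicode_normalize        # default form is :nfc (precomposed characters)
nfd = b.unicode_normalize(:nfd)  # decomposed: base letters plus combining marks

puts nfc.length                          # 6
puts nfd.length                          # 9
puts nfc == nfd                          # false: same text, different code points
puts nfd.unicode_normalize(:nfc) == nfc  # true: converting back gives the NFC string
puts b.unicode_normalized?               # true: b is already in NFC
```

Either form works for comparison as long as both sides use the same one.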

If the string is not in a Unicode Encoding, then an Exception is raised. In this context, 'Unicode Encoding' means any of UTF-8, UTF-16BE/LE, and UTF-32BE/LE, as well as GB18030, UCS_2BE, and UCS_4BE. Anything other than UTF-8 is implemented by converting to UTF-8, which makes it slower than UTF-8.
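
The encoding restriction can be sketched as follows. The ISO-8859-1 sample string is my own illustration, not from the question. As far as I know the exception raised is Encoding::CompatibilityError, so the sketch rescues its parent class EncodingError to stay on the safe side.

```ruby
# A string in a non-Unicode encoding (hypothetical sample data).
latin1 = "caf\u00E9".encode('ISO-8859-1')

raised = false
begin
  latin1.unicode_normalize
rescue EncodingError => e
  raised = true
  puts "#{e.class}: #{e.message}"
end

# Transcoding to UTF-8 first avoids the error.
utf8 = latin1.encode('UTF-8').unicode_normalize
puts utf8.encoding  # UTF-8
```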



Source: https://stackoverflow.com/questions/48472375/same-string-but-different-bytes-codes
