How to handle Combining Diacritical Marks with UnicodeUtils?

笑着哭i 提交于 2019-12-08 07:44:56

问题


I am trying to insert spaces into a string of IPA characters, e.g. to turn ɔ̃wɔ̃tɨ into ɔ̃ w ɔ̃ t ɨ. Using split/join was my first thought:

s = ɔ̃w̃ɔtɨ
s.split('').join(' ') #=> ̃ ɔ w ̃ ɔ p t ɨ

As I discovered by examining the results, letters with diacritics are in fact encoded as two characters. After some research I found the UnicodeUtils module, and used the each_grapheme method:

UnicodeUtils.each_grapheme(s) {|g| g + ' '} #=> ɔ ̃w ̃ɔ p t ɨ

This worked fine, except for the inverted breve mark. The code changes ̑a into ̑ a. I tried normalization (UnicodeUtils.nfc, UnicodeUtils.nfd), but to no avail. I don't know why the each_grapheme method has a problem with this particular diacritic mark, but I noticed that in gedit, the breve is also treated as a separate character, as opposed to tildes, accents etc. So my question is as follows: is there a straightforward method of normalization, i.e. turning the combination of Latin Small Letter A and Combining Inverted Breve into Latin Small Letter A With Inverted Breve?


回答1:


I understand your question concerns Ruby but I suppose the problem is about the same as with Python. A simple solution is to test the combining diacritical marks explicitly :

import unicodedata
liste=[]
s = u"ɔ̃w̃ɔtɨ"
comb=False
prec=u""
for char in s:
    if unicodedata.combining(char):
        liste.append(prec+char)
        prec=""
    else:
        liste.append(prec)
        prec=char
liste.append(prec)
print " ".join(liste)
>>>>  ɔ̃  w̃  ɔ t ɨ


来源:https://stackoverflow.com/questions/23873771/how-to-handle-combining-diacritical-marks-with-unicodeutils

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!