How do I reverse Unicode decomposition using Python?

隐瞒了意图╮ 2020-12-16 05:58

Using Python 2.5, I have some text stored in a unicode object:

Dinis e Isabel, uma difı´cil relac¸a˜o conjugal e polı´tica

3 answers
  •  醉梦人生
    2020-12-16 06:06

    Unfortunately it seems I actually have (for example) \u00B8 (cedilla) instead of \u0327 (combining cedilla) in my text.

    Eurgh, nasty! You can still do it automatically, though the process wouldn't be entirely lossless as it involves a compatibility decomposition (NFKD).

    Normalise U+00B8 to NFKD and you'll get a space followed by U+0327. You could then scan through the string looking for any case of space-followed-by-combining-character, and remove the space. Finally, recompose to NFC so that each combining character is merged into the character before it.

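    import unicodedata
    s = unicodedata.normalize('NFKD', s)
    # Drop each space that sits directly before a combining character (a trailing space is kept).
    s = ''.join(c for i, c in enumerate(s)
                if c != ' ' or i + 1 == len(s) or unicodedata.combining(s[i + 1]) == 0)
    s = unicodedata.normalize('NFC', s)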
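
For anyone who wants to try this on a sample, here is a minimal sketch of the same three steps wrapped in a function. The name `recompose_spacing_accents` is made up for illustration, and the sketch assumes the stray accents are spacing characters such as U+00B8 (cedilla) or U+02DC (small tilde) placed after their base letter, as in the question's text. It is written to run on Python 3 (the `u''` prefixes are harmless there) rather than tested on 2.5.

```python
import unicodedata


def recompose_spacing_accents(text):
    """Hypothetical helper: fold spacing accents into the preceding letter."""
    # Step 1: compatibility-decompose, e.g. U+00B8 -> U+0020 U+0327.
    decomposed = unicodedata.normalize('NFKD', text)

    # Step 2: drop every space that is immediately followed by a
    # combining character; keep all other characters untouched.
    kept = []
    for i, ch in enumerate(decomposed):
        next_is_combining = (
            i + 1 < len(decomposed)
            and unicodedata.combining(decomposed[i + 1]) != 0
        )
        if ch == ' ' and next_is_combining:
            continue
        kept.append(ch)

    # Step 3: recompose so each combining mark joins its base letter.
    return unicodedata.normalize('NFC', u''.join(kept))


sample = u'relac\u00b8a\u02dco conjugal'
fixed = recompose_spacing_accents(sample)
print(repr(fixed))
```

Two caveats with this sketch: NFKD also rewrites other compatibility characters (ligatures, non-breaking spaces and so on), so the round trip is not lossless, and a base letter with no precomposed accented form, such as the dotless ı (U+0131) in the sample text, should simply keep a separate combining mark after NFC instead of turning into í.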
    
