Unicode normalization (form C) in R : convert all characters with accents into their one-unicode-character form?

Submitted by 可紊 on 2019-11-27 20:58:18

Question


In Unicode, letters with accents can be represented in two ways: as the accented letter itself (a single precomposed code point), or as the bare letter followed by a combining accent. For example, é (U+00E9) and é (U+0065 U+0301) are usually displayed in the same way.

R renders the following (version 3.0.2, Mac OS 10.7.5):

> "\u00e9"
[1] "é"
> "\u0065\u0301"
[1] "é"

However, of course:

> "\u00e9" == "\u0065\u0301"
[1] FALSE

Is there a function in R that converts letters written as two Unicode code points into their single-code-point form? In particular, here it would collapse "\u0065\u0301" into "\u00e9".

That would be extremely handy for processing large quantities of strings. Plus, the one-character forms can easily be converted to other encodings via iconv -- at least for the usual Latin1 characters -- and are better handled by plot.

Thanks a lot in advance.


Answer 1:


Ok, it appears that a package has been developed to enhance and simplify the string manipulation toolbox in R (finally!). It is called stringi and looks very promising. Its documentation is very well written, and in particular I find the pages about encodings and locales much more enlightening than some of the standard R documentation on the subject.

It has Unicode normalization functions, as I was looking for (here form C):

> library(stringi)
> stri_trans_nfc('\u00e9') == stri_trans_nfc('\u0065\u0301')
[1] TRUE
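
For the use case in the question (processing many strings, then converting them with iconv), something along these lines should work. This is only a minimal sketch: the vector `x` is made-up sample data, not from the original post, and it assumes the stringi package is installed.

```r
library(stringi)

# Hypothetical sample data: the same word in composed and decomposed form
x <- c("caf\u00e9", "cafe\u0301")

# Number of code points before normalization: 4 and 5
stri_length(x)

# stri_trans_nfc is vectorized, so the whole vector is normalized in one call
x_nfc <- stri_trans_nfc(x)

# After NFC both elements have 4 code points and compare equal with ==
stri_length(x_nfc)
x_nfc[1] == x_nfc[2]

# The composed forms can then be handed to iconv, as mentioned in the question
x_latin1 <- iconv(x_nfc, from = "UTF-8", to = "latin1")
```

Normalizing once up front means that ordinary `==`, `match` or `duplicated` can be used afterwards without worrying about which form each string was entered in.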

It also contains a smart comparison function which integrates these normalization questions and lessens the pain of having to think about them:

> stri_compare('\u00e9', '\u0065\u0301')
[1] 0
# 0 means the two strings are considered equal;
# otherwise it returns 1 or -1, i.e. greater or lesser in alphabetical (collation) order.
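
If a logical answer is more convenient than the -1/0/1 result, stringi also provides comparison helpers. The sketch below assumes, as I understand the package documentation, that `stri_cmp_eq` compares code point by code point while `stri_cmp_equiv` tests canonical equivalence through the collator. The variable names are just for illustration.

```r
library(stringi)

composed   <- "\u00e9"        # single precomposed code point
decomposed <- "\u0065\u0301"  # base letter plus combining acute accent

# Code-point-wise comparison: does not take canonical equivalence into account
same_code_points <- stri_cmp_eq(composed, decomposed)

# Collator-based comparison: tests canonical equivalence
equivalent <- stri_cmp_equiv(composed, decomposed)

same_code_points
equivalent
```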

Thanks to the developers, Marek Gągolewski and Bartek Tartanus, and to Kurt Hornik for the info!



Source: https://stackoverflow.com/questions/20458834/unicode-normalization-form-c-in-r-convert-all-characters-with-accents-into-t
