Javascript Regex + Unicode Diacritic Combining Characters

Submitted by 独自空忆成欢 on 2019-12-30 21:52:26

Question


I want to match this character in the African Yoruba language 'ẹ́'. Usually this is made by combining an 'é' with a '\u0323' under dot diacritic. I found that:

'é\u0323'.match(/[é]\u0323/) works but
'ẹ́'.match(/[é]\u0323/) does not work.

I don't just want to match e. I want to match all combinations. Right now, my solution involves enumerating all combinations. Like so: /[ÁÀĀÉÈĒẸE̩Ẹ́É̩Ẹ̀È̩Ẹ̄Ē̩ÍÌĪÓÒŌỌO̩Ọ́Ó̩Ọ̀Ò̩Ọ̄Ō̩ÚÙŪṢS̩áàāéèēẹe̩ẹ́é̩ẹ̀è̩ẹ̄ē̩íìīóòōọo̩ọ́ó̩ọ̀ò̩ọ̄ō̩úùūṣs̩]/

Could there not be a shorter and thus better way to do this, or does regex matching in javascript of unicode diacritic combining characters not work this easily? Thank you


Answer 1:


Usually this is made by combining an 'é' with a '\u0323' under dot diacritic

However, that isn't what you have here:

'ẹ́'

That is not U+00E9, U+0323 but U+1EB9, U+0301: an 'ẹ' (e with dot below) combined with an acute diacritic.
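To see the difference concretely, it can help to print the code points of each string. The sketch below uses a hypothetical helper, `codePoints`, which is not part of the original answer, and writes both strings with escapes so the two spellings cannot be confused:

```javascript
// Hypothetical helper: list the code points of a string as U+XXXX labels.
function codePoints(str) {
  return Array.from(str, (ch) =>
    'U+' + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, '0')
  );
}

// What the question describes: e-acute followed by combining dot below.
const asDescribed = '\u00E9\u0323';
// What the answer says the string contains: e-dot-below followed by combining acute.
const asPasted = '\u1EB9\u0301';

console.log(codePoints(asDescribed)); // [ 'U+00E9', 'U+0323' ]
console.log(codePoints(asPasted));    // [ 'U+1EB9', 'U+0301' ]
console.log(asDescribed === asPasted); // false, although both render the same
```

Because the two strings contain different code points, a regex written against one spelling cannot match the other unless both are normalised first.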

The usual solution would be to normalise each string (typically to Unicode Normalization Form C, NFC) before doing the comparison.
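A minimal sketch of that idea, assuming a runtime that provides `String.prototype.normalize` (standardised in ES2015, so newer than this answer). The names `normalizedPattern` and `matchesAfterNFC` are illustrative only:

```javascript
// Assumption: String.prototype.normalize is available (ES2015 or later).
// The NFC form of the target character is U+1EB9 U+0301, so the pattern
// is built from that form.
const normalizedPattern = new RegExp('\u1EB9\u0301'.normalize('NFC'));

// Normalise the input to NFC before testing it against the pattern.
function matchesAfterNFC(input) {
  return normalizedPattern.test(input.normalize('NFC'));
}

console.log(matchesAfterNFC('\u00E9\u0323'));  // true: e-acute + dot below
console.log(matchesAfterNFC('\u1EB9\u0301'));  // true: e-dot-below + acute
console.log(matchesAfterNFC('e\u0301\u0323')); // true: e + acute + dot below
console.log(matchesAfterNFC('\u00E9'));        // false: no dot below
```

All three spellings normalise to the same sequence, so a single pattern covers them.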

I don't just want to match e. I want to match all combinations

Matching without diacriticals is typically done by normalising to Normalization Form D (NFD) and removing all the combining diacritical characters.
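As a rough sketch, again assuming `String.prototype.normalize` is available (ES2015 or later), this could look as follows. The helper name `stripDiacritics` is hypothetical, and the range U+0300 to U+036F covers only the Combining Diacritical Marks block, which is enough for the Yoruba examples here but not for every combining character in Unicode:

```javascript
// Decompose to NFD, then drop code points in the Combining Diacritical
// Marks block (U+0300..U+036F). Other combining blocks are not handled.
function stripDiacritics(str) {
  return str.normalize('NFD').replace(/[\u0300-\u036F]/g, '');
}

console.log(stripDiacritics('\u1EB9\u0301'));       // "e"
console.log(stripDiacritics('\u00E9\u0323'));       // "e"
console.log(stripDiacritics('\u1E62\u1ECD\u0301')); // "So"

// After stripping, a plain ASCII pattern is enough.
console.log(/^[A-Za-z]+$/.test(stripDiacritics('\u1E62\u1ECD\u0301'))); // true
```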

When this answer was written, normalisation was not available natively in JS, so if you wanted it you had to drag in code to do it, which would have to include a large Unicode data table. One such effort is unorm. Since ES2015, String.prototype.normalize() is built in. For picking up characters based on Unicode properties like being a combining diacritical, you would also need a regexp engine with support for the Unicode database, such as XRegExp Unicode Categories; since ES2018, native regular expressions also accept \p{...} property escapes when the u flag is set.

Server-side languages (eg Python, .NET) typically have native support for Unicode normalisation, so if you can do the processing on the server that would generally be easier.




Answer 2:


Normally the solution would be to use Unicode properties and/or scripts, but JavaScript did not support them natively before ES2018.

The XRegExp library adds this support. With it you can use:

\p{L}: to match any kind of letter from any language.

\p{M}: a character intended to be combined with another character (e.g. accents, umlauts, enclosing boxes, etc.).

So your character class would look like this:

[\p{L}\p{M}]+

That would match every letter in Unicode, together with any combining marks attached to it.
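For comparison, here is a sketch of the same character class without XRegExp. It assumes an engine that implements ES2018 Unicode property escapes, which postdates this answer; the `u` flag is required, and the variable names are illustrative:

```javascript
// Assumption: the engine supports \p{...} with the u flag (ES2018 or later).
// \p{L} matches letters, \p{M} matches combining marks.
const lettersAndMarks = /[\p{L}\p{M}]+/gu;

// Two spellings of the Yoruba character, an ASCII word, and some digits.
const sampleText = '\u1EB9\u0301 \u00E9\u0323 abc 123';

const found = sampleText.match(lettersAndMarks);
console.log(found.length); // 3: both accented spellings and "abc"; "123" is skipped
```

Note that this matches either spelling as written but does not make them equal to each other; normalising first is still needed for comparisons.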

If you want to limit it, have a look at Unicode scripts and replace \p{L} with a script, which collects all letters used by certain languages, e.g. \p{Latin} for all Latin letters or \p{Cyrillic} for all Cyrillic letters.
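A sketch of the script-limited variant using native syntax, assuming an ES2018 or later engine. The `\p{Latin}` shorthand above is XRegExp's; native regular expressions spell it `\p{Script=Latin}`. Combining marks are kept via `\p{M}` because most of them do not carry the Latin script property themselves. Variable names are illustrative:

```javascript
// Assumption: ES2018 property escapes with the u flag.
// Latin-script letters plus any combining marks attached to them.
const latinWithMarks = /[\p{Script=Latin}\p{M}]+/gu;

// A Latin word with Yoruba diacritics, followed by a Cyrillic word.
const mixedText = '\u1EB9\u0301b\u00E0 \u043F\u0440\u0438\u0432\u0435\u0442';

const latinOnly = mixedText.match(latinWithMarks);
console.log(latinOnly.length); // 1: only the Latin word is matched
```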



Source: https://stackoverflow.com/questions/17357716/javascript-regex-unicode-diacritic-combining-characters
