JavaScript Regex + Unicode Diacritic Combining Characters

Question

I want to match this character in the African Yoruba language 'ẹ́'. Usually this is made by combining an 'é' with a '\u0323' under dot diacritic. I found that:

'é\u0323'.match(/[é]\u0323/) works but
'ẹ́'.match(/[é]\u0323/) does not work.

I don't just want to match e. I want to match all combinations. Right now, my solution involves enumerating all combinations. Like so: /[ÁÀĀÉÈĒẸE̩Ẹ́É̩Ẹ̀È̩Ẹ̄Ē̩ÍÌĪÓÒŌỌO̩Ọ́Ó̩Ọ̀Ò̩Ọ̄Ō̩ÚÙŪṢS̩áàāéèēẹe̩ẹ́é̩ẹ̀è̩ẹ̄ē̩íìīóòōọo̩ọ́ó̩ọ̀ò̩ọ̄ō̩úùūṣs̩]/

Is there a shorter, better way to do this, or is matching Unicode combining diacritics with JavaScript regexes simply not this easy? Thank you.

Answer 1:

Usually this is made by combining an 'é' with a '\u0323' under dot diacritic

However, that isn't what you have here:

'ẹ́'

That's not U+00E9,U+0323 (é followed by a combining dot below) but U+1EB9,U+0301 - an ẹ combined with an acute diacritic.
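To see which code points a string really contains, it can help to dump them. The following is only a small sketch; `codePoints` is a hypothetical helper name, not something from the question, and it assumes an ES2015-capable engine.

```javascript
// Hypothetical helper: list the code points of a string as U+XXXX labels.
function codePoints(text) {
  // Array.from iterates by code point, not by UTF-16 code unit.
  return Array.from(text, (ch) =>
    'U+' + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, '0')
  );
}

// The two visually identical spellings discussed above:
console.log(codePoints('\u00E9\u0323')); // é followed by combining dot below
console.log(codePoints('\u1EB9\u0301')); // ẹ followed by combining acute
```

The two strings render the same but are different sequences, which is why a pattern written for one does not match the other.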

The usual solution would be to normalise each string (typically to Unicode Normal Form C) before doing the comparison.
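As a rough sketch of that idea, assuming an engine that implements the ES2015 `String.prototype.normalize` method (which postdates this answer), both spellings can be brought to the same form before matching. The helper name `sameAfterNfc` is made up for illustration.

```javascript
// Hypothetical helper: compare two strings after normalising both to NFC.
function sameAfterNfc(a, b) {
  return a.normalize('NFC') === b.normalize('NFC');
}

const acuteThenDot = '\u00E9\u0323'; // é + combining dot below
const dotThenAcute = '\u1EB9\u0301'; // ẹ + combining acute

console.log(acuteThenDot === dotThenAcute);            // false
console.log(sameAfterNfc(acuteThenDot, dotThenAcute)); // true

// Normalise the pattern source and the input the same way before matching.
const pattern = new RegExp(acuteThenDot.normalize('NFC'));
console.log(pattern.test(dotThenAcute.normalize('NFC'))); // true
```

Normalising both the pattern text and the input keeps the regex itself short, since it only has to describe one spelling of each character.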

I don't just want to match e. I want to match all combinations

Matching without diacriticals is typically done by normalising to Normal Form D and removing all the combining diacritical characters.
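A minimal sketch of that approach, again assuming ES2015 `String.prototype.normalize` is available. The name `stripDiacritics` is hypothetical, and the range U+0300 to U+036F covers only the Combining Diacritical Marks block, which is an assumption that fits the Yoruba examples here but not every script.

```javascript
// Hypothetical helper: decompose to NFD, then drop combining marks
// from the Combining Diacritical Marks block (U+0300..U+036F).
function stripDiacritics(text) {
  return text.normalize('NFD').replace(/[\u0300-\u036f]/g, '');
}

console.log(stripDiacritics('\u1EB9\u0301'));               // "e"
console.log(stripDiacritics('\u00E9\u0323'));               // "e"
console.log(stripDiacritics('\u1ECC\u0300r\u1ECD\u0300'));  // "Oro"

// Matching "any e, whatever its diacritics" then becomes a plain test.
console.log(/^e$/.test(stripDiacritics('\u1EB9\u0301')));   // true
```

Stripping is lossy, so it suits searching and comparison rather than storage.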

Unfortunately, normalisation was not available natively in JS when this answer was written, so you would have had to drag in code to do it, which would have to include a large Unicode data table. One such effort is unorm. (Engines that implement ES2015 or later provide String.prototype.normalize.) For picking up characters based on Unicode properties, like being a combining diacritical, you'd also need a regexp engine with support for the Unicode database, such as XRegExp Unicode Categories.

Server-side languages (e.g. Python, .NET) typically have native support for Unicode normalisation, so if you can do the processing on the server, that would generally be easier.

Answer 2:

Normally the solution would be to use Unicode properties and/or scripts, but JavaScript did not support them natively when this answer was written (ES2018 later added \p{...} escapes for regexes that use the u flag).

However, the XRegExp library adds this support. With this library you can use

\p{L}: to match any kind of letter from any language.

\p{M}: a character intended to be combined with another character (e.g. accents, umlauts, enclosing boxes, etc.).

So your character class would look like this:

[\p{L}\p{M}]+

that would match all possible letters that are in the Unicode table.

If you want to limit it, you can have a look at Unicode scripts and replace \p{L} with a script. Scripts group the letters used by certain languages, e.g. \p{Latin} for all Latin letters or \p{Cyrillic} for all Cyrillic letters.

Source: https://stackoverflow.com/questions/17357716/javascript-regex-unicode-diacritic-combining-characters
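The same classes can be sketched without XRegExp in engines that implement ES2018 Unicode property escapes, which is an assumption about the runtime and not part of the original answer. Native syntax needs the `u` flag and spells scripts as `\p{Script=Latin}`; the function names below are hypothetical.

```javascript
// A base letter followed by any number of combining marks.
const letterWithMarks = /\p{L}\p{M}*/gu;

// Hypothetical helper: split text into letter-plus-marks units.
function splitLetters(text) {
  return text.match(letterWithMarks) || [];
}

// Runs of Latin letters with their marks. Combining marks are not
// themselves in the Latin script, so \p{M} is listed separately.
const latinWord = /[\p{Script=Latin}\p{M}]+/gu;

// Hypothetical helper: extract Latin-script words, diacritics included.
function latinWords(text) {
  return text.match(latinWord) || [];
}

console.log(splitLetters('\u1EB9\u0301a').length);        // 2
console.log(splitLetters('e\u0301\u0323').length);        // 1
console.log(latinWords('\u1ECC\u0300r\u1ECD\u0300 \u0441\u043B\u043E\u0432\u043E').length); // 1
```

This matches every spelling of an accented letter without enumerating the combinations, at the cost of also accepting letters and marks that Yoruba does not use.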

Tags

javascript

regex

unicode

diacritics