Remove diacritical marks (ń ǹ ň ñ ṅ ņ ṇ ṋ ṉ ̈ ɲ ƞ ᶇ ɳ ȵ) from Unicode chars

I am looking for an algorithm that can map between characters with diacritics (tilde, circumflex, caret, umlaut, caron) and their "simple" character.

For example:

12 Answers

一个人的身影

2020-11-22 11:58
It's part of Apache Commons Lang as of ver. 3.1.
```
org.apache.commons.lang3.StringUtils.stripAccents("Añ");
```
returns "An".

暖寄归人

2020-11-22 11:58

Something to consider: if you go the route of trying to get a single "translation" of each word, you may miss out on some possible alternatives.

For instance, in German, when replacing the eszett (ß), some people might use "B", while others might use "ss". Likewise, an umlauted o (ö) might be replaced with "o" or "oe". Ideally, any solution you come up with should include both.
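
As a rough sketch of that idea (not from the original answer; the class name `FoldAlternatives`, the `ALTERNATIVES` table, and the `variants` method are all hypothetical), a Java version could keep a list of replacements per character and generate every combination. It assumes Java 9 or later for `Map.of` and `List.of`, and it only covers the two characters mentioned above:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

class FoldAlternatives {
    // Hypothetical table: each character maps to every replacement worth trying.
    static final Map<Character, List<String>> ALTERNATIVES = Map.of(
        '\u00DF', List.of("ss", "B"),  // eszett
        '\u00F6', List.of("o", "oe")   // o with umlaut
    );

    // Returns every spelling obtained by choosing one replacement per character.
    static List<String> variants(String word) {
        List<String> results = new ArrayList<>();
        results.add("");
        for (char c : word.toCharArray()) {
            // Characters without an entry are kept unchanged.
            List<String> options =
                ALTERNATIVES.getOrDefault(c, List.of(String.valueOf(c)));
            List<String> next = new ArrayList<>();
            for (String prefix : results) {
                for (String option : options) {
                    next.add(prefix + option);
                }
            }
            results = next;
        }
        return results;
    }

    public static void main(String[] args) {
        // Prints [Grosse, GroBe, Groesse, GroeBe]
        System.out.println(variants("Gr\u00F6\u00DFe"));
    }
}
```

The number of variants multiplies with each ambiguous character, so this probably only suits short strings such as search terms.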

悲&欢浪女

2020-11-22 11:58

In Windows and .NET, I just convert using string encoding. That way I avoid manual mapping and coding.

Try experimenting with string encodings.

耶瑟儿～

2020-11-22 12:01
The easiest way (to me) would be to maintain a sparse mapping array that changes your Unicode code points into displayable strings.

Such as:
```
start    = 0x00C0
size     = 23
mappings = {
    "A","A","A","A","A","A","AE","C",
    "E","E","E","E","I","I","I", "I",
    "D","N","O","O","O","O","O"
}
start    = 0x00D8
size     = 6
mappings = {
    "O","U","U","U","U","Y"
}
start    = 0x00E0
size     = 23
mappings = {
    "a","a","a","a","a","a","ae","c",
    "e","e","e","e","i","i","i", "i",
    "d","n","o","o","o","o","o"
}
start    = 0x00F8
size     = 6
mappings = {
    "o","u","u","u","u","y"
}
: : :
```
The use of a sparse array will allow you to efficiently represent replacements even when they are in widely spaced sections of the Unicode table. String replacements will allow arbitrary sequences to replace your diacritics (such as the æ grapheme becoming ae).

This is a language-agnostic answer so, if you have a specific language in mind, there will be better ways (although they'll all likely come down to this at the lowest levels anyway).
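
For illustration, here is a minimal Java sketch of the range table in the pseudocode above. The names `SparseFold`, `Range`, and `fold` are made up for this example, and only the four Latin-1 ranges listed above are included. Code points outside every range pass through unchanged.

```java
class SparseFold {
    /** One contiguous run of code points beginning at 'start'. */
    static final class Range {
        final int start;
        final String[] mappings;

        Range(int start, String... mappings) {
            this.start = start;
            this.mappings = mappings;
        }
    }

    // The four ranges from the pseudocode table (size = mappings.length).
    static final Range[] RANGES = {
        new Range(0x00C0, "A", "A", "A", "A", "A", "A", "AE", "C",
                          "E", "E", "E", "E", "I", "I", "I", "I",
                          "D", "N", "O", "O", "O", "O", "O"),
        new Range(0x00D8, "O", "U", "U", "U", "U", "Y"),
        new Range(0x00E0, "a", "a", "a", "a", "a", "a", "ae", "c",
                          "e", "e", "e", "e", "i", "i", "i", "i",
                          "d", "n", "o", "o", "o", "o", "o"),
        new Range(0x00F8, "o", "u", "u", "u", "u", "y"),
    };

    // Replaces each code point found in a range; copies everything else as-is.
    static String fold(String input) {
        StringBuilder out = new StringBuilder(input.length());
        input.codePoints().forEach(cp -> {
            for (Range r : RANGES) {
                int offset = cp - r.start;
                if (offset >= 0 && offset < r.mappings.length) {
                    out.append(r.mappings[offset]);
                    return; // handled, move on to the next code point
                }
            }
            out.appendCodePoint(cp);
        });
        return out.toString();
    }

    public static void main(String[] args) {
        // Prints AEsop, manana, Orsted
        System.out.println(fold("\u00C6sop, ma\u00F1ana, \u00D8rsted"));
    }
}
```

The lookup here is a linear scan, which is fine for a handful of ranges. With many ranges, keeping them sorted by start and using a binary search would likely be the better choice.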

故里飘歌

2020-11-22 12:03

Unicode has specific diacritic characters (combining characters), and a string can be converted so that each base character and its diacritics are separated. Then you can just remove the diacritics from the string and you're basically done.

For more information on normalization, decompositions and equivalence, see The Unicode Standard at the Unicode home page.

However, how you can actually achieve this depends on the framework/OS/... you're working on. If you're using .NET, you can use the String.Normalize method accepting the System.Text.NormalizationForm enumeration.
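
The answer mentions .NET; as a comparable sketch in Java (an assumption on my part about the target platform, with the class name `StripMarks` and method `stripMarks` invented for the example), the JDK's `java.text.Normalizer` can decompose the string, after which a regular expression drops the combining marks:

```java
import java.text.Normalizer;

class StripMarks {
    // Decompose to NFD so base letters and combining marks are separate
    // code points, then delete every code point in Unicode category M (Mark).
    static String stripMarks(String input) {
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}+", "");
    }

    public static void main(String[] args) {
        // A followed by n-with-tilde; prints An
        System.out.println(stripMarks("A\u00F1"));
    }
}
```

Note that letters without a canonical decomposition, such as ø or ß, pass through this approach unchanged, so it may need to be combined with a mapping table.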

遇见更好的自我

2020-11-22 12:11
There is a draft report on character folding on the Unicode website that has a lot of relevant material. See specifically Section 4.1, "Folding algorithm".

Here's a discussion and implementation of diacritic marker removal using Perl.

These existing SO questions are related:
- How to convert UTF-8 to US ASCII
- How to change diacritic characters to non-diacritic ones