Remove diacritical marks (ń ǹ ň ñ ṅ ņ ṇ ṋ ṉ ̈ ɲ ƞ ᶇ ɳ ȵ) from Unicode chars

后端未结

关注

 12  745

故里飘歌 2020-11-22 11:42

I am looking at an algorithm that can map between characters with diacritics (tilde, circumflex, caret, umlaut, caron) and their \"simple\" character.

For example:

12条回答

囚心锁ツ (楼主)

2020-11-22 11:45
You could use the Normalizer class from java.text:
```
System.out.println(new String(Normalizer.normalize("ń ǹ ň ñ ṅ ņ ṇ ṋ", Normalizer.Form.NFKD).getBytes("ascii"), "ascii"));
```
But there is still some work to do, since Java makes strange things with unconvertable Unicode characters (it does not ignore them, and it does not throw an exception). But I think you could use that as an starting point.
0 讨论(0)

查看其它12个回答
发布评论:

提交评论
- 加载中...