I intend to normalize to Form C, then divide into "display units", basically a glyph plus all following combining characters. For now, I'm just looking to handle the Lati
OK I did hack up something similar recently. Enjoy!
    import java.util.ArrayList;
    import java.util.List;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public static List<String> stringToCharacterWithCombiningChars(String fullText) {
        // \p{M} is any kind of 'mark': http://stackoverflow.com/questions/29110887/detect-any-combining-character-in-java/29111105
        // Each match is either a run of marks with no base character, or one non-mark followed by its marks.
        Pattern splitWithCombiningChars = Pattern.compile("(\\p{M}+|\\P{M}\\p{M}*)");
        Matcher matcher = splitWithCombiningChars.matcher(fullText);
        List<String> outGoing = new ArrayList<>();
        while (matcher.find()) {
            outGoing.add(matcher.group());
        }
        return outGoing;
    }
The accompanying (passing) unit test, in case it is useful to others: https://gist.github.com/rdp/0014de502f37abd64ffd
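Since the question asks to normalize to Form C before splitting, here is a minimal sketch of how the two steps might be combined. The class name `DisplayUnits` and method name `split` are made up for illustration; `java.text.Normalizer` is the standard-library normalizer. This is the same "base plus marks" idea as above, not full grapheme cluster segmentation.

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DisplayUnits {
    // One non-mark followed by any marks, or a run of marks with no base.
    private static final Pattern UNIT = Pattern.compile("\\P{M}\\p{M}*|\\p{M}+");

    /** Normalizes to NFC, then splits into base-plus-marks units. */
    public static List<String> split(String text) {
        String nfc = Normalizer.normalize(text, Normalizer.Form.NFC);
        List<String> units = new ArrayList<>();
        Matcher m = UNIT.matcher(nfc);
        while (m.find()) {
            units.add(m.group());
        }
        return units;
    }

    public static void main(String[] args) {
        // "e" + COMBINING ACUTE ACCENT composes to a single code point under NFC;
        // "x" + COMBINING ACUTE ACCENT has no precomposed form and stays two code points.
        for (String unit : split("e\u0301x\u0301")) {
            System.out.println(unit.length() + " char(s): " + unit);
        }
    }
}
```

Normalizing first means sequences with a precomposed form collapse to one code point, so the regex only has to group the marks that NFC could not absorb. `java.text.BreakIterator.getCharacterInstance()` is another option, though its boundaries may differ from this regex.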
@lenz's answer covers most of the codepoints, but some were missing. Below is a list of ranges found by processing the Names List file. Some codepoints have COMBINING in their name but are not combining characters, for example the Combining Grapheme Joiner (CGJ, 0x34f). As the Wikipedia article puts it:
Its name is a misnomer and does not describe its function; the character does not join graphemes. Its purpose is to separate characters that should not be considered digraphs.
When processing the list, the following ranges (and single characters) were found. Entries that differ, even slightly, from lenz's list are marked with an exclamation mark (!). Often a range is slightly off, for example because one character inside it is not a combining character, so the range is "cut in two":
0x300 - 0x34e !
0x350 - 0x36f !
0x483 - 0x487 !
0x591 - 0x5bd !
0x5bf !
0x5c1 - 0x5c2 !
0x5c4 - 0x5c5 !
0x5c7 !
0x610 - 0x61a !
0x64b - 0x65f !
0x670 !
0x6d6 - 0x6dc !
0x6df - 0x6e4 !
0x6e7 - 0x6e8 !
0x6ea - 0x6ed !
0x711 !
0x730 - 0x74a !
0x7eb - 0x7f3
0x816 - 0x819 !
0x81b - 0x823 !
0x825 - 0x827 !
0x829 - 0x82d !
0x859 - 0x85b !
0x8d4 - 0x8e1 !
0x8e3 - 0x8ff !
0x93c !
0x94d !
0x951 - 0x954 !
0x9bc !
0x9cd !
0xa3c !
0xa4d !
0xabc !
0xacd !
0xb3c !
0xb4d !
0xbcd !
0xc4d !
0xc55 - 0xc56 !
0xcbc !
0xccd !
0xd4d !
0xdca !
0xe38 - 0xe3a !
0xe48 - 0xe4b !
0xeb8 - 0xeb9 !
0xec8 - 0xecb !
0xf18 - 0xf19 !
0xf35 !
0xf37 !
0xf39 !
0xf71 - 0xf72 !
0xf74 !
0xf7a - 0xf7d !
0xf80 !
0xf82 - 0xf84 !
0xf86 - 0xf87 !
0xfc6 !
0x1037 !
0x1039 - 0x103a !
0x108d !
0x135d - 0x135f !
0x1714 !
0x1734 !
0x17d2 !
0x17dd !
0x18a9 !
0x1939 - 0x193b !
0x1a17 - 0x1a18 !
0x1a60 !
0x1a75 - 0x1a7c !
0x1a7f
0x1ab0 - 0x1abd !
0x1b34 !
0x1b44 !
0x1b6b - 0x1b73
0x1baa - 0x1bab !
0x1be6 !
0x1bf2 - 0x1bf3 !
0x1c37 !
0x1cd0 - 0x1cd2 !
0x1cd4 - 0x1ce0 !
0x1ce2 - 0x1ce8 !
0x1ced !
0x1cf4 !
0x1cf8 - 0x1cf9 !
0x1dc0 - 0x1df5 !
0x1dfb - 0x1dff !
0x20d0 - 0x20dc !
0x20e1 !
0x20e5 - 0x20f0 !
0x2cef - 0x2cf1
0x2d7f !
0x2de0 - 0x2dff
0x302a - 0x302f !
0x3099 - 0x309a
0xa66f !
0xa674 - 0xa67d !
0xa69e - 0xa69f !
0xa6f0 - 0xa6f1
0xa806 !
0xa8c4 !
0xa8e0 - 0xa8f1
0xa92b - 0xa92d !
0xa953 !
0xa9b3 !
0xa9c0 !
0xaab0 !
0xaab2 - 0xaab4 !
0xaab7 - 0xaab8 !
0xaabe - 0xaabf !
0xaac1 !
0xaaf6 !
0xabed !
0xfb1e !
0xfe20 - 0xfe2f !
0x101fd
0x102e0 !
0x10376 - 0x1037a !
0x10a0d !
0x10a0f !
0x10a38 - 0x10a3a !
0x10a3f !
0x10ae5 - 0x10ae6 !
0x11046 !
0x1107f !
0x110b9 - 0x110ba !
0x11100 - 0x11102 !
0x11133 - 0x11134 !
0x11173 !
0x111c0 !
0x111ca !
0x11235 - 0x11236 !
0x112e9 - 0x112ea !
0x1133c !
0x1134d !
0x11366 - 0x1136c !
0x11370 - 0x11374 !
0x11442 !
0x11446 !
0x114c2 - 0x114c3 !
0x115bf - 0x115c0 !
0x1163f !
0x116b6 - 0x116b7 !
0x1172b !
0x11c3f !
0x16af0 - 0x16af4 !
0x16b30 - 0x16b36 !
0x1bc9e !
0x1d165 - 0x1d169
0x1d16d - 0x1d172
0x1d17b - 0x1d182
0x1d185 - 0x1d18b
0x1d1aa - 0x1d1ad
0x1d242 - 0x1d244
0x1e000 - 0x1e006 !
0x1e008 - 0x1e018 !
0x1e01b - 0x1e021 !
0x1e023 - 0x1e024 !
0x1e026 - 0x1e02a !
0x1e8d0 - 0x1e8d6 !
0x1e944 - 0x1e94a !
This results in a total of 814 codepoints.
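A range list like this can be turned into a lookup with a binary search over the range starts. The sketch below is only an illustration: it hard-codes the first five entries of the list above as an excerpt, and the class and method names (`CombiningRanges`, `contains`, `count`) are hypothetical.

```java
import java.util.Arrays;

public class CombiningRanges {
    // Illustrative excerpt only: the first five entries of the list above,
    // as sorted, non-overlapping ranges with inclusive ends.
    private static final int[] STARTS = {0x300, 0x350, 0x483, 0x591, 0x5bf};
    private static final int[] ENDS   = {0x34e, 0x36f, 0x487, 0x5bd, 0x5bf};

    /** Returns true if the code point falls inside one of the ranges. */
    public static boolean contains(int codePoint) {
        int i = Arrays.binarySearch(STARTS, codePoint);
        if (i >= 0) {
            return true; // exactly a range start
        }
        int before = -i - 2; // index of the last start below codePoint, or -1
        return before >= 0 && codePoint <= ENDS[before];
    }

    /** Counts the code points covered by all ranges. */
    public static int count() {
        int total = 0;
        for (int i = 0; i < STARTS.length; i++) {
            total += ENDS[i] - STARTS[i] + 1;
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println(contains(0x301)); // COMBINING ACUTE ACCENT: true
        System.out.println(contains(0x34f)); // CGJ sits in the gap between two ranges: false
    }
}
```

With the full list loaded into the two arrays, `count()` would be one way to re-check the stated total.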
These are all the ranges of Unicode code points whose name contains the word 'combining' (e.g. 301 COMBINING ACUTE ACCENT):
300-36F
483-489
7EB-7F3
135F-135F
1A7F-1A7F
1B6B-1B73
1DC0-1DE6
1DFD-1DFF
20D0-20F0
2CEF-2CF1
2DE0-2DFF
3099-309A
A66F-A672
A67C-A67D
A6F0-A6F1
A8E0-A8F1
FE20-FE26
101FD-101FD
1D165-1D169
1D16D-1D172
1D17B-1D182
1D185-1D18B
1D1AA-1D1AD
1D242-1D244
I compiled this list with a Python script, making use of the unicodedata module.
I don't know what version of Unicode this is exactly, but I think it's reasonably up to date.
However, I don't know whether characters that are 'combining' in the strict sense are all you need, as there are also 'modifier letters' and the like in Unicode.
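The original Python script is not shown, so the following is only a rough Java equivalent of the name-based scan described above, not a reconstruction of it. It relies on `Character.getName`, which returns `null` for unassigned code points; the class name `CombiningNames` and method name `ranges` are made up. The output depends on the Unicode version bundled with the JVM, so it will not necessarily match the list above.

```java
import java.util.ArrayList;
import java.util.List;

public class CombiningNames {
    /** Collects inclusive {start, end} ranges in [from, to] whose names contain "COMBINING". */
    public static List<int[]> ranges(int from, int to) {
        List<int[]> result = new ArrayList<>();
        int start = -1; // start of the range currently being extended, or -1
        for (int cp = from; cp <= to; cp++) {
            String name = Character.getName(cp); // null for unassigned code points
            boolean hit = name != null && name.contains("COMBINING");
            if (hit && start < 0) {
                start = cp;
            } else if (!hit && start >= 0) {
                result.add(new int[] {start, cp - 1});
                start = -1;
            }
        }
        if (start >= 0) {
            result.add(new int[] {start, to}); // range still open at the end of the scan
        }
        return result;
    }

    public static void main(String[] args) {
        for (int[] r : ranges(0, Character.MAX_CODE_POINT)) {
            System.out.printf("%X-%X%n", r[0], r[1]);
        }
    }
}
```

Because it matches on the name alone, this scan also picks up characters such as COMBINING GRAPHEME JOINER (34F), which is why 300-36F comes out as a single range here.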