Algorithm to check for combining characters in Unicode

耶瑟儿~ 2020-12-18 02:45

I intend to normalize to Form C, then divide into "display units", basically a glyph plus all following combining characters. For now, I'm just looking to handle the Lati

3 Answers
  • 2020-12-18 03:24

    OK I did hack up something similar recently. Enjoy!

      import java.util.ArrayList;
      import java.util.List;
      import java.util.regex.Matcher;
      import java.util.regex.Pattern;

      public static List<String> stringToCharacterWithCombiningChars(String fullText) {
        // {M} is any kind of 'mark', see
        // http://stackoverflow.com/questions/29110887/detect-any-combining-character-in-java/29111105
        // Match either a run of marks with no base, or one non-mark plus the marks that follow it.
        Pattern splitWithCombiningChars = Pattern.compile("(\\p{M}+|\\P{M}\\p{M}*)");
        Matcher matcher = splitWithCombiningChars.matcher(fullText);
        List<String> outGoing = new ArrayList<>();
        while (matcher.find()) {
          outGoing.add(matcher.group());
        }
        return outGoing;
      }
    

    The accompanying (passing) unit test, in case it is useful to others: https://gist.github.com/rdp/0014de502f37abd64ffd
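
    Since the question mentions normalizing to Form C first, here is a minimal sketch that does both steps together, assuming `java.text.Normalizer` is acceptable for the normalization. The names `DisplayUnits` and `split` are made up for illustration; the regex is the same idea as in the method above.

    ```java
    import java.text.Normalizer;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    class DisplayUnits {
        // One non-mark code point plus any marks after it, or a run of marks with no base.
        private static final Pattern UNIT = Pattern.compile("\\P{M}\\p{M}*|\\p{M}+");

        static List<String> split(String text) {
            // NFC composes base + mark pairs wherever a precomposed code point exists.
            String nfc = Normalizer.normalize(text, Normalizer.Form.NFC);
            List<String> units = new ArrayList<>();
            Matcher m = UNIT.matcher(nfc);
            while (m.find()) {
                units.add(m.group());
            }
            return units;
        }

        public static void main(String[] args) {
            // e + COMBINING ACUTE composes to U+00E9; x + COMBINING ACUTE has no precomposed form.
            System.out.println(split("e\u0301x\u0301").size()); // prints 2
        }
    }
    ```

    With that sample input, the first unit is a single code point after NFC and the second unit is still two code points. Note this is only "base plus marks"; full grapheme cluster segmentation as described in UAX #29 has more rules than this.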

  • 2020-12-18 03:32

    @lenz's answer covers most of the code points, but some are missing. Below is a list of ranges found by processing the Names List file. Some code points have COMBINING in their name but are not combining characters, for example the Combining Grapheme Joiner (CGJ, 0x34f). As the Wikipedia article puts it:

    Its name is a misnomer and does not describe its function; the character does not join graphemes. Its purpose is to separate characters that should not be considered digraphs.

    Processing the list produced the following ranges (and single characters). Entries that differ, even slightly, from lenz's list are marked with an exclamation mark (!). Often a range is only slightly off, for example because one character inside it does not belong, so the range is "cut in two":

      0x300 -   0x34e  !
      0x350 -   0x36f  !
      0x483 -   0x487  !
      0x591 -   0x5bd  !
      0x5bf            !
      0x5c1 -   0x5c2  !
      0x5c4 -   0x5c5  !
      0x5c7            !
      0x610 -   0x61a  !
      0x64b -   0x65f  !
      0x670            !
      0x6d6 -   0x6dc  !
      0x6df -   0x6e4  !
      0x6e7 -   0x6e8  !
      0x6ea -   0x6ed  !
      0x711            !
      0x730 -   0x74a  !
      0x7eb -   0x7f3
      0x816 -   0x819  !
      0x81b -   0x823  !
      0x825 -   0x827  !
      0x829 -   0x82d  !
      0x859 -   0x85b  !
      0x8d4 -   0x8e1  !
      0x8e3 -   0x8ff  !
      0x93c            !
      0x94d            !
      0x951 -   0x954  !
      0x9bc            !
      0x9cd            !
      0xa3c            !
      0xa4d            !
      0xabc            !
      0xacd            !
      0xb3c            !
      0xb4d            !
      0xbcd            !
      0xc4d            !
      0xc55 -   0xc56  !
      0xcbc            !
      0xccd            !
      0xd4d            !
      0xdca            !
      0xe38 -   0xe3a  !
      0xe48 -   0xe4b  !
      0xeb8 -   0xeb9  !
      0xec8 -   0xecb  !
      0xf18 -   0xf19  !
      0xf35            !
      0xf37            !
      0xf39            !
      0xf71 -   0xf72  !
      0xf74            !
      0xf7a -   0xf7d  !
      0xf80            !
      0xf82 -   0xf84  !
      0xf86 -   0xf87  !
      0xfc6            !
     0x1037            !
     0x1039 -  0x103a  !
     0x108d            !
     0x135d -  0x135f  !
     0x1714            !
     0x1734            !
     0x17d2            !
     0x17dd            !
     0x18a9            !
     0x1939 -  0x193b  !
     0x1a17 -  0x1a18  !
     0x1a60            !
     0x1a75 -  0x1a7c  !
     0x1a7f
     0x1ab0 -  0x1abd  !
     0x1b34            !
     0x1b44            !
     0x1b6b -  0x1b73
     0x1baa -  0x1bab  !
     0x1be6            !
     0x1bf2 -  0x1bf3  !
     0x1c37            !
     0x1cd0 -  0x1cd2  !
     0x1cd4 -  0x1ce0  !
     0x1ce2 -  0x1ce8  !
     0x1ced            !
     0x1cf4            !
     0x1cf8 -  0x1cf9  !
     0x1dc0 -  0x1df5  !
     0x1dfb -  0x1dff  !
     0x20d0 -  0x20dc  !
     0x20e1            !
     0x20e5 -  0x20f0  !
     0x2cef -  0x2cf1
     0x2d7f            !
     0x2de0 -  0x2dff
     0x302a -  0x302f  !
     0x3099 -  0x309a
     0xa66f            !
     0xa674 -  0xa67d  !
     0xa69e -  0xa69f  !
     0xa6f0 -  0xa6f1
     0xa806            !
     0xa8c4            !
     0xa8e0 -  0xa8f1
     0xa92b -  0xa92d  !
     0xa953            !
     0xa9b3            !
     0xa9c0            !
     0xaab0            !
     0xaab2 -  0xaab4  !
     0xaab7 -  0xaab8  !
     0xaabe -  0xaabf  !
     0xaac1            !
     0xaaf6            !
     0xabed            !
     0xfb1e            !
     0xfe20 -  0xfe2f  !
    0x101fd
    0x102e0            !
    0x10376 - 0x1037a  !
    0x10a0d            !
    0x10a0f            !
    0x10a38 - 0x10a3a  !
    0x10a3f            !
    0x10ae5 - 0x10ae6  !
    0x11046            !
    0x1107f            !
    0x110b9 - 0x110ba  !
    0x11100 - 0x11102  !
    0x11133 - 0x11134  !
    0x11173            !
    0x111c0            !
    0x111ca            !
    0x11235 - 0x11236  !
    0x112e9 - 0x112ea  !
    0x1133c            !
    0x1134d            !
    0x11366 - 0x1136c  !
    0x11370 - 0x11374  !
    0x11442            !
    0x11446            !
    0x114c2 - 0x114c3  !
    0x115bf - 0x115c0  !
    0x1163f            !
    0x116b6 - 0x116b7  !
    0x1172b            !
    0x11c3f            !
    0x16af0 - 0x16af4  !
    0x16b30 - 0x16b36  !
    0x1bc9e            !
    0x1d165 - 0x1d169
    0x1d16d - 0x1d172
    0x1d17b - 0x1d182
    0x1d185 - 0x1d18b
    0x1d1aa - 0x1d1ad
    0x1d242 - 0x1d244
    0x1e000 - 0x1e006  !
    0x1e008 - 0x1e018  !
    0x1e01b - 0x1e021  !
    0x1e023 - 0x1e024  !
    0x1e026 - 0x1e02a  !
    0x1e8d0 - 0x1e8d6  !
    0x1e944 - 0x1e94a  !
    

    This results in a total of 814 codepoints.
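
    If you want to use a hand-built table like this at runtime, one option is a sorted array of ranges with a binary search. A rough sketch, with only the first eight rows of the list copied in to keep it short (`CombiningRanges` and `inRanges` are hypothetical names, not from any library):

    ```java
    class CombiningRanges {
        // Excerpt only: the first eight rows of the list above as {start, end} pairs.
        // The rows must stay sorted by start and must not overlap.
        // A real table would carry every row.
        private static final int[][] RANGES = {
            {0x300, 0x34e}, {0x350, 0x36f}, {0x483, 0x487}, {0x591, 0x5bd},
            {0x5bf, 0x5bf}, {0x5c1, 0x5c2}, {0x5c4, 0x5c5}, {0x5c7, 0x5c7},
        };

        // Binary search for the range that contains codePoint, if any.
        static boolean inRanges(int codePoint) {
            int lo = 0;
            int hi = RANGES.length - 1;
            while (lo <= hi) {
                int mid = (lo + hi) >>> 1;
                if (codePoint < RANGES[mid][0]) {
                    hi = mid - 1;
                } else if (codePoint > RANGES[mid][1]) {
                    lo = mid + 1;
                } else {
                    return true;
                }
            }
            return false;
        }

        public static void main(String[] args) {
            System.out.println(inRanges(0x301)); // inside 0x300 - 0x34e
            System.out.println(inRanges(0x34f)); // CGJ, deliberately left out of the table
        }
    }
    ```

    Single characters are stored as ranges with equal start and end, so the gap at 0x34f falls out naturally from the two neighbouring rows.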

  • 2020-12-18 03:34

    These are all the ranges of Unicode code points whose names contain the word 'combining' (e.g. 301 COMBINING ACUTE ACCENT):

    300-36F
    483-489
    7EB-7F3
    135F-135F
    1A7F-1A7F
    1B6B-1B73
    1DC0-1DE6
    1DFD-1DFF
    20D0-20F0
    2CEF-2CF1
    2DE0-2DFF
    3099-309A
    A66F-A672
    A67C-A67D
    A6F0-A6F1
    A8E0-A8F1
    FE20-FE26
    101FD-101FD
    1D165-1D169
    1D16D-1D172
    1D17B-1D182
    1D185-1D18B
    1D1AA-1D1AD
    1D242-1D244

    I compiled this list with a Python script, making use of the unicodedata module. I don't know what version of Unicode this is exactly, but I think it's reasonably up to date.
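
    The script itself isn't shown here, so the following is not a port of it, just a guess at the same idea in Java (the language used elsewhere in this thread). It relies on `Character.getName`, and `CombiningNames` and `ranges` are names invented for the example. The result depends on the Unicode version of the JDK it runs on, so newer runtimes may print more ranges than the list above.

    ```java
    import java.util.ArrayList;
    import java.util.List;

    class CombiningNames {
        // Collects maximal runs of code points whose name contains "COMBINING",
        // formatted as START-END in upper-case hex.
        static List<String> ranges() {
            List<String> out = new ArrayList<>();
            int start = -1;
            // Go one past the last code point so that an open run gets closed.
            for (int cp = 0; cp <= Character.MAX_CODE_POINT + 1; cp++) {
                // getName returns null for unassigned code points.
                String name = cp <= Character.MAX_CODE_POINT ? Character.getName(cp) : null;
                boolean hit = name != null && name.contains("COMBINING");
                if (hit && start < 0) {
                    start = cp;
                } else if (!hit && start >= 0) {
                    out.add(String.format("%X-%X", start, cp - 1));
                    start = -1;
                }
            }
            return out;
        }

        public static void main(String[] args) {
            ranges().forEach(System.out::println);
        }
    }
    ```

    Like the list above, a pure name match includes 34F (COMBINING GRAPHEME JOINER), because the test looks only at the name and not at any character property.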

    However, I don't know if you're done with characters that are 'combining' in the strict sense, as there are also 'modifier letters' and the like in Unicode.
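
    To make that distinction concrete: the Unicode general category separates marks (Mn, Mc, Me) from modifier letters (Lm), and Java exposes it through `Character.getType`. A small sketch, where `MarkCategory`, `isMark` and `isModifierLetter` are hypothetical helper names:

    ```java
    class MarkCategory {
        // True for general categories Mn, Mc and Me, i.e. what \p{M} matches in a regex.
        static boolean isMark(int codePoint) {
            int type = Character.getType(codePoint);
            return type == Character.NON_SPACING_MARK
                || type == Character.COMBINING_SPACING_MARK
                || type == Character.ENCLOSING_MARK;
        }

        // True for general category Lm (modifier letters), which are letters, not marks.
        static boolean isModifierLetter(int codePoint) {
            return Character.getType(codePoint) == Character.MODIFIER_LETTER;
        }

        public static void main(String[] args) {
            System.out.println(isMark(0x301));           // COMBINING ACUTE ACCENT
            System.out.println(isModifierLetter(0x2B0)); // MODIFIER LETTER SMALL H
        }
    }
    ```

    A category test is not the same thing as a name test: for instance, 903 DEVANAGARI SIGN VISARGA is a mark (Mc) without 'combining' in its name, so it shows up here but not in the list above.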
