How to classify Japanese characters as either kanji or kana?

既然无缘 2021-02-01 22:57

Given the text below, how can I classify each character as kana or kanji?

誰か確認上記これらのフ

I would like to get something like this:

誰 - kanji
か - kana
確 - kanji
認 - kanji


        
5 answers
  •  死守一世寂寞
    2021-02-01 23:41

    This seems like it'd be an interesting use for Guava's CharMatcher class. Using the tables linked in Jack's answer, I created this:

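    import com.google.common.base.CharMatcher;

    public class JapaneseCharMatchers {
      // Hiragana block: U+3040..U+309F
      public static final CharMatcher HIRAGANA =
          CharMatcher.inRange((char) 0x3040, (char) 0x309f);

      // Katakana block: U+30A0..U+30FF
      public static final CharMatcher KATAKANA =
          CharMatcher.inRange((char) 0x30a0, (char) 0x30ff);

      // Matches a character from either kana block
      public static final CharMatcher KANA = HIRAGANA.or(KATAKANA);

      // Common kanji, within the CJK Unified Ideographs block (which runs to U+9FFF)
      public static final CharMatcher KANJI =
          CharMatcher.inRange((char) 0x4e00, (char) 0x9faf);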
    
      public static void main(String[] args) {
        test("誰か確認上記これらのフ");
      }
    
      private static void test(String string) {
        System.out.println(string);
        System.out.println("Hiragana: " + HIRAGANA.retainFrom(string));
        System.out.println("Katakana: " + KATAKANA.retainFrom(string));
        System.out.println("Kana: " + KANA.retainFrom(string));
        System.out.println("Kanji: " + KANJI.retainFrom(string));
      }
    }
    

    Running this prints the expected output:

    誰か確認上記これらのフ

    Hiragana: かこれらの

    Katakana: フ

    Kana: かこれらのフ

    Kanji: 誰確認上記

    Defining the rules for each character group as a CharMatcher gives you a lot of power for working with Japanese text: the matcher can do many useful things by itself, and it can also be passed to other APIs such as Guava's Splitter class.
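
    For the one-label-per-character output the question asks for, the same ranges can also be checked directly. The following is a minimal sketch in plain Java with no Guava dependency; the class KanaKanjiClassifier and its methods are hypothetical names introduced here, and anything outside the three ranges is labelled "other":

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Guava: classifies one UTF-16 char at a time
// using the same ranges as the matchers above.
final class KanaKanjiClassifier {

  // Returns "kana", "kanji", or "other" for a single char.
  static String classify(char c) {
    if (c >= 0x3040 && c <= 0x30ff) {
      return "kana";   // Hiragana (U+3040..U+309F) or Katakana (U+30A0..U+30FF)
    }
    if (c >= 0x4e00 && c <= 0x9faf) {
      return "kanji";  // same range as the KANJI matcher
    }
    return "other";
  }

  // Builds one "<char> - <label>" line per char of the input.
  static List<String> describe(String text) {
    List<String> lines = new ArrayList<>();
    for (char c : text.toCharArray()) {
      lines.add(c + " - " + classify(c));
    }
    return lines;
  }

  public static void main(String[] args) {
    for (String line : describe("誰か確認上記これらのフ")) {
      System.out.println(line);
    }
  }
}
```

    For the sample text this should print one line per character, starting with "誰 - kanji" and "か - kana".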

    Edit:

    Based on jleedev's answer, you could also write a method like:

    // Builds a matcher that accepts any char belonging to the given Unicode block
    public static CharMatcher inUnicodeBlock(final Character.UnicodeBlock block) {
      return new CharMatcher() {
        @Override
        public boolean matches(char c) {
          return Character.UnicodeBlock.of(c) == block;
        }
      };
    }
    

    and use it like:

    CharMatcher HIRAGANA = inUnicodeBlock(Character.UnicodeBlock.HIRAGANA);
    

    I think this might be a bit slower than the range-based version, though.
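
    Both versions above test individual char values, so kanji outside the Basic Multilingual Plane (for example, those in CJK Extension B) would not be matched. If that matters, one possible alternative is to iterate over code points and ask the JDK for the character's script. This is only a sketch, assuming Java 8 or later; the class ScriptClassifier and its methods are hypothetical names, while Character.UnicodeScript is the standard JDK enum:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: labels each code point by its Unicode script.
final class ScriptClassifier {

  // Returns "kana", "kanji", or "other" for a single code point.
  static String classify(int codePoint) {
    Character.UnicodeScript script = Character.UnicodeScript.of(codePoint);
    switch (script) {
      case HIRAGANA:
      case KATAKANA:
        return "kana";
      case HAN:
        return "kanji";  // HAN also covers the CJK extension blocks
      default:
        return "other";
    }
  }

  // Builds one "<character> - <label>" line per code point of the input.
  static List<String> describe(String text) {
    List<String> lines = new ArrayList<>();
    text.codePoints().forEach(cp ->
        lines.add(new String(Character.toChars(cp)) + " - " + classify(cp)));
    return lines;
  }

  public static void main(String[] args) {
    describe("誰か確認上記これらのフ").forEach(System.out::println);
  }
}
```

    Note that some characters common in Japanese text, such as punctuation, do not belong to any of these three scripts and end up labelled "other".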
