Given the text below, how can I classify each character as kana or kanji?
誰か確認上記これらのフ
To get some thing like this
誰 - kanji
か - kana
確 - kanji
認
This seems like it'd be an interesting use for Guava's CharMatcher class. Using the tables linked in Jack's answer, I created this:
public class JapaneseCharMatchers {
public static final CharMatcher HIRAGANA =
CharMatcher.inRange((char) 0x3040, (char) 0x309f);
public static final CharMatcher KATAKANA =
CharMatcher.inRange((char) 0x30a0, (char) 0x30ff);
public static final CharMatcher KANA = HIRAGANA.or(KATAKANA);
public static final CharMatcher KANJI =
CharMatcher.inRange((char) 0x4e00, (char) 0x9faf);
public static void main(String[] args) {
test("誰か確認上記これらのフ");
}
private static void test(String string) {
System.out.println(string);
System.out.println("Hiragana: " + HIRAGANA.retainFrom(string));
System.out.println("Katakana: " + KATAKANA.retainFrom(string));
System.out.println("Kana: " + KANA.retainFrom(string));
System.out.println("Kanji: " + KANJI.retainFrom(string));
}
}
Running this prints the expected:
誰か確認上記これらのフ
Hiragana: かこれらの
Katakana: フ
Kana: かこれらのフ
Kanji: 誰確認上記
This gives you a lot of power for working with Japanese text by defining the rules for determining if a character is in one of these groups in an object that can not only do a lot of useful things itself, but can also be used with other APIs such as Guava's Splitter
class.
Edit:
Based on jleedev's answer, you could also write a method like:
public static CharMatcher inUnicodeBlock(final Character.UnicodeBlock block) {
return new CharMatcher() {
public boolean matches(char c) {
return Character.UnicodeBlock.of(c) == block;
}
};
}
and use it like:
CharMatcher HIRAGANA = inUnicodeBlock(Character.UnicodeBlock.HIRAGANA);
I think this might be a bit slower than the other version though.