How to classify Japanese characters as either kanji or kana?

Given the text below, how can I classify each character as kana or kanji?

誰か確認上記これらのフ

To get something like this:

誰 - kanji
か - kana
確 - kanji
認


                      
5 answers

一生所求, 2021-02-01 23:38

I know you didn't ask for VBA, but here is the VBA flavor for those who want to know:

Here's a function that will do it. It will break down the sentence like you have above into a single cell. You might need to add some error checking for how you want to deal with line breaks or English characters, etc. but this should be a good start.

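Function KanjiKanaBreakdown(ByVal text As String) As String
    ' Classifies each character by its Unicode code point, so the result
    ' does not depend on the system code page (as Asc does).
    Dim result As String
    Dim ch As String
    Dim code As Long
    Dim i As Long

    For i = 1 To Len(text)
        ch = Mid$(text, i, 1)
        ' AscW returns a signed Integer; mask it to get a value in 0..65535.
        code = AscW(ch) And &HFFFF&
        If code >= &H4E00& And code <= &H9FFF& Then
            ' CJK Unified Ideographs
            result = result & ch & " - kanji" & vbLf
        ElseIf code >= &H3040& And code <= &H30FF& Then
            ' Hiragana and Katakana blocks
            result = result & ch & " - kana" & vbLf
        Else
            ' Line breaks, Latin letters, punctuation, and so on
            result = result & ch & " - other" & vbLf
        End If
    Next i

    KanjiKanaBreakdown = result
End Function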

                                                                        
                                                        
            
            
              
                
攒了一身酷, 2021-02-01 23:41

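Use a Unicode code chart to determine which code point ranges are used for hiragana, katakana, and kanji. Then you can simply cast the character to an int and check which range it falls in, something like

int val = (int) 'て';  // numeric value of the character
if (val >= 0x3040 && val <= 0x309f)
  return HIRAGANA;     // Hiragana block
if (val >= 0x30a0 && val <= 0x30ff)
  return KATAKANA;     // Katakana block
if (val >= 0x4e00 && val <= 0x9fff)
  return KANJI;        // CJK Unified Ideographs block
..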
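
As a rough, self-contained sketch of the range-check idea above, the three ranges can be wrapped in a small Java class that produces the breakdown the question asks for. The class name `KanaKanjiRanges`, the method names, and the "other" label are illustrative assumptions, not part of the answer. It only covers the three basic blocks, so rarer kanji in the extension blocks would fall through to "other".

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class; names and labels are illustrative only.
class KanaKanjiRanges {

    // Returns "kanji", "kana", or "other" based on hard-coded Unicode ranges.
    static String classify(char c) {
        if (c >= 0x3040 && c <= 0x309F) return "kana";   // Hiragana block
        if (c >= 0x30A0 && c <= 0x30FF) return "kana";   // Katakana block
        if (c >= 0x4E00 && c <= 0x9FFF) return "kanji";  // CJK Unified Ideographs
        return "other";                                  // anything else
    }

    // Builds one "<char> - <label>" line per character of the input.
    static List<String> breakdown(String text) {
        List<String> lines = new ArrayList<>();
        for (char c : text.toCharArray()) {
            lines.add(c + " - " + classify(c));
        }
        return lines;
    }

    public static void main(String[] args) {
        breakdown("誰か確認上記これらのフ").forEach(System.out::println);
    }
}
```

Running `main` prints one line per character, starting with `誰 - kanji` and `か - kana`.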

                                                                        
                                                        
            
            
              
                
死守一世寂寞, 2021-02-01 23:41

This seems like it'd be an interesting use for Guava's CharMatcher class. Using the tables linked in Jack's answer, I created this:
import com.google.common.base.CharMatcher;

public class JapaneseCharMatchers {
  // Hiragana block: U+3040..U+309F
  public static final CharMatcher HIRAGANA =
      CharMatcher.inRange((char) 0x3040, (char) 0x309f);

  // Katakana block: U+30A0..U+30FF
  public static final CharMatcher KATAKANA =
      CharMatcher.inRange((char) 0x30a0, (char) 0x30ff);

  public static final CharMatcher KANA = HIRAGANA.or(KATAKANA);

  // CJK Unified Ideographs block: U+4E00..U+9FFF
  public static final CharMatcher KANJI =
      CharMatcher.inRange((char) 0x4e00, (char) 0x9fff);

  public static void main(String[] args) {
    test("誰か確認上記これらのフ");
  }

  private static void test(String string) {
    System.out.println(string);
    System.out.println("Hiragana: " + HIRAGANA.retainFrom(string));
    System.out.println("Katakana: " + KATAKANA.retainFrom(string));
    System.out.println("Kana: " + KANA.retainFrom(string));
    System.out.println("Kanji: " + KANJI.retainFrom(string));
  }
}

Running this prints the expected:

誰か確認上記これらのフ
Hiragana: かこれらの
Katakana: フ
Kana: かこれらのフ
Kanji: 誰確認上記

This gives you a lot of power for working with Japanese text: the rules for deciding whether a character belongs to one of these groups live in an object that can do a lot of useful things itself and can also be passed to other APIs, such as Guava's Splitter class.
Edit:
Based on jleedev's answer, you could also write a method like:
public static CharMatcher inUnicodeBlock(final Character.UnicodeBlock block) {
  return new CharMatcher() {
    @Override
    public boolean matches(char c) {
      // Look up the Unicode block of c and compare it with the requested one.
      return Character.UnicodeBlock.of(c) == block;
    }
  };
}

and use it like:
CharMatcher HIRAGANA = inUnicodeBlock(Character.UnicodeBlock.HIRAGANA);

I think this might be a bit slower than the other version though.
                                                                        
                                                        
            
            
              
                
南笙, 2021-02-01 23:50

This functionality is built into the Character.UnicodeBlock class. Some examples of the Unicode blocks related to the Japanese language:

Character.UnicodeBlock.of('誰') == CJK_UNIFIED_IDEOGRAPHS
Character.UnicodeBlock.of('か') == HIRAGANA
Character.UnicodeBlock.of('フ') == KATAKANA
Character.UnicodeBlock.of('ﾌ') == HALFWIDTH_AND_FULLWIDTH_FORMS
Character.UnicodeBlock.of('！') == HALFWIDTH_AND_FULLWIDTH_FORMS
Character.UnicodeBlock.of('。') == CJK_SYMBOLS_AND_PUNCTUATION


But, as always, the devil is in the details:

Character.UnicodeBlock.of('Ａ') == HALFWIDTH_AND_FULLWIDTH_FORMS


where Ａ is the full-width Latin letter, so it falls in the same block as the half-width katakana ﾌ above. Note that the full-width Ａ is a different character from the ordinary (half-width) A:

Character.UnicodeBlock.of('A') == BASIC_LATIN
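
One possible way to turn these block lookups into a classifier is sketched below. The class `JapaneseBlocks`, its `Kind` enum, and the decision to count half-width katakana as katakana are assumptions made for illustration. Because HALFWIDTH_AND_FULLWIDTH_FORMS mixes several scripts, the sketch narrows that block to U+FF66..U+FF9F, the range of half-width katakana letters and sound marks, and treats everything else in it as "other".

```java
import java.lang.Character.UnicodeBlock;

// Hypothetical classifier built on Character.UnicodeBlock; names are illustrative.
class JapaneseBlocks {
    enum Kind { KANJI, HIRAGANA, KATAKANA, OTHER }

    static Kind classify(int codePoint) {
        UnicodeBlock block = UnicodeBlock.of(codePoint);
        if (block == UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS) return Kind.KANJI;
        if (block == UnicodeBlock.HIRAGANA) return Kind.HIRAGANA;
        if (block == UnicodeBlock.KATAKANA) return Kind.KATAKANA;
        // This block also holds full-width Latin letters and punctuation,
        // so only the half-width katakana sub-range is accepted.
        if (block == UnicodeBlock.HALFWIDTH_AND_FULLWIDTH_FORMS
                && codePoint >= 0xFF66 && codePoint <= 0xFF9F) {
            return Kind.KATAKANA;
        }
        return Kind.OTHER;
    }

    public static void main(String[] args) {
        // Iterate by code point rather than by char so that characters
        // outside the Basic Multilingual Plane are not split in two.
        "誰かフﾌＡA".codePoints().forEach(cp ->
            System.out.println(new String(Character.toChars(cp)) + " - " + classify(cp)));
    }
}
```

Whether half-width katakana should count as kana, and whether the CJK extension blocks should count as kanji, depends on the application; both choices here are only one reasonable default.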

                                                                        
                                                        
            
            
              
                
悲&欢浪女, 2021-02-01 23:53

You need a reference that gives the separate code point ranges for kana and kanji characters. From what I've seen, each alphabet or comparable script typically gets its own contiguous block of characters.
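
In Java, one such reference ships with the standard library: `Character.UnicodeScript` reports the script of a code point, so no hand-copied ranges are needed. This swaps script lookup in for the block ranges the answer has in mind, and the class name `JapaneseScripts` and the three labels are illustrative assumptions. Treat it as a sketch rather than a complete solution.

```java
import java.lang.Character.UnicodeScript;

// Hypothetical classifier built on Character.UnicodeScript; names are illustrative.
class JapaneseScripts {

    // Maps the script of a code point to "kanji", "kana", or "other".
    static String classify(int codePoint) {
        UnicodeScript script = UnicodeScript.of(codePoint);
        switch (script) {
            case HAN:
                return "kanji";
            case HIRAGANA:
            case KATAKANA:
                return "kana";
            default:
                return "other";
        }
    }

    public static void main(String[] args) {
        // One "<char> - <label>" line per code point of the sample text.
        "誰か確認上記これらのフ".codePoints().forEach(cp ->
            System.out.println(new String(Character.toChars(cp)) + " - " + classify(cp)));
    }
}
```

Script lookup and block lookup do not always agree, because some characters inside the kana blocks are shared between scripts, so it is worth checking the characters that matter to you before relying on either one.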
                                                                        
                                                        
            
            
              
                