Regex word matching for Chinese and Japanese characters


I know the pattern for detecting whether a string consists of Chinese characters, but that's not what I need. I need to check whether the characters are found in a string.

cons


                      
2 answers

南方客 (2021-01-20 17:51)

Read the documentation for word boundaries.


  A word boundary matches the position between a word character followed by a non-word character, or between a non-word character followed by a word character.


where "word character" is something that matches \w (in JavaScript, the ASCII letters and digits plus the underscore), and "non-word character" is something that matches \W.

Note that all Chinese characters, in the sense that we usually think of them, are considered "non-word characters" as relates to the definition of word boundaries in JavaScript regular expressions. In other words, there is no word boundary between 一 and 个, because both are non-word characters; similarly, there is no word boundary between 一个 and 测试, because both 个 and 测 are non-word characters.
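A quick way to check this behaviour is the minimal sketch below. The helper name hasWordBoundaryAfter is made up for illustration, and it assumes the word contains no regex metacharacters.

```javascript
// In JavaScript, \w covers only [A-Za-z0-9_], so CJK characters count as \W.
const sentence = "你说到这是一个测试";

// Hypothetical helper: does `word` occur immediately before a \b boundary?
// Assumes `word` contains no regex metacharacters.
function hasWordBoundaryAfter(text, word) {
  return new RegExp(word + "\\b").test(text);
}

console.log(/\w/.test("个"));                                 // false
console.log(hasWordBoundaryAfter(sentence, "一个"));           // false: 个 and 测 are both \W
console.log(hasWordBoundaryAfter(sentence, "测试"));           // false: 试 is \W, even at end of string
console.log(hasWordBoundaryAfter(sentence + "x", "测试"));     // true: boundary between 试 and x
console.log(hasWordBoundaryAfter("a test, really", "test"));  // true
```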

With regard to languages such as Chinese and Japanese, which are generally written without spaces between words, there is not even a single clear definition of what the concept of "word" means, and therefore no concept of "word character" or "word boundary". There are libraries which people have worked on for years, involving machine learning, to try to break text into meaningful word-like segments, and they all do it in a slightly different way. The relevant question here is why you think you want to break the Chinese into what you are thinking of as "words" (or find strings which occur right before "word boundaries"). What is the point of your \b that is forcing the match to occur right before a word boundary? What case are you trying to exclude?
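As an aside on the segmentation point: one option that needs no external library is the built-in Intl.Segmenter. It is not mentioned in the original answer, and the sketch below assumes a runtime that ships it. The helper name segmentWords is hypothetical, and the exact split depends on the engine's dictionary data, so no particular output is shown.

```javascript
// Intl.Segmenter performs dictionary-based segmentation; the result depends
// on the engine's ICU data, so no particular split is guaranteed.
// Hypothetical helper: split `text` into word-like segments.
function segmentWords(text, locale = "zh") {
  const segmenter = new Intl.Segmenter(locale, { granularity: "word" });
  return Array.from(segmenter.segment(text), (part) => part.segment);
}

const parts = segmentWords("你说到这是一个测试");
console.log(parts);
```

Whatever the split turns out to be, joining the segments gives back the original string, which is about the only property that can be relied on across engines.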

Using Unicode regexp properties

However, you may be able to use the new Unicode regexp character class escapes in ECMAScript 2018 (http://2ality.com/2017/07/regexp-unicode-property-escapes.html). For instance, to match Chinese strings occurring before something that doesn't look like a Chinese character (or any letter), you could use

// Inside a template literal the backslash must be doubled to reach the regex.
new RegExp(`${word}(?=$|\\P{Letter})`, "u")


Roughly speaking, this translates into "find the word, but only if it is followed by (using look-ahead, the (?= part) either end-of-string ($) or a character which does not have the Unicode property Letter". The "u" flag enables Unicode processing.

Of course, this will not help you find 一个 as a "word" inside 你说到这是一个测试, because the following character 测 falls into the Unicode class "Letter", and so will not match \P{Letter}.
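The look-ahead approach can be sketched as follows, assuming an engine that supports ES2018 property escapes. The helper names escapeRegExp and matchesBeforeNonLetter are hypothetical, not part of any library.

```javascript
// Escape regex metacharacters so that `word` is matched literally.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Hypothetical helper: true if `word` occurs in `text` and is followed by
// end-of-string or by a character that is not a Unicode letter.
function matchesBeforeNonLetter(text, word) {
  const pattern = new RegExp(`${escapeRegExp(word)}(?=$|\\P{Letter})`, "u");
  return pattern.test(text);
}

console.log(matchesBeforeNonLetter("这是一个测试", "测试"));   // true: end of string
console.log(matchesBeforeNonLetter("这是一个测试。", "测试")); // true: 。 is punctuation
console.log(matchesBeforeNonLetter("这是一个测试", "一个"));   // false: 测 is a letter
```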

By the way, to match any "non-word" symbol in Unicode, you can use the following character class (it needs the "u" flag):

[^\p{Alphabetic}\p{Mark}\p{Decimal_Number}\p{Connector_Punctuation}\p{Join_Control}]
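A short sketch of that class in use; the constant name unicodeNonWord is only illustrative.

```javascript
// The "u" flag is required for \p{...} property escapes.
const unicodeNonWord =
  /[^\p{Alphabetic}\p{Mark}\p{Decimal_Number}\p{Connector_Punctuation}\p{Join_Control}]/u;

console.log(unicodeNonWord.test("测"));  // false: Alphabetic
console.log(unicodeNonWord.test("é"));   // false: Alphabetic
console.log(unicodeNonWord.test("_"));   // false: Connector_Punctuation
console.log(unicodeNonWord.test("。"));  // true: punctuation
console.log(unicodeNonWord.test(" "));   // true: whitespace
```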

                                                                        
                                                        
            
            
              
                
梦谈多话 (2021-01-20 17:58)

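\b only matches at a boundary between word and non-word characters. In engines whose \w is Unicode-aware (Python 3, for example), the entire '你说到这是一个测试' is considered a single word, so '一个' followed by \b won't match '你说到这是一个测试', since '一个' is not at a word boundary of that string; '测试', on the other hand, will match. In JavaScript, \w is ASCII-only, so every character of that string is a non-word character and \b matches nowhere in it, which means neither pattern matches. Either way, for Chinese words, a simple substring match is usually enough.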
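The substring check can be sketched like this; the helper name containsWord is made up for illustration.

```javascript
// Hypothetical helper: plain substring test, no regex and no word boundaries.
function containsWord(text, word) {
  return text.includes(word);
}

const sentence = "你说到这是一个测试";

console.log(containsWord(sentence, "一个")); // true
console.log(containsWord(sentence, "测试")); // true
console.log(containsWord(sentence, "考试")); // false
console.log(sentence.indexOf("一个"));       // 5
```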
                                                                        
                                                        
            
            
              
                