Regex word matching for Chinese and Japanese characters

名媛妹妹 2021-01-20 17:30

I know the pattern to detect whether a string consists of Chinese characters, but that's not what I need. I need to check whether the characters are found in a string.

cons         


        
2 answers
  • 2021-01-20 17:51

    Read the documentation for word boundaries.

    A word boundary matches the position between a word character followed by a non-word character, or between a non-word character followed by a word character.

    where "word character" is something that matches \w (in JavaScript, only the ASCII letters, digits, and the underscore: [A-Za-z0-9_]), and "non-word character" is something that matches \W.

    Note that all Chinese characters, in the sense that we usually think of them, are considered "non-word characters" as relates to the definition of word boundaries in JavaScript regular expressions. In other words, there is no word boundary between 一 and 个, because both are non-word characters; similarly, there is no word boundary between 一个 and 测试, because both 个 and 测 are non-word characters.
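
The behavior described above can be checked directly. This is a minimal sketch; the variable names `sample` and `boundaryChecks` are illustrative only and do not come from the original question.

```javascript
// In JavaScript, \w is ASCII-only, so Chinese characters count as non-word characters.
const sample = "你说到这是一个测试";

const boundaryChecks = {
  // A Chinese character is not matched by \w.
  hanIsWordChar: /\w/.test("个"),
  // No boundary between 个 and 测: both are non-word characters.
  boundaryInsideText: /一个\b/.test(sample),
  // No boundary at the end either: 试 is non-word, and so is end-of-string.
  boundaryAtEnd: /测试\b/.test(sample),
  // A boundary does exist between 个 (non-word) and an ASCII letter (word).
  boundaryBeforeAscii: /一个\b/.test("一个test"),
};

console.log(boundaryChecks);
```

Only the last check finds a boundary, because it is the only case where a non-word character sits next to a word character.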

    With regard to Japanese, Chinese, and Korean, which do not generally use spaces, there is not even a single clear definition of what the concept of "word" means, and therefore no concept of "word character" or "word boundary". There are libraries which people have worked on for years, involving machine learning, to try to break text into meaningful word-like segments, and they all do it in a slightly different way. The relevant question here is why you think you want to break the Chinese into what you are thinking of as "words" (or find strings which occur right before "word boundaries"). What is the point of your \b that is forcing the match to occur right before a word boundary? What case are you trying to exclude?
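
To see what such segmentation looks like in practice, here is a small sketch using the built-in `Intl.Segmenter` API rather than any of the libraries alluded to above. It assumes a runtime that ships `Intl.Segmenter` (recent Node.js versions and browsers do); the exact split depends on the runtime's dictionary data, so no particular output is claimed.

```javascript
// Assumption: the runtime supports Intl.Segmenter with word granularity.
const segmenter = new Intl.Segmenter("zh", { granularity: "word" });
const sentence = "你说到这是一个测试";

// Each segment is a word-like chunk chosen by the runtime's segmentation data.
const segments = Array.from(segmenter.segment(sentence), (s) => s.segment);

console.log(segments); // the split may differ between runtimes
```

Different engines and libraries may split the same sentence differently, which is exactly why "word boundary" is not well defined here.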

    Using Unicode regexp properties

    However, you may be able to use the new Unicode regexp character class escapes in ECMAScript 2018 (http://2ality.com/2017/07/regexp-unicode-property-escapes.html). For instance, to match Chinese strings occurring before something that doesn't look like a Chinese character (or any letter), you could use

    new RegExp(`${word}(?=$|\\P{Letter})`, "u")  // note the doubled backslash inside the template literal
    

    Roughly speaking, this translates into "find the word, but only if it is followed (using look-ahead, the (?=...) part) by either end-of-string ($) or a character which does not have the Unicode property Letter". The "u" flag enables Unicode processing.

    Of course, this will not help you find 一个 as a "word" inside 你说到这是一个测试, because the following character 测 falls into the Unicode class "Letter", and so will not match \P{Letter}.
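
The look-ahead approach and its limitation can be sketched as follows. The function name `matchesBeforeNonLetter` and the escaping step are additions for illustration, not part of the original answer; the escaping simply guards against regex metacharacters in the search word.

```javascript
// Hypothetical helper: true if `word` occurs in `text` right before
// end-of-string or a character that is not a Unicode Letter.
function matchesBeforeNonLetter(word, text) {
  // Escape regex metacharacters so the word is matched literally.
  const escaped = word.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
  // The backslash is doubled because this is a template literal, not a regex literal.
  const pattern = new RegExp(`${escaped}(?=$|\\P{Letter})`, "u");
  return pattern.test(text);
}

console.log(matchesBeforeNonLetter("测试", "你说到这是一个测试")); // true: followed by end-of-string
console.log(matchesBeforeNonLetter("一个", "你说到这是一个测试")); // false: followed by 测, a Letter
console.log(matchesBeforeNonLetter("一个", "这是一个,测试"));     // true: followed by punctuation
```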

    By the way, to match any "non-word" symbol in Unicode, you can use:

    [^\p{Alphabetic}\p{Mark}\p{Decimal_Number}\p{Connector_Punctuation}\p{Join_Control}]
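
As a quick check of this character class, the sketch below classifies a few sample characters. It needs the "u" flag for the \p{...} escapes to work; the names `nonWordChar` and `classify` are illustrative only.

```javascript
// The negated class from above; the "u" flag is required for \p{...}.
const nonWordChar = /[^\p{Alphabetic}\p{Mark}\p{Decimal_Number}\p{Connector_Punctuation}\p{Join_Control}]/u;

// Hypothetical helper that labels a single character.
const classify = (ch) => (nonWordChar.test(ch) ? "non-word" : "word");

const classified = ["测", "a", "5", "_", ",", " "].map((ch) => [ch, classify(ch)]);
console.log(classified);
```

Unlike \W, this class treats Chinese characters as word characters, while punctuation and spaces remain non-word.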
    
  • 2021-01-20 17:58

    \b only matches at a boundary between word and non-word characters. In an engine where \w covers Chinese characters, the entire '你说到这是一个测试' is treated as one word, so '一个' followed by \b won't match inside it, because '一个' is not at a word boundary there; '测试', on the other hand, will match, since it sits at the end. (In JavaScript, where \w is ASCII-only, the string contains no word boundary at all, so neither pattern matches.) For Chinese words, a simple substring match is usually enough.
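
A substring check of this kind needs no regex at all. A minimal sketch, where `haystack` and `containsAny` are hypothetical names chosen for illustration:

```javascript
const haystack = "你说到这是一个测试";

// Hypothetical helper: true if any of the given words occurs anywhere in the text.
const containsAny = (text, words) => words.some((word) => text.includes(word));

console.log(haystack.includes("一个"));               // true
console.log(haystack.includes("测试"));               // true
console.log(containsAny(haystack, ["日本", "测试"])); // true, because 测试 is present
console.log(containsAny(haystack, ["日本"]));         // false
```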
