Java - Regex to Split Tokens With Minimum Size and Delimiters

淺唱寂寞╮ submitted on 2019-12-13 01:43:40

Question


I know, I know: there are many similar questions, and I have read all of them. But I am not good at regex, and I couldn't figure out the regular expression that I need.

I want to split a String in Java, and I have 4 constraints:

  1. The delimiters are [.?!] (end of the sentence)
  2. Decimal numbers shouldn't be tokenized
  3. The delimiters shouldn't be removed.
  4. The minimum size of each token should be 5

For example, for input:

"Hello World! This answer worth $1.45 in U.S. dollar. Thank you."

The output should be:

[Hello World!, This answer worth $1.45 in U.S. dollar., Thank you.]

So far I have covered the first three constraints with this regex:

text.split("(?<=[.!?])(?<!\\d)(?!\\d)");

I know I should use {5,} somewhere in my regex, but none of the combinations I tried works.
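
To make the remaining problem concrete, here is a minimal sketch that runs the pattern above on the example input. The class name CurrentSplitDemo and the helper split are made up for illustration; only String.split and the regex itself come from the question.

```java
class CurrentSplitDemo {
    // The question's pattern: a zero-width split point right after '.', '!' or '?',
    // unless the next character is a digit (so "1.45" stays intact).
    static String[] split(String text) {
        return text.split("(?<=[.!?])(?<!\\d)(?!\\d)");
    }

    public static void main(String[] args) {
        String text = "Hello World! This answer worth $1.45 in U.S. dollar. Thank you.";
        for (String token : split(text)) {
            System.out.println("[" + token + "]");
        }
        // prints:
        // [Hello World!]
        // [ This answer worth $1.45 in U.]
        // [S.]
        // [ dollar.]
        // [ Thank you.]
    }
}
```

The decimal number survives and the delimiters are kept, but "S." comes out as its own token, which is exactly what the fourth constraint is meant to prevent.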

For cases like "I love U.S. How about you?" it doesn't matter whether I get one or two sentences, as long as S. is not tokenized as a separate sentence.
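
One possible way to get the minimum size without squeezing {5,} into the regex is to split with the pattern from the question and then glue any too-short piece onto the previous token. This is only a sketch under assumptions: the class MinLengthMerge, the method splitWithMinLength, and the choice to measure length after trimming whitespace are illustrative and not part of the question.

```java
import java.util.ArrayList;
import java.util.List;

class MinLengthMerge {
    // Split with the question's pattern, then append every piece shorter than
    // minLength (ignoring surrounding whitespace) to the token before it.
    // A short piece at the very start has nothing to merge into and is kept as is.
    static List<String> splitWithMinLength(String text, int minLength) {
        String[] raw = text.split("(?<=[.!?])(?<!\\d)(?!\\d)");
        List<String> tokens = new ArrayList<>();
        for (String piece : raw) {
            if (!tokens.isEmpty() && piece.trim().length() < minLength) {
                int last = tokens.size() - 1;
                tokens.set(last, tokens.get(last) + piece);
            } else {
                tokens.add(piece);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(splitWithMinLength("I love U.S. How about you?", 5));
        // prints "[I love U.S.,  How about you?]"
    }
}
```

For the longer example above, the same sketch keeps "U.S." intact but still puts " dollar." in its own token, because that piece is longer than 5 characters. A length check alone therefore does not reproduce the desired output.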

Finally, a pointer to a good regex tutorial would be appreciated.

UPDATE: As Chris mentioned in the comments, it is almost impossible to solve problems like this with regex alone (that is, to cover all the cases that occur in natural languages). However, I found HamZa's answer the closest and the most useful one.

So be careful! The accepted answer will not cover all possible use cases!


Answer 1:


I'm basing my answer on a regex I made previously.
That regex was basically (?<=[.?!])\s+(?=[a-z]), which matches one or more whitespace characters preceded by ., ? or ! and followed by a letter in [a-z] (not forgetting the i modifier, so uppercase letters count too).

Now let's modify it to the needs of this question:

  1. We'll first convert it to a Java regex (escaped as a string literal): (?<=[.?!])\\s+(?=[a-z])
  2. We'll add the i modifier to match case-insensitively: (?i)(?<=[.?!])\\s+(?=[a-z])
  3. We'll put the expression in a positive lookahead to prevent the "eating" of the characters (delimiters in this case): (?=(?i)(?<=[.?!])\\s+(?=[a-z]))
  4. We'll add a negative lookbehind to check that there is no abbreviation in the format LETTER DOT LETTER DOT: (?i)(?<=[.?!])(?<![a-z]\\.[a-z]\\.)\\s+(?=[a-z])

So our final regex looks like: (?i)(?<=[.?!])(?<![a-z]\\.[a-z]\\.)\\s+(?=[a-z])
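
As a quick check, the final regex can be tried on the inputs from the question. This is a minimal sketch; the class name AbbreviationAwareSplit, the constant SENTENCE_BREAK, and the helper split are hypothetical names chosen for the example.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

class AbbreviationAwareSplit {
    // Whitespace that follows '.', '?' or '!', is not directly preceded by
    // LETTER DOT LETTER DOT (e.g. "U.S."), and is followed by a letter.
    private static final Pattern SENTENCE_BREAK =
            Pattern.compile("(?i)(?<=[.?!])(?<![a-z]\\.[a-z]\\.)\\s+(?=[a-z])");

    static String[] split(String text) {
        return SENTENCE_BREAK.split(text);
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(
                split("Hello World! This answer worth $1.45 in U.S. dollar. Thank you.")));
        // prints "[Hello World!, This answer worth $1.45 in U.S. dollar., Thank you.]"

        System.out.println(Arrays.toString(split("I love U.S. How about you?")));
        // prints "[I love U.S. How about you?]"
    }
}
```

Because the pattern consumes the whitespace between sentences, the tokens carry no leading spaces, and the delimiters stay attached to the sentence before them. The regex does not check a minimum token length directly; it only avoids splitting after two-letter dotted abbreviations.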

Some links:

  • Online tester, jump to JAVA
  • Explain tool (Not JAVA based)
  • THE regex tutorial
  • Java specific regex tutorial
  • SO regex chatroom
  • Some advanced nice regex-fu on SO
    • How does this regex find triangular numbers?
    • How can we match a^n b^n with Java regex?
    • How does this Java regex detect palindromes?
    • How to determine if a number is a prime with regex?
    • "vertical" regex matching in an ASCII "image"
    • Can the for loop be eliminated from this piece of PHP code?
      ^-- See regex solution, although not sure if applicable in JAVA



Answer 2:


What about the following regular expression?

(?<=[.!?])(?!\w{1,5})(?<!\d)(?!\d)

e.g.

import java.util.Arrays;
import java.util.regex.Pattern;

public class SplitExample {
    private static final Pattern REGEX_PATTERN =
            Pattern.compile("(?<=[.!?])(?!\\w{1,5})(?<!\\d)(?!\\d)");

    public static void main(String[] args) {
        String input = "Hello World! This answer worth $1.45 in U.S. dollar. Thank you.";

        // Note: "U.S." and "dollar." still end up in separate tokens.
        System.out.println(Arrays.toString(
            REGEX_PATTERN.split(input)
        )); // prints "[Hello World!,  This answer worth $1.45 in U.S.,  dollar.,  Thank you.]"
    }
}
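
A note on the {1,5} quantifier: inside a negative lookahead, \w{1,5} already succeeds as soon as a single word character follows, so (?!\w{1,5}) should behave like (?!\w) and does not enforce a minimum token length. Likewise, (?<!\d) can never fail directly after [.!?], and (?!\d) is implied by (?!\w). The sketch below compares the two patterns; the class and member names are hypothetical.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

class LookaheadEquivalence {
    // The pattern from this answer.
    static final Pattern WITH_RANGE =
            Pattern.compile("(?<=[.!?])(?!\\w{1,5})(?<!\\d)(?!\\d)");
    // A shorter pattern that should split at exactly the same positions.
    static final Pattern SIMPLIFIED =
            Pattern.compile("(?<=[.!?])(?!\\w)");

    static boolean sameSplit(String text) {
        return Arrays.equals(WITH_RANGE.split(text), SIMPLIFIED.split(text));
    }

    public static void main(String[] args) {
        System.out.println(sameSplit(
                "Hello World! This answer worth $1.45 in U.S. dollar. Thank you."));
        // prints "true"

        // Short sentences are still split off, so no minimum size is enforced.
        System.out.println(Arrays.toString(WITH_RANGE.split("Hi. Ok.")));
        // prints "[Hi.,  Ok.]"
    }
}
```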


Source: https://stackoverflow.com/questions/18281206/java-regex-to-split-tokens-with-minimum-size-and-delimiters
