Question
I know there are many similar questions, and I have read all of them. But I am not good at regex, and I couldn't work out the regular expression I need.
I want to split a String in Java under four constraints:
- The delimiters are [.?!] (end of the sentence)
- Decimal numbers shouldn't be tokenized
- The delimiters shouldn't be removed.
- The minimum size of each token should be 5
For example, for input:
"Hello World! This answer worth $1.45 in U.S. dollar. Thank you."
The output will be:
[Hello World!, This answer worth $1.45 in U.S. dollar., Thank you.]
So far I have covered the first three constraints with this regex:
text.split("(?<=[.!?])(?<!\\d)(?!\\d)");
I know I should use {5,} somewhere in my regex, but no combination I have tried works.
For cases like "I love U.S. How about you?", it doesn't matter whether I get one sentence or two, as long as S. is not tokenized as a separate sentence.
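To make the problem concrete, here is a minimal sketch of what the pattern above does with the sample input. The class and method names are made up for illustration; only the pattern itself comes from the question.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; only the pattern string is taken from the question.
class CurrentPatternDemo {
    // Zero-width split point: right after . ! or ?, unless a digit follows.
    private static final Pattern CURRENT =
            Pattern.compile("(?<=[.!?])(?<!\\d)(?!\\d)");

    static String[] split(String text) {
        return CURRENT.split(text);
    }

    public static void main(String[] args) {
        String input = "Hello World! This answer worth $1.45 in U.S. dollar. Thank you.";
        for (String token : split(input)) {
            // trim() only tidies the display; the split itself keeps leading spaces.
            System.out.println("[" + token.trim() + "]");
        }
    }
}
```

The decimal in $1.45 survives, but the split also fires between "U." and "S.", so "S." comes out as a token of its own. That is the case the minimum-size constraint is meant to rule out.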
Finally, a pointer to a good regex tutorial would be appreciated.
UPDATE: As Chris mentioned in the comments, it is almost impossible to solve questions like this (covering all the cases that occur in natural languages) with regex. However, I found HamZa's answer the closest and the most useful one.
So be careful: the accepted answer will not cover all possible use cases!
Answer 1:
I'm basing my answer on a previously written regex.
The regex was basically (?<=[.?!])\s+(?=[a-z]), which matches one or more whitespace characters preceded by ., ? or ! and followed by [a-z] (not forgetting the i modifier).
Now let's modify it to the needs of this question:
- We'll first convert it to a Java regex (backslashes doubled for a string literal):
(?<=[.?!])\\s+(?=[a-z])
- We'll add the i modifier to match case-insensitively:
(?i)(?<=[.?!])\\s+(?=[a-z])
- We could put the expression in a positive lookahead to prevent the "eating" of characters:
(?=(?i)(?<=[.?!])\\s+(?=[a-z]))
The delimiters are already safe, though: they sit in the lookbehind, so only the whitespace is consumed, and the final regex leaves this wrapper out.
- We'll add a negative lookbehind to check that there is no abbreviation in the format LETTER DOT LETTER DOT:
(?i)(?<=[.?!])(?<![a-z]\\.[a-z]\\.)\\s+(?=[a-z])
So our final regex looks like this: (?i)(?<=[.?!])(?<![a-z]\\.[a-z]\\.)\\s+(?=[a-z])
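As a quick check, the final regex can be dropped into a small program like the sketch below. The class and method names are invented for illustration, and the two inputs are the ones from the question; this is not meant to cover natural-language edge cases beyond them.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

// Hypothetical demo class; the pattern is the final regex from this answer.
class AbbreviationAwareSplitDemo {
    // Split on whitespace that follows . ? or !, is followed by a letter,
    // and is not preceded by a LETTER DOT LETTER DOT abbreviation such as "U.S.".
    private static final Pattern SENTENCE_BREAK = Pattern.compile(
            "(?i)(?<=[.?!])(?<![a-z]\\.[a-z]\\.)\\s+(?=[a-z])");

    static String[] split(String text) {
        return SENTENCE_BREAK.split(text);
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(
                split("Hello World! This answer worth $1.45 in U.S. dollar. Thank you.")));
        // prints "[Hello World!, This answer worth $1.45 in U.S. dollar., Thank you.]"

        System.out.println(Arrays.toString(split("I love U.S. How about you?")));
        // prints "[I love U.S. How about you?]"
    }
}
```

Because the pattern consumes only the whitespace between sentences, the delimiters stay attached to their tokens. The second input comes back as a single sentence, which the question says is acceptable.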
Some links:
- Online tester, jump to JAVA
- Explain tool (Not JAVA based)
- THE regex tutorial
- Java specific regex tutorial
- SO regex chatroom
- Some advanced nice regex-fu on SO:
  - How does this regex find triangular numbers?
  - How can we match a^n b^n with Java regex?
  - How does this Java regex detect palindromes?
  - How to determine if a number is a prime with regex?
  - "vertical" regex matching in an ASCII "image"
  - Can the for loop be eliminated from this piece of PHP code? (See the regex solution, although I'm not sure it is applicable in Java.)
Answer 2:
What about the following regular expression?
(?<=[.!?])(?!\w{1,5})(?<!\d)(?!\d)
For example:
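import java.util.regex.Pattern;

public class SplitExample {
    // Zero-width split point: after . ! or ?, unless a word character (letter, digit, _) follows.
    private static final Pattern REGEX_PATTERN =
            Pattern.compile("(?<=[.!?])(?!\\w{1,5})(?<!\\d)(?!\\d)");

    public static void main(String[] args) {
        String input = "Hello World! This answer worth $1.45 in U.S. dollar. Thank you.";
        // The match is zero-width, so every token after the first keeps its leading space.
        System.out.println(java.util.Arrays.toString(
                REGEX_PATTERN.split(input)
        )); // prints "[Hello World!,  This answer worth $1.45 in U.S.,  dollar.,  Thank you.]"
    }
}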
Source: https://stackoverflow.com/questions/18281206/java-regex-to-split-tokens-with-minimum-size-and-delimiters