I have to separate a line of text into words, and am confused on what regex to use. I have looked everywhere for a regex that matches a word and found ones similar to this
Using answer from WhirlWind
on the page stated in my comment you can do the following:
String candidate = "I \n"+
"like \n"+
"to "+
"eat "+
"but "+
"I "+
"don't "+
"like "+
"to "+
"eat "+
"everyone's "+
"food "+
"'' '''' '.' ' "+
"or "+
"they'll "+
"starv'e'";
String regex = "('\\w+)|(\\w+'\\w+)|(\\w+')|(\\w+)";
Matcher matcher = Pattern.compile(regex).matcher(candidate);
while (matcher.find()) {
System.out.println("> matched: `" + matcher.group() + "`");
}
It will print:
> matched: `I`
> matched: `like`
> matched: `to`
> matched: `eat`
> matched: `but`
> matched: `I`
> matched: `don't`
> matched: `like`
> matched: `to`
> matched: `eat`
> matched: `everyone's`
> matched: `food`
> matched: `or`
> matched: `they'll`
> matched: `starv'e`
You can find a running example here: http://ideone.com/pVOmSK
The following regex seems to cover your sample string correctly. But it doesn't cover you scenario for the apostrophe.
[\s,.?!"]+
Java Code:
String input = "I like to eat but I don't like to eat everyone's food, or they'll starve.";
String[] inputWords = input.split("[\\s,.?!]+");
If I understand correctly, the apostrophe should be left alone as long as it is after a word character. This next regex should cover the above plus the special case for the apostrophe.
(?<!\w)'|[\s,.?"!][\s,.?"'!]*
Java Code:
String input = "I like to eat but I don't like to eat everyone's food, or they'll starve.";
String[] inputWords = input.split("(?<!\\w)'|[\\s,.?\"!][\\s,.?\"'!]*");
If I run the second regex on the string: Hey there! Don't eat 'the mystery meat'.
I get the following words in my string array:
Hey
there
Don't
eat
the
mystery
meat'