Let's say you have a text file like this one: http://www.gutenberg.org/files/17921/17921-8.txt
Does anyone have a good algorithm, or open-source code, to extract wor
This sounds like the right job for regular expressions. Here is some Java code to give you an idea, in case you don't know how to start:
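import java.util.regex.Matcher;
import java.util.regex.Pattern;

String input = "Input text, with words, punctuation, etc. Well, it's rather short.";
// One or more word characters (letters, digits, underscore) or apostrophes
Pattern p = Pattern.compile("[\\w']+");
Matcher m = p.matcher(input);
while (m.find()) {
    // group() returns the text of the current match
    System.out.println(m.group());
}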
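The pattern [\w']+ matches one or more consecutive word characters (letters, digits, and the underscore) or apostrophes. The example string is therefore printed one word per line, with punctuation and whitespace dropped and contractions such as "it's" kept whole. Have a look at the Java Pattern class documentation to read more.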
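If you want the words collected rather than printed, the same idea can be wrapped in a small helper. This is only a sketch: the class name WordExtractor and the method extractWords are made up for illustration. It also assumes you want accented letters treated as part of a word. By default Java's \w only covers [a-zA-Z_0-9], so the sketch passes Pattern.UNICODE_CHARACTER_CLASS to widen it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, named here only for illustration.
public class WordExtractor {
    // Without the flag, \w matches only [a-zA-Z_0-9];
    // UNICODE_CHARACTER_CLASS extends it to Unicode letters and digits.
    private static final Pattern WORD =
            Pattern.compile("[\\w']+", Pattern.UNICODE_CHARACTER_CLASS);

    // Returns every match of the pattern, in order of appearance.
    public static List<String> extractWords(String text) {
        List<String> words = new ArrayList<>();
        Matcher m = WORD.matcher(text);
        while (m.find()) {
            words.add(m.group());
        }
        return words;
    }

    public static void main(String[] args) {
        // prints [Well, it's, rather, short]
        System.out.println(extractWords("Well, it's rather short."));
    }
}
```

For a whole file you would read its contents into a String first and pass that in. Keep in mind that digits and underscores count as word characters here, so tokens like "1905" come back as words unless you filter them out.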