Does anyone know of a Java library that handles finding sentence boundaries? I'm thinking that it would be a smart StringTokenizer implementation that knows about all of th
You want to look into the internationalized BreakIterator classes in java.text. BreakIterator.getSentenceInstance() is a good starting point for sentence boundaries.
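As a rough sketch of how that could look (the class name SentenceSplitter and the helper split are made up for illustration; only BreakIterator and its methods come from the JDK):

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper class, not part of any library.
class SentenceSplitter {

    // Splits text into sentences using the JDK's locale-aware sentence iterator.
    static List<String> split(String text, Locale locale) {
        BreakIterator it = BreakIterator.getSentenceInstance(locale);
        it.setText(text);

        List<String> sentences = new ArrayList<>();
        int start = it.first();
        // Each call to next() returns the offset of the next boundary, or DONE at the end.
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String sentence = text.substring(start, end).trim();
            if (!sentence.isEmpty()) {
                sentences.add(sentence);
            }
        }
        return sentences;
    }

    public static void main(String[] args) {
        for (String s : split("Hello world. This is a test.", Locale.US)) {
            System.out.println(s);
        }
    }
}
```

The boundaries come from built-in rules rather than from any understanding of the text, so results on abbreviations and similar cases can vary.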
You wrote:
I'm thinking that it would be a smart StringTokenizer implementation that knows about all of the sentence terminators that languages can use.
A basic problem here is that sentence terminators depend on context. Consider:
How did Dr. Jones compute 5! without recursion?
This should be recognized as a single sentence, but if you just split on possible sentence terminators you will get three sentences.
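To make the failure concrete, here is a minimal sketch of such a naive splitter. The class NaiveSplitDemo and the method naiveSplit are hypothetical names, and treating only '.', '!' and '?' as terminators is an assumption for the demo:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class showing why splitting on terminators alone is not enough.
class NaiveSplitDemo {

    // Cuts the text after every '.', '!' or '?', ignoring context entirely.
    static List<String> naiveSplit(String text) {
        List<String> parts = new ArrayList<>();
        // The lookbehind keeps the terminator attached to the preceding fragment.
        for (String piece : text.split("(?<=[.!?])")) {
            String trimmed = piece.trim();
            if (!trimmed.isEmpty()) {
                parts.add(trimmed);
            }
        }
        return parts;
    }

    public static void main(String[] args) {
        // Prints three fragments: "How did Dr.", "Jones compute 5!", "without recursion?"
        for (String part : naiveSplit("How did Dr. Jones compute 5! without recursion?")) {
            System.out.println(part);
        }
    }
}
```

The abbreviation "Dr." and the factorial "5!" each trigger a split, which is exactly the over-segmentation described above.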
So this is a more complex problem than it first appears. It can be approached using machine learning techniques. You could, for instance, look into the OpenNLP project, in particular the SentenceDetectorME class.