Java library that finds sentence boundaries

说谎 2020-12-17 00:29

Does anyone know of a Java library that handles finding sentence boundaries? I'm thinking that it would be a smart StringTokenizer implementation that knows about all of the sentence terminators that languages can use.

2 Answers
  • 2020-12-17 01:26

    Look into the internationalized BreakIterator classes (java.text.BreakIterator in the JDK). They are a good starting point for finding sentence boundaries.
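A minimal sketch of that approach, assuming the JDK's `java.text.BreakIterator` with its sentence instance. The class name `SentenceSplitter` and the `split` helper are hypothetical names chosen for illustration, not part of any library.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper class, for illustration only.
final class SentenceSplitter {

    // Splits text into sentences using the JDK's locale-aware sentence BreakIterator.
    static List<String> split(String text, Locale locale) {
        BreakIterator it = BreakIterator.getSentenceInstance(locale);
        it.setText(text);

        List<String> sentences = new ArrayList<>();
        int start = it.first();
        // Walk consecutive boundaries; each [start, end) range is one sentence.
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String sentence = text.substring(start, end).trim();
            if (!sentence.isEmpty()) {
                sentences.add(sentence);
            }
        }
        return sentences;
    }

    public static void main(String[] args) {
        String text = "This is one sentence. This is another one.";
        for (String sentence : split(text, Locale.US)) {
            System.out.println(sentence);
        }
    }
}
```

The iterator is rule-based, so it may still stumble on abbreviations and other context-dependent cases; it is worth trying on representative text before relying on it.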

  • 2020-12-17 01:34

    You wrote:

    I'm thinking that it would be a smart StringTokenizer implementation that knows about all of the sentence terminators that languages can use.

    A basic problem here is that sentence terminators depend on context. Consider:

    How did Dr. Jones compute 5! without recursion?

    This should be recognized as a single sentence, but if you just split on possible sentence terminators you will get three sentences.
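To make the failure concrete, here is a small sketch of the naive approach, assuming the terminator set is just `.`, `!` and `?`. The class name `NaiveSplitter` is hypothetical and only serves the illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical class illustrating the context-blind approach; not recommended for real use.
final class NaiveSplitter {

    // Cuts the text after every '.', '!' or '?', without looking at the context.
    static List<String> split(String text) {
        List<String> parts = new ArrayList<>();
        // The lookbehind keeps each terminator attached to the fragment it ends.
        for (String piece : text.split("(?<=[.!?])")) {
            String trimmed = piece.trim();
            if (!trimmed.isEmpty()) {
                parts.add(trimmed);
            }
        }
        return parts;
    }

    public static void main(String[] args) {
        // Prints three fragments: "How did Dr.", "Jones compute 5!", "without recursion?"
        for (String part : split("How did Dr. Jones compute 5! without recursion?")) {
            System.out.println(part);
        }
    }
}
```

The abbreviation "Dr." and the factorial "5!" are each treated as the end of a sentence, which is exactly the context problem described above.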

    So this is a more complex problem than it first appears. It can be approached with machine learning techniques. You could, for instance, look into the OpenNLP project, in particular its SentenceDetectorME class.
