Splitting text into sentences and sentences into words: BreakIterator vs. regular expressions

Submitted by 柔情痞子 on 2019-12-10 13:34:26

Question


I happened to answer a question where the original problem involved splitting a sentence into separate words.

The author suggested using BreakIterator to tokenize the input strings, and some people liked this idea.

I just don't get that madness: how can 25 lines of complicated code be better than a simple regex one-liner?

Please explain the advantages of BreakIterator and the real cases in which it should be used.

If it really is so cool and proper, then I wonder: do you actually use the BreakIterator approach in your projects?


Answer 1:


Judging by the code posted in that answer, BreakIterator takes the language and locale of the text into account. Getting that level of support with a regex would surely be a considerable pain. Perhaps that is the main reason it is preferred over a simple regex?
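To make the comparison concrete, here is a minimal sketch that puts a regex one-liner next to a locale-aware BreakIterator. The class name `LocaleSplitSketch` and the helper names are made up for illustration, and the `\W+` pattern is only one guess at what the "simple one-liner" might look like. How much the locale actually changes the boundaries depends on the rules shipped with your JDK.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

class LocaleSplitSketch {

    // Hypothetical regex one-liner: split on runs of non-word characters.
    // By default \W is ASCII-only in Java, so accented letters count as
    // separators unless Pattern.UNICODE_CHARACTER_CLASS is enabled.
    static List<String> regexWords(String text) {
        return Arrays.asList(text.split("\\W+"));
    }

    // Sentence splitting that uses the boundary rules for the given locale.
    static List<String> sentences(String text, Locale locale) {
        BreakIterator it = BreakIterator.getSentenceInstance(locale);
        it.setText(text);
        List<String> out = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            // Each segment runs from one boundary to the next; trim trailing spaces.
            out.add(text.substring(start, end).trim());
        }
        return out;
    }

    public static void main(String[] args) {
        String text = "Hello, world! How are you?";
        System.out.println(regexWords(text));
        System.out.println(sentences(text, Locale.ENGLISH));
    }
}
```

The regex version stays a one-liner only as long as the input is plain ASCII English. The BreakIterator version is longer, but switching languages means passing a different `Locale` rather than rewriting the pattern.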




Answer 2:


The BreakIterator gives you explicit control and iterates cleanly, in a nested way, over each sentence and each word. I'm not familiar with exactly what specifying the locale does for you, but I'm sure it's quite helpful sometimes as well.

It didn't strike me as complicated at all. Just set up one iterator for the sentence level and another for the word level, then nest the word iterator inside the sentence one.
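The nested setup described above could look roughly like this. It is a sketch rather than the code from the other question: the class name `NestedBreakIteratorSketch`, the method names, and the rule "keep only tokens that start with a letter or digit" are my own assumptions.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class NestedBreakIteratorSketch {

    // Outer loop: walk the sentence boundaries of the whole text.
    static List<List<String>> sentencesOfWords(String text, Locale locale) {
        List<List<String>> result = new ArrayList<>();
        BreakIterator sentences = BreakIterator.getSentenceInstance(locale);
        sentences.setText(text);
        int start = sentences.first();
        for (int end = sentences.next(); end != BreakIterator.DONE;
                start = end, end = sentences.next()) {
            // Inner loop: split the current sentence into words.
            result.add(words(text.substring(start, end), locale));
        }
        return result;
    }

    // Inner loop: walk the word boundaries of a single sentence.
    static List<String> words(String sentence, Locale locale) {
        List<String> out = new ArrayList<>();
        BreakIterator words = BreakIterator.getWordInstance(locale);
        words.setText(sentence);
        int start = words.first();
        for (int end = words.next(); end != BreakIterator.DONE;
                start = end, end = words.next()) {
            String token = sentence.substring(start, end);
            // The word iterator also reports whitespace and punctuation as
            // segments, so keep only tokens that start with a letter or digit
            // (an assumption about what counts as a "word").
            if (Character.isLetterOrDigit(token.codePointAt(0))) {
                out.add(token);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(sentencesOfWords("Hello world. How are you?", Locale.ENGLISH));
    }
}
```

Both loops follow the same `first()` / `next()` / `DONE` idiom, so once the outer loop is written the inner one is mostly a copy with a different factory method.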

If the problem had changed into something different, the solution you had on the other question might have gone straight out the window. The pattern of iterating through sentences and words, however, can handle a lot of tasks:

  1. Find the sentence in which any single word is repeated the most times, and output it along with that word.
  2. Find the word used the most times throughout the whole string.
  3. Find all words that occur in every sentence.
  4. Find all words that occur a prime number of times in 2 or more sentences.

The list goes on...
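As an illustration, tasks 2 and 3 from the list can be sketched on top of the same sentence/word iteration. The class name `WordStatsSketch`, the helper names, the lower-casing of words, and the tie-breaking rule (the earliest word wins) are assumptions of mine, not part of the original answer.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class WordStatsSketch {

    // Returns the segments between consecutive boundaries of the iterator.
    static List<String> split(BreakIterator it, String text) {
        List<String> parts = new ArrayList<>();
        it.setText(text);
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            parts.add(text.substring(start, end));
        }
        return parts;
    }

    // Lower-cased words of a text; whitespace and punctuation segments are dropped.
    static List<String> words(String text, Locale locale) {
        List<String> out = new ArrayList<>();
        for (String token : split(BreakIterator.getWordInstance(locale), text)) {
            if (Character.isLetterOrDigit(token.codePointAt(0))) {
                out.add(token.toLowerCase(locale));
            }
        }
        return out;
    }

    // Task 2: the word used most often in the whole string (null if there are no words).
    static String mostFrequentWord(String text, Locale locale) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String w : words(text, locale)) {
            counts.merge(w, 1, Integer::sum);
        }
        String best = null;
        int bestCount = 0;
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            // Strict comparison: on a tie, the word seen first is kept.
            if (e.getValue() > bestCount) {
                best = e.getKey();
                bestCount = e.getValue();
            }
        }
        return best;
    }

    // Task 3: the words that occur in every sentence, in sorted order.
    static Set<String> wordsInEverySentence(String text, Locale locale) {
        Set<String> common = null;
        for (String sentence : split(BreakIterator.getSentenceInstance(locale), text)) {
            Set<String> here = new TreeSet<>(words(sentence, locale));
            if (common == null) {
                common = here;
            } else {
                common.retainAll(here);
            }
        }
        return common == null ? new TreeSet<>() : common;
    }

    public static void main(String[] args) {
        String text = "The cat sat. The dog saw the cat. The bird ran from the cat.";
        System.out.println(mostFrequentWord(text, Locale.ENGLISH));
        System.out.println(wordsInEverySentence(text, Locale.ENGLISH));
    }
}
```

Only the bookkeeping inside the loops changes from task to task; the iteration over sentences and words stays the same.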



Source: https://stackoverflow.com/questions/4482469/splitting-text-to-sentences-and-sentence-to-words-breakiterator-vs-regular-expr
