I would approach the problem slightly differently. It's significant that "end" and "dependent" overlap, yet that information is lost in your word Map. Instead of a single word map, create a set of word maps, each one representing a possible segmentation of the column name into non-overlapping words. You can then compute a score for each segmentation based on the lengths and probabilities of its words. The score for a segmentation would be the average of the scores of its individual words, and the score for an individual word would be some function of its length (l) and probability (p), something like
score = a*l + b*p
where a and b are weights that you can tweak to get the right mix. Pick the segmentation with the highest average score. The scoring function doesn't have to be a linear weighting either; you could experiment with logarithmic, exponential, or higher-order terms (squares, for instance).
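As a rough sketch of the idea, here is one way to enumerate all non-overlapping segmentations recursively and keep the highest-scoring one. The dictionary, its probabilities, and the weights a and b are placeholder values for illustration, not part of your original setup:

```java
import java.util.*;

public class Segmenter {
    // Hypothetical word probabilities; substitute your own word Map here.
    static Map<String, Double> prob = Map.of(
            "end", 0.05, "dependent", 0.02, "date", 0.04, "depend", 0.03);

    // Tweakable weights for length (a) and probability (b).
    static double a = 0.5, b = 10.0;

    // Score for an individual word: a*l + b*p.
    static double wordScore(String w) {
        return a * w.length() + b * prob.getOrDefault(w, 0.0);
    }

    // Returns the best-scoring segmentation of s into dictionary words,
    // or null if no full segmentation exists.
    static List<String> best(String s) {
        if (s.isEmpty()) return new ArrayList<>();
        List<String> bestSeg = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (int i = 1; i <= s.length(); i++) {
            String head = s.substring(0, i);
            if (!prob.containsKey(head)) continue;        // head must be a known word
            List<String> tail = best(s.substring(i));     // segment the remainder
            if (tail == null) continue;                   // remainder not segmentable
            List<String> seg = new ArrayList<>();
            seg.add(head);
            seg.addAll(tail);
            // Segmentation score = average of the word scores.
            double score = seg.stream().mapToDouble(Segmenter::wordScore)
                              .average().orElse(0);
            if (score > bestScore) {
                bestScore = score;
                bestSeg = seg;
            }
        }
        return bestSeg;
    }

    public static void main(String[] args) {
        System.out.println(best("dependentdate")); // prefers "dependent" over "depend"
    }
}
```

Column names are short, so the naive recursion is fine; for longer strings you'd memoize `best` on the suffix to avoid re-segmenting the same tail repeatedly.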