Text segmentation: dictionary-based word splitting [closed]

温柔的废话 2020-12-30 15:58

3 answers

醉梦人生 (original poster)

2020-12-30 16:07

Your problem is a very common one in NLP. Don't start by reinventing the wheel: it will take you a long time and the result will not be as good as what is out there already.

You should certainly start by seeing what the NLP libraries have to offer: http://en.wikipedia.org/wiki/Natural_language_processing and http://en.wikipedia.org/wiki/Category:Natural_language_processing_toolkits. There are several different approaches, and you will need to explore which one suits your corpus.

Word-splitting routines are sometimes filed under hyphenation. Two possible approaches are n-grams (where the frequencies of, say, 4-character substrings are used to predict boundaries) and tries, which capture common starts or ends of words. Some of these may also help with misspellings.
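To make the trie idea concrete, here is a minimal sketch in Python. It is not taken from any particular library: the names `TrieNode`, `build_trie` and `segment` are made up for illustration, and it assumes you already have a word list and want the split that uses the fewest dictionary words. It does not handle misspellings.

```python
from typing import Dict, List, Optional


class TrieNode:
    """One node of a character trie (hypothetical helper for this sketch)."""

    def __init__(self) -> None:
        self.children: Dict[str, "TrieNode"] = {}
        self.is_word = False


def build_trie(words: List[str]) -> TrieNode:
    """Insert every dictionary word into a trie and return its root."""
    root = TrieNode()
    for word in words:
        node = root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_word = True
    return root


def segment(text: str, root: TrieNode) -> Optional[List[str]]:
    """Split text into dictionary words, preferring the fewest words.

    Returns None when no complete split exists.
    """
    n = len(text)
    # best[i] holds the fewest-word split of text[i:], or None if there is none.
    best: List[Optional[List[str]]] = [None] * (n + 1)
    best[n] = []
    for i in range(n - 1, -1, -1):
        node = root
        # Walk the trie along text[i:], stopping as soon as no word can match.
        for j in range(i, n):
            node = node.children.get(text[j])
            if node is None:
                break
            rest = best[j + 1]
            if node.is_word and rest is not None:
                candidate = [text[i:j + 1]] + rest
                current = best[i]
                if current is None or len(candidate) < len(current):
                    best[i] = candidate
    return best[0]


if __name__ == "__main__":
    trie = build_trie(["word", "words", "split", "splitting", "ting"])
    print(segment("wordsplitting", trie))
```

Preferring the fewest words is only one possible tie-breaker. With a real corpus you would probably score candidate splits by word frequency instead, which is where the n-gram statistics come in.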

But there is no trivial answer - find what works best for you.
