Better text documents clustering than tf/idf and cosine similarity?


走了就别回头了 2021-02-01 06:56

I'm trying to cluster the Twitter stream. I want to put each tweet into a cluster of tweets that talk about the same topic. I tried to cluster the stream using an online clustering algorit

3 answers

北恋 (OP)

2021-02-01 07:36

Long answer:

TFxIDF is currently one of the best-known weighting schemes for search. What you need is some preprocessing from Natural Language Processing (NLP). There are many resources that can help you with English (for example, the Python library 'nltk').

You must apply the NLP analysis both to your queries (questions) and to your documents before indexing.
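As a rough illustration of applying the same analysis to both sides, here is a minimal sketch using only the standard library. The `normalize` function, the tiny `STOP_WORDS` set, and the crude suffix stripping are all assumptions made for this example; in practice a real tokenizer, stemmer, or lemmatizer (for instance from 'nltk') would take their place.

```python
import re

# Hypothetical, deliberately tiny stop-word list, for illustration only.
STOP_WORDS = {"a", "an", "and", "are", "is", "of", "the", "to", "in", "on", "for"}


def normalize(text):
    """Lowercase, tokenize, drop stop words, and strip a few common suffixes."""
    tokens = re.findall(r"[a-z0-9#@']+", text.lower())
    result = []
    for tok in tokens:
        if tok in STOP_WORDS:
            continue
        # Crude suffix stripping; a real stemmer or lemmatizer would go here.
        for suffix in ("ing", "es", "s"):
            if tok.endswith(suffix) and len(tok) - len(suffix) >= 3:
                tok = tok[: -len(suffix)]
                break
        result.append(tok)
    return result


documents = ["The cats are chasing mice", "Stock markets falling today"]

# The same function is applied at indexing time and at query time,
# so that terms on both sides end up in the same normalized form.
index_terms = [normalize(d) for d in documents]
query_terms = normalize("cat chasing a mouse")
```

The point of the design is only that one function serves both the index and the query. Note that this naive stripping does not map "mice" and "mouse" to the same term, which is exactly the kind of case where proper linguistic analysis helps.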

The point is: while TFxIDF (or TFxIDF^2, as in Lucene) is good, you should use it on resources annotated with meta-linguistic information. That can be hard, and it requires extensive knowledge of your core search engine, of grammar analysis (syntax), and of the document domain.

Short answer: The better technique is to use TFxIDF with light grammatical NLP annotations, and to rewrite both the query and the indexing.
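To make the weighting step concrete, the following is a small sketch of TFxIDF vectors compared with cosine similarity, written with the standard library only. The helper names (`tokenize`, `build_idf`, `tfidf_vector`, `cosine`), the sample texts, and the smoothed IDF variant are assumptions chosen for this example, and `tokenize` is just a placeholder for whatever NLP annotation step you actually use.

```python
import math
from collections import Counter


def tokenize(text):
    # Placeholder for the NLP step: lowercase and split on whitespace.
    return text.lower().split()


def build_idf(token_lists):
    """Compute a smoothed inverse document frequency for every indexed term."""
    n_docs = len(token_lists)
    doc_freq = Counter(term for tokens in token_lists for term in set(tokens))
    return {
        term: math.log((1 + n_docs) / (1 + count)) + 1.0
        for term, count in doc_freq.items()
    }


def tfidf_vector(tokens, idf):
    """Build a sparse TFxIDF vector; terms unseen at indexing time are ignored."""
    term_freq = Counter(tokens)
    return {t: c * idf[t] for t, c in term_freq.items() if t in idf}


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


docs = ["apple releases new phone", "football match tonight", "new phone review"]
doc_tokens = [tokenize(d) for d in docs]
idf = build_idf(doc_tokens)
doc_vectors = [tfidf_vector(tokens, idf) for tokens in doc_tokens]

# The query goes through the same tokenize step as the documents.
query_vector = tfidf_vector(tokenize("phone review"), idf)
scores = [cosine(query_vector, vec) for vec in doc_vectors]
best = max(range(len(docs)), key=scores.__getitem__)
```

Here `best` is the index of the document closest to the query. Swapping `tokenize` for a richer analysis step changes the terms being weighted without touching the rest of the code.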
