Better text documents clustering than tf/idf and cosine similarity?

走了就别回头了 2021-02-01 06:56

I'm trying to cluster the Twitter stream. I want to put each tweet into a cluster of tweets that talk about the same topic. I tried to cluster the stream using an online clustering algorit

3 answers
  •  北恋 (original poster)
    2021-02-01 07:36

    Long answer:

    TFxIDF is currently one of the best-known search methods. What you need is some preprocessing from Natural Language Processing (NLP). There are many resources that can help you with English (for example, the Python library 'nltk').

    You must apply the NLP analysis both to your queries (questions) and to your documents before indexing.

    The point is: while TFxIDF (or TFxIDF^2, as in Lucene) is good, you should use it on resources annotated with meta-linguistic information. That can be hard, and it requires extensive knowledge of your core search engine, of grammar analysis (syntax), and of the domain of the documents.

    Short answer: the better technique is to use TFxIDF with light grammatical NLP annotations, and to rewrite both the queries and the indexing.
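
To make the advice above concrete, here is a minimal sketch, not a definitive implementation, of running one shared normalization step over both the documents and the query before TFxIDF weighting and cosine similarity. Everything in it is an assumption made for illustration: the `normalize` function, the tiny stop-word list, the crude plural stripping (a stand-in for the real stemming or lemmatization that a library such as nltk would provide), and the smoothed IDF formula. It uses only the Python standard library.

```python
import math
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list; a real one would come from an NLP library.
STOP_WORDS = {"a", "an", "and", "the", "is", "are", "of", "to", "in", "on", "for"}


def normalize(text):
    """Lowercase, tokenize, drop stop words, and crudely strip a trailing plural 's'."""
    tokens = re.findall(r"[a-z0-9#@']+", text.lower())
    kept = [t for t in tokens if t not in STOP_WORDS]
    return [t[:-1] if len(t) > 3 and t.endswith("s") else t for t in kept]


def weigh(tokens, idf):
    """TFxIDF weights for one token list; terms unknown to the index are ignored."""
    counts = Counter(tokens)
    return {t: c * idf[t] for t, c in counts.items() if t in idf}


def build_index(documents):
    """Return (idf, vectors) with one sparse TFxIDF vector per document."""
    token_lists = [normalize(d) for d in documents]
    n_docs = len(token_lists)
    doc_freq = Counter(term for tokens in token_lists for term in set(tokens))
    # Smoothed IDF (an assumption): terms found in every document keep a weight above 0.
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1.0 for t, df in doc_freq.items()}
    return idf, [weigh(tokens, idf) for tokens in token_lists]


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rank(query, documents):
    """Score each document against the query, using the same normalize() for both."""
    idf, vectors = build_index(documents)
    q = weigh(normalize(query), idf)
    scores = [cosine(q, v) for v in vectors]
    order = sorted(range(len(documents)), key=lambda i: scores[i], reverse=True)
    return order, scores


tweets = [
    "The new phones are on sale",
    "Football match tonight",
    "Phone sales rise in Europe",
]
order, scores = rank("phone sale", tweets)
for i in order:
    print(f"{scores[i]:.3f}  {tweets[i]}")
```

Because `normalize` runs on both sides, "phones" and "sales" in the tweets line up with "phone" and "sale" in the query, and the football tweet scores 0. Without the shared preprocessing, the first tweet would match only on "sale" and the third only on "phone".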
