Better text document clustering than tf-idf and cosine similarity?


走了就别回头了 2021-02-01 06:56

I'm trying to cluster the Twitter stream. I want to put each tweet into a cluster of tweets that talk about the same topic. I tried to cluster the stream using an online clustering algorit

3 answers

轻奢々 (original poster)

2021-02-01 07:25

In my experience, cosine similarity on latent semantic analysis (LSA/LSI) vectors works a lot better than raw tf-idf for text clustering, though I admit I haven't tried it on Twitter data. In particular, it tends to take care of the sparsity problem that you're encountering, where the documents just don't contain enough common terms.
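
To make the sparsity point concrete, here is a minimal sketch using only NumPy. The toy tweets, the smoothed idf formula, and the helper names (tfidf_matrix, lsa, cosine) are assumptions made for illustration, not anything from the question. Tweets 0 and 1 share no words, so their raw tf-idf cosine similarity is zero, but tweet 2 uses the vocabulary of both, which lets a truncated SVD place them in the same latent direction.

```python
import numpy as np

# Toy "tweets" (made up for illustration). Tweets 0 and 1 share no words,
# but tweet 2 uses the vocabulary of both.
tweets = [
    "car engine",
    "automobile motor",
    "car engine automobile motor",
    "banana fruit",
    "apple fruit banana",
]

def tfidf_matrix(docs):
    """Return an L2-normalised tf-idf matrix (documents x terms)."""
    tokens = [d.split() for d in docs]
    vocab = sorted({w for doc in tokens for w in doc})
    col = {w: j for j, w in enumerate(vocab)}
    counts = np.zeros((len(docs), len(vocab)))
    for i, doc in enumerate(tokens):
        for w in doc:
            counts[i, col[w]] += 1
    df = (counts > 0).sum(axis=0)                  # document frequency per term
    idf = np.log((1 + len(docs)) / (1 + df)) + 1   # smoothed idf (one common variant)
    x = counts * idf
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def lsa(x, k):
    """Project documents onto the top-k singular directions (truncated SVD)."""
    u, s, _ = np.linalg.svd(x, full_matrices=False)
    return u[:, :k] * s[:k]

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

X = tfidf_matrix(tweets)
Z = lsa(X, k=2)

raw_sim = cosine(X[0], X[1])    # raw tf-idf: no shared terms
lsa_sim = cosine(Z[0], Z[1])    # latent space: linked through tweet 2
cross_sim = cosine(Z[0], Z[3])  # latent space: car tweet vs. fruit tweet
print(raw_sim, lsa_sim, cross_sim)
```

On this toy corpus the raw tf-idf similarity between tweets 0 and 1 is zero, the LSA similarity is close to 1, and the car and fruit tweets stay essentially orthogonal. The rows of Z can then be fed to whatever clustering algorithm you already use, still with cosine similarity. On real data the number of latent dimensions k would need tuning, and the separation will be far less clean than here.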

Topic models such as LDA might work even better.
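
For a rough idea of what the LDA route looks like, below is a small collapsed Gibbs sampler written from scratch with NumPy. It is only a sketch: the function name lda_gibbs, the hyperparameters alpha and beta, the iteration count, and the toy tweets are all assumptions chosen for illustration, and in practice you would more likely reach for an existing library implementation. Each tweet ends up with a distribution over topics, and the most probable topic can serve as its cluster label.

```python
import numpy as np

def lda_gibbs(docs, n_topics, n_iter=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA; returns document-topic proportions."""
    rng = np.random.default_rng(seed)
    vocab = sorted({w for doc in docs for w in doc})
    idx = {w: i for i, w in enumerate(vocab)}
    D, V, K = len(docs), len(vocab), n_topics

    ndk = np.zeros((D, K))   # topic counts per document
    nkw = np.zeros((K, V))   # word counts per topic
    nk = np.zeros(K)         # total words assigned to each topic

    # Random initial topic assignment for every word occurrence.
    z = []
    for d, doc in enumerate(docs):
        zd = rng.integers(K, size=len(doc))
        z.append(zd)
        for w, k in zip(doc, zd):
            ndk[d, k] += 1
            nkw[k, idx[w]] += 1
            nk[k] += 1

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k, v = z[d][i], idx[w]
                # Remove the current assignment from the counts.
                ndk[d, k] -= 1
                nkw[k, v] -= 1
                nk[k] -= 1
                # Conditional distribution over topics for this word.
                p = (ndk[d] + alpha) * (nkw[:, v] + beta) / (nk + V * beta)
                k = rng.choice(K, p=p / p.sum())
                # Record the newly sampled assignment.
                z[d][i] = k
                ndk[d, k] += 1
                nkw[k, v] += 1
                nk[k] += 1

    # Smoothed estimate of each document's topic proportions.
    return (ndk + alpha) / (ndk.sum(axis=1, keepdims=True) + K * alpha)

# Toy tokenised "tweets" (made up for illustration).
tweets = [
    "car engine".split(),
    "automobile motor".split(),
    "car engine automobile motor".split(),
    "banana fruit".split(),
    "apple fruit banana".split(),
]

theta = lda_gibbs(tweets, n_topics=2)
labels = theta.argmax(axis=1)   # most probable topic as a cluster label
print(theta.round(2), labels)
```

The sampler is stochastic, so the topic numbering and even the grouping can change with the seed, especially on texts as short as tweets. Whether this actually beats LSA on Twitter data is something you would have to test.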
