Better text documents clustering than tf/idf and cosine similarity?

走了就别回头了 2021-02-01 06:56

I'm trying to cluster the Twitter stream. I want to assign each tweet to a cluster of tweets that talk about the same topic. I tried to cluster the stream using an online clustering algorit

3 answers
  •  轻奢々 (original poster)
    2021-02-01 07:25

    In my experience, cosine similarity on latent semantic analysis (LSA/LSI) vectors works a lot better than raw tf-idf for text clustering, though I admit I haven't tried it on Twitter data. In particular, it tends to take care of the sparsity problem that you're encountering, where the documents just don't contain enough common terms.
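To make the LSA suggestion concrete, here is a minimal sketch using scikit-learn. The toy documents, the choice of 2 components and 2 clusters, and the use of batch k-means are all assumptions for illustration; the question is about an online stream, which this sketch does not handle. Normalizing the LSA vectors to unit length makes Euclidean k-means behave like clustering by cosine similarity.

```python
from sklearn.cluster import KMeans
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer

# Toy stand-ins for tweets (made up for illustration, not real data).
docs = [
    "the striker scored a late goal in the football match",
    "great football match, the goal came in extra time",
    "the team won the match with a stunning goal",
    "new phone released with a faster processor and camera",
    "the phone camera and processor are a big upgrade",
    "this laptop has a fast processor and a sharp screen",
]

# tf-idf -> truncated SVD (this is LSA) -> unit-length rows.
# n_components=2 is only sensible for this tiny corpus; a real corpus
# would typically use far more dimensions.
lsa = make_pipeline(
    TfidfVectorizer(stop_words="english"),
    TruncatedSVD(n_components=2, random_state=0),
    Normalizer(copy=False),
)
X = lsa.fit_transform(docs)

# On unit-length vectors, squared Euclidean distance is a monotonic
# function of cosine similarity, so plain k-means is a reasonable proxy.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

for label, doc in zip(labels, docs):
    print(label, doc)
```

Because the SVD maps documents into a dense low-dimensional space, two documents can end up close together even when they share few or no literal terms, which is the sparsity benefit described above.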

    Topic models such as LDA might work even better.
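A comparable sketch for the LDA suggestion, again with scikit-learn and made-up toy documents. The number of topics and the rule of assigning each document to its most probable topic are assumptions for illustration, not something the answer prescribes. LDA is fitted on raw term counts rather than tf-idf weights.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Toy stand-ins for tweets (made up for illustration, not real data).
docs = [
    "the striker scored a late goal in the football match",
    "great football match, the goal came in extra time",
    "the team won the match with a stunning goal",
    "new phone released with a faster processor and camera",
    "the phone camera and processor are a big upgrade",
    "this laptop has a fast processor and a sharp screen",
]

# LDA models word counts, so use CountVectorizer instead of tf-idf.
counts = CountVectorizer(stop_words="english").fit_transform(docs)

# n_components is the number of topics; 2 is an arbitrary choice here.
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(counts)  # one topic distribution per document

# Hard cluster assignment: pick the most probable topic for each document.
labels = doc_topics.argmax(axis=1)

for label, doc in zip(labels, docs):
    print(label, doc)
```

The rows of `doc_topics` can also be used directly as soft cluster memberships, or compared with cosine similarity in the same way as the LSA vectors.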
