What are some good methods to find the “relatedness” of two bodies of text?

小鲜肉 2021-02-02 03:46

Here's the problem: I have a few thousand small text snippets, anywhere from a few words to a few sentences; the largest snippet is about 2 KB on disk. I want to be able to c

7 answers
  • 2021-02-02 04:29

    This is quite doable for reasonably large texts, but harder for smaller ones.

    I did it once like this, and it worked pretty well:

    • Filter out all "general" words (a, an, the, in, etc.); this removes about 10-30% of the words.
    • Count the frequencies of the remaining words and store the top x most frequent ones; these are your topics.
    • As an extra step, you can create groups of 2, 3, or 4 consecutive words and compare them with the groups in other texts. I used this as a measure of plagiarism.
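
The three steps above could be sketched roughly as follows in Python. This is only an illustration under stated assumptions, not the exact code I used: the stop-word list is a tiny placeholder, the function names (`top_topics`, `shingles`, `relatedness`) are made up for the example, and scoring the overlap with Jaccard similarity is my own choice here, since the steps do not prescribe how the topics or word groups should be compared.

```python
import re
from collections import Counter

# Assumption: a tiny illustrative stop-word list; a real one would be much longer.
STOP_WORDS = {"a", "an", "the", "in", "on", "of", "and", "or", "to", "is", "are", "it"}

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())

def top_topics(text, x=5):
    """Steps 1 and 2: drop stop words, then keep the x most frequent words."""
    words = [w for w in tokenize(text) if w not in STOP_WORDS]
    return [w for w, _ in Counter(words).most_common(x)]

def shingles(text, n=3):
    """Step 3: the set of all groups of n consecutive words."""
    words = tokenize(text)
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def jaccard(a, b):
    """Overlap of two sets: |a & b| / |a | b| (0.0 when both are empty)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def relatedness(t1, t2, x=5, n=3):
    """Return (topic overlap, word-group overlap), each between 0 and 1."""
    topic_score = jaccard(set(top_topics(t1, x)), set(top_topics(t2, x)))
    shingle_score = jaccard(shingles(t1, n), shingles(t2, n))
    return topic_score, shingle_score

if __name__ == "__main__":
    a = "The cat sat on the mat in the house"
    b = "The cat sat on the mat in the garden"
    print(relatedness(a, b))
```

For very short snippets the word-group score is often zero, because a snippet with fewer than n words produces no groups at all; a smaller n or the topic score alone may be more useful there.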