How can I find only 'interesting' words from a corpus?

面向向阳花 2020-12-25 08:44

I am parsing sentences. I want to know the relevant content of each sentence, defined loosely as "semi-unique words" in relation to the rest of the corpus. Something simil

4 answers
  •  一生所求
    2020-12-25 09:06

    Take a look at this article (Level statistics of words: Finding keywords in literary texts and symbolic sequences, published in Phys. Rev. E).

    The picture on the first page, together with its caption, explains the crucial observation. In Don Quixote, the words "but" and "Quixote" appear with similar frequencies, but their spectra are quite different (occurrences of "Quixote" are clustered while occurrences of "but" are more evenly spaced). Therefore, "Quixote" can be classified as an interesting word (keyword) while "but" is ignored.

    It might or might not be what you're looking for, but I guess it won't hurt to be familiar with this result.
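
    To make the clustering idea concrete, here is a rough Python sketch. It is not the paper's exact statistic or normalization, only a simplified stand-in: for each word it takes the gaps between successive occurrences and divides their standard deviation by their mean. The function name clustering_scores, the tokenizing regex, and the min_count cutoff are all assumptions made for illustration.

```python
from collections import defaultdict
import math
import re


def clustering_scores(text, min_count=5):
    """Map each frequent word to sigma = std(gaps) / mean(gaps).

    The gaps are the distances, in tokens, between successive
    occurrences of the word. This is a simplified illustration,
    not the normalized statistic used in the paper.
    """
    # Naive tokenizer (an assumption): lowercase runs of letters and apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())

    # Record the token index of every occurrence of every word.
    positions = defaultdict(list)
    for index, token in enumerate(tokens):
        positions[token].append(index)

    scores = {}
    for word, where in positions.items():
        if len(where) < min_count:
            continue  # too few occurrences to estimate a spread
        gaps = [b - a for a, b in zip(where, where[1:])]
        mean = sum(gaps) / len(gaps)
        variance = sum((g - mean) ** 2 for g in gaps) / len(gaps)
        scores[word] = math.sqrt(variance) / mean
    return scores
```

    A rare word scattered at random gives a value near 1, perfectly even spacing gives 0, and bursts of occurrences separated by long gaps push the value well above 1. Sorting the result in descending order, for example with sorted(scores, key=scores.get, reverse=True), would put the most clustered words first as keyword candidates.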
