I am parsing sentences. I want to know the relevant content of each sentence, defined loosely as "semi-unique words" in relation to the rest of the corpus. Something simil
Take a look at this article (Level statistics of words: Finding keywords in literary texts and symbolic sequences, published in Phys. Rev. E).
The picture on the first page, together with its caption, explains the crucial observation. In Don Quixote, the words "but" and "Quixote" appear with similar frequencies, but their spectra are quite different: occurrences of "Quixote" are clustered, while occurrences of "but" are more evenly spaced. Therefore "Quixote" can be classified as an interesting word (a keyword), while "but" is ignored.
It might or might not be what you're looking for, but I guess it won't hurt to be familiar with this result.
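To make the clustering idea concrete, here is a rough sketch of how one might score words by how unevenly their occurrences are spaced. It is a simplification under my own assumptions, not necessarily the exact statistic used in the article: for each word it takes the gaps between successive positions and computes the standard deviation of the gaps divided by their mean. Evenly spaced words score near 0, clustered words score well above 1. The names `tokenize`, `gap_clustering`, and `min_count` are made up for illustration.

```python
import re
from statistics import mean, pstdev


def tokenize(text):
    """Very naive tokenizer (an assumption): lowercase runs of letters/apostrophes."""
    return re.findall(r"[a-z']+", text.lower())


def gap_clustering(tokens, min_count=5):
    """Map each sufficiently frequent word to std(gaps) / mean(gaps).

    gaps are the distances between successive occurrences of the word.
    Higher values mean the occurrences are more clustered.
    """
    # Record the positions at which every token occurs.
    positions = {}
    for i, tok in enumerate(tokens):
        positions.setdefault(tok, []).append(i)

    scores = {}
    for word, pos in positions.items():
        # Need at least two occurrences to have a gap at all.
        if len(pos) < max(min_count, 2):
            continue
        gaps = [b - a for a, b in zip(pos, pos[1:])]
        # Positions are strictly increasing, so mean(gaps) is always > 0.
        scores[word] = pstdev(gaps) / mean(gaps)
    return scores


def top_keywords(text, n=10, min_count=5):
    """Return the n highest-scoring (word, score) pairs for a text."""
    scores = gap_clustering(tokenize(text), min_count=min_count)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:n]
```

Something like `top_keywords(open("corpus.txt").read())` would then list candidate keywords (the file name is just a placeholder). Rare words give noisy scores, which is why the sketch skips anything below `min_count` occurrences; that threshold is a guess and would need tuning for your corpus.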