Calculate correlation coefficient between words?

孤街浪徒 提交于 2021-01-27 06:32:09

问题


For a text analysis program, I would like to analyze the co-occurrence of certain words in a text. For example, I would like to see that e.g. the words "Barack" and "Obama" appear more often together (i.e. have a positive correlation) than others.

This does not seem to be that difficult. However, to be honest, I only know how to calculate the correlation between two numbers, but not between two words in a text.

  1. How can I best approach this problem?
  2. How can I calculate the correlation between words?

I thought of using conditional probabilities, since e.g. Barack Obama is much more probable than Obama Barack; however, the problem I try to solve is much more fundamental and does not depend on the ordering of the words


回答1:


The Ngram Statistics Package (NSP) is devoted precisely to this task. They have a paper online which describes the association measures they use. I haven't used the package myself, so I cannot comment on its reliability/requirements.




回答2:


Well a simple way to solve your question is by shaping the data in a 2x2 matrix

            obama | not obama
barack      A       B
not barack  C       D

and score all occuring bi-grams in the matrix. That way you can for instance use simple chi squared.




回答3:


I don't know how this is commonly done, but I can think of one crude way to define a notion of correlation that captures word adjacency.

Suppose the text has length N, say it is an array

text[0], text[1], ..., text[N-1]

Suppose the following words appear in the text

word[0], word[1], ..., word[k]

For each word word[i], define a vector of length N-1

X[i] = array(); // of  length N-1

as follows: the ith entry of the vector is 1 if the word is either the ith word or the (i+1)th word, and zero otherwise.

// compute the vector X[i]
for (j = 0:N-2){
  if (text[j] == word[i] OR text[j+1] == word[i])
    X[i][j] = 1;
  else
    X[i][j] = 0;
}

Then you can compute the correlation coefficient between word[a] and word[b] as the dot product between X[a] and X[b] (note that the dot product is the number of times these words are adjacent) divided by the lenghts (the length is the square root of the number of appearances of the word, well maybe twice that). Call this quantity COR(X[a],X[b]). Clearly COR(X[a],X[a]) = 1, and COR(X[a],X[b]) is larger if word[a], word[b] are often adjacent.

This can be generalized from "adjacent" to other notions of near - for example we could have chosen to use 3 word (or 4, 5, etc.) blocks instead. One can also add weights, probably do many more things as well if desired. One would have to experiment to see what is useful, if any of it is of use at all.




回答4:


This problem sounds like a bigram, a sequence of two "tokens" in a larger body of text. See this Wikipedia entry, which has additional links to the more general n-gram problem.

If you want to do a full analysis, you'd most likely take any given pair of words and do a frequency analysis. E.g., the sentence "Barack Obama is the Democratic candidate for President," has 8 words, so there are 8 choose 2 = 28 possible pairs.

You can then ask statistical questions like, "in how many pairs does 'Obama' follow 'Barack', and in how many pairs does some other word (not 'Obama') follow 'Barack'? In this case, there are 7 pairs that include 'Barack' but in only one of them is it paired with 'Obama'.

Do the same for every possible word pair (e.g., "in how many pairs does 'candidate' follow 'the'?"), and you've got a basis for comparison.



来源:https://stackoverflow.com/questions/12917928/calculate-correlation-coefficient-between-words

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!