Second-order cooccurrence of terms in texts

Submitted by 半世苍凉 on 2021-02-08 08:35:18

Question


Basically, I want to reimplement this video.

Given a corpus of documents, I want to find the terms that are most similar to each other.

I was able to generate a cooccurrence matrix using this SO thread and used the video to generate an association matrix. Next, I would like to generate a second-order cooccurrence matrix.

Problem statement: Consider a matrix where the rows of the matrix correspond to a term and the entries in the rows correspond to the top k terms similar to that term. Say, k = 4, and we have n terms in our dictionary, then the matrix M has n rows and 4 columns.

HAVE:

M = [[18,34,54,65],   # Term IDs similar to Term t_0
     [18,12,54,65],   # Term IDs similar to Term t_1
     ...
     [21,43,55,78]]   # Term IDs similar to Term t_n.

So, M contains for each term ID the most similar term IDs. Now, I would like to check how many of those similar terms match. In the example of M above, it seems that term t_0 and term t_1 are quite similar, because three out of four terms match, whereas terms t_0 and t_n are not similar, because no terms match. Let's write M as a series of lists.

M = [list_0,   # Term IDs similar to Term t_0
     list_1,   # Term IDs similar to Term t_1
     ...
     list_n]   # Term IDs similar to Term t_n.

WANT:

C = [[f(list_0, list_0), f(list_0, list_1), ..., f(list_0, list_n)],
     [f(list_1, list_0), f(list_1, list_1), ..., f(list_1, list_n)],
     ...
     [f(list_n, list_0), f(list_n, list_1), ..., f(list_n, list_n)]]

I'd like to find the matrix C that has as its elements a function f applied to the lists of M. f(a,b) measures the degree of similarity between two lists a and b. Going with the example above, the degree of similarity between t_0 and t_1 should be high, whereas the degree of similarity between t_0 and t_n should be low.
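For instance, the overlap described above can be counted with a plain set intersection (a minimal sketch of one possible f, using the example lists):

```python
# count how many term IDs two top-k lists share
list_0 = [18, 34, 54, 65]  # IDs similar to t_0
list_1 = [18, 12, 54, 65]  # IDs similar to t_1
list_n = [21, 43, 55, 78]  # IDs similar to t_n

print(len(set(list_0) & set(list_1)))  # 3 shared IDs -> similar
print(len(set(list_0) & set(list_n)))  # 0 shared IDs -> dissimilar
```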

My questions:

  1. What is a good choice for comparing the ordering of two lists? That is, what is a good choice for function f?
  2. Is there a transformation already available that takes as an input a matrix like M and produces a matrix like C? Preferably a python package?

Thank you, r0f1


Answer 1:


You asked two questions, one somewhat open-ended (the first one) and one that has a definitive answer, so I will start with the second one:

Is there a transformation already available that takes as an input a matrix like M and produces a matrix like C? Preferably, a python package?

The answer is yes: the scipy.spatial.distance module contains a function, pdist, that takes a matrix like M and produces a matrix like C. The following example shows it in action:

import numpy as np
from scipy.spatial.distance import pdist, squareform

# initial data
M = [[18, 34, 54, 65],
     [18, 12, 54, 65],
     [21, 43, 55, 78]]

# convert to numpy array
arr = np.array(M)

result = squareform(pdist(arr, metric='euclidean'))
print(result)

Output

[[ 0.         22.         16.1245155 ]
 [22.          0.         33.76388603]
 [16.1245155  33.76388603  0.        ]]

As seen from the example above, pdist takes the M matrix and generates a C matrix. Note that the output of pdist is a condensed distance matrix, so you need to convert it to square form using squareform. Now on to the first question:
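To see what "condensed" means, note that pdist alone returns only the upper-triangle entries as a flat vector, one per pair of rows (a small sketch):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

arr = np.array([[18, 34, 54, 65],
                [18, 12, 54, 65],
                [21, 43, 55, 78]])

condensed = pdist(arr, metric='euclidean')
print(condensed.shape)               # (3,): the pairs d(0,1), d(0,2), d(1,2)
print(squareform(condensed).shape)   # (3, 3): the full symmetric matrix
```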

What is a good choice for comparing the ordering of two lists? That is, what is a good choice for function f?

Given that order does matter in your particular case, I suggest you look at rank correlation coefficients such as Kendall's tau or Spearman's rho; both are provided in the scipy.stats package, along with a whole bunch of other coefficients. Usage example:

import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.stats import kendalltau, spearmanr

# distance function
kendall = lambda x, y : kendalltau(x, y)[0]
spearman = lambda x, y : spearmanr(x, y)[0]


# initial data
M = [[18, 34, 54, 65],
     [18, 12, 54, 65],
     [21, 43, 55, 78]]

# convert to numpy array
arr = np.array(M)

# compute kendall C and convert to square form
kendall_result = 1 - squareform(pdist(arr, kendall))  # subtract from 1 because you want a similarity
print(kendall_result)
print()

# compute spearman C and convert to square form
spearman_result = 1 - squareform(pdist(arr, spearman))  # subtract from 1 because you want a similarity
print(spearman_result)
print()

Output

[[1.         0.33333333 0.        ]
 [0.33333333 1.         0.33333333]
 [0.         0.33333333 1.        ]]

[[1.  0.2 0. ]
 [0.2 1.  0.2]
 [0.  0.2 1. ]]

If those do not fit your needs, you can take a look at the Hamming distance, for example:

import numpy as np
from scipy.spatial.distance import pdist, squareform

# initial data
M = [[18, 34, 54, 65],
     [18, 12, 54, 65],
     [21, 43, 55, 78]]

# convert to numpy array
arr = np.array(M)

# compute Hamming-based similarity C and convert to square form
result = 1 - squareform(pdist(arr, 'hamming'))
print(result)

Output

[[1.   0.75 0.  ]
 [0.75 1.   0.  ]
 [0.   0.   1.  ]]

In the end, the choice of the similarity function will depend on your final application, so you will need to try out different functions and see which ones fit your needs. Both scipy.spatial.distance and scipy.stats provide a plethora of distance and coefficient functions you can try out.
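If order should be ignored entirely, one more option worth trying (a suggestion, not part of the answers above) is the Jaccard similarity over the sets of term IDs; pdist accepts any Python callable as the metric:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Jaccard similarity: shared IDs divided by total distinct IDs
jaccard = lambda u, v: len(set(u) & set(v)) / len(set(u) | set(v))

M = [[18, 34, 54, 65],
     [18, 12, 54, 65],
     [21, 43, 55, 78]]
arr = np.array(M)

result = squareform(pdist(arr, jaccard))
np.fill_diagonal(result, 1.0)  # squareform leaves zeros on the diagonal
print(result)
# [[1.  0.6 0. ]
#  [0.6 1.  0. ]
#  [0.  0.  1. ]]
```

Note that this is a similarity rather than a distance, so the diagonal is filled in by hand; pdist itself does not care, since it only evaluates the callable on each pair of rows.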

Further

  1. The following paper contains a section on list similarity



Answer 2:


In fact, cosine similarity might not be too bad in this case. The problem is that you don't want to use the index vectors directly (i.e. [18,34,54,65] and so on in your case); instead, you want vectors of length n that are zero everywhere except at the positions in your index vector. Luckily, you don't have to create those vectors explicitly: you can just count how many indices the two index vectors have in common:

def f(u, v):
    return len(set(u).intersection(set(v)))

Here, I omitted a constant normalization factor k. There are more elaborate things one could do (for example a TF-IDF kernel), but I would stick with this to start.
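With the normalization factor restored, f could look like this (a sketch, assuming k is the list length, here 4):

```python
def f_normalized(u, v, k=4):
    # fraction of the top-k term IDs the two lists share
    return len(set(u).intersection(v)) / k

print(f_normalized([18, 34, 54, 65], [18, 12, 54, 65]))  # 0.75
print(f_normalized([18, 34, 54, 65], [21, 43, 55, 78]))  # 0.0
```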

In order to run this efficiently using numpy, you would want to do two things:

  1. Convert f to a ufunc, i.e. a numpy vectorized function. You can do that with uf = np.frompyfunc(f, 2, 1) (assuming that you did import numpy as np at some point).

  2. Store M as a 1d array of lists (basically what you state in your second code listing). That's a little trickier, because numpy tries to be smart here, but you want something else. Here is how to do that:

n = len(M)
Marray = np.empty(n, dtype='O')  # dtype='O' allows you to have elements of type list
for i in range(n):
    Marray[i] = M[i]

Now, Marray contains essentially what you described in your second code listing. You can then use the new ufunc's outer method to get your similarity matrix. Here is how all of that would work together for your M from the example (assuming n=3):

M = [[18, 34, 54, 65],
     [18, 12, 54, 65],
     [21, 43, 55, 78]]
n = len(M)  # i.e. 3
uf = np.frompyfunc(f, 2, 1)
Marray = np.empty(n, dtype='O')
for i in range(n):
    Marray[i] = M[i]
similarities = uf.outer(Marray, Marray).astype('d')  # convert to float instead of object dtype
print(similarities)
# array([[4., 3., 0.],
#        [3., 4., 0.],
#        [0., 0., 4.]])

I hope that answers your questions.




Answer 3:


I would suggest cosine similarity, as every list is a vector.

from sklearn.metrics.pairwise import cosine_similarity

# cosine_similarity expects 2-D inputs, so wrap each list
cosine_similarity([list0], [list1])
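Note that sklearn's cosine_similarity also accepts the whole matrix at once, which yields the full pairwise matrix C in one call (a sketch, with the caveat from the answer above that raw term-ID vectors may not be meaningful inputs to cosine similarity):

```python
from sklearn.metrics.pairwise import cosine_similarity

M = [[18, 34, 54, 65],
     [18, 12, 54, 65],
     [21, 43, 55, 78]]

C = cosine_similarity(M)  # shape (3, 3); C[i, j] is the cosine of rows i and j
print(C.shape)
```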


Source: https://stackoverflow.com/questions/53833387/second-order-cooccurrence-of-terms-in-texts
