Calculating distance between word/document vectors from a nested dictionary

Submitted by 谁都会走 on 2019-12-23 06:45:15

Question


I have a nested dictionary as such:

myDict = {'a': {1:2, 2:163, 3:12, 4:67, 5:84}, 
          'about': {1:27, 2:45, 3:21, 4:10, 5:15}, 
          'apple': {1:0, 2: 5, 3:0, 4:10, 5:0}, 
          'anticipate': {1:1, 2:5, 3:0, 4:8, 5:7}, 
          'an': {1:3, 2:15, 3:1, 4:312, 5:100}}
  • The outer key is a word,
  • the inner keys are file/document ids,
  • the values are the number of times the word (the outer key) occurs in that file/document.

How do I calculate the sum of the squared values for each inner key? For example, for inner key 1, I should get:

2^2 + 27^2 + 0^2 + 1^2 + 3^2

because in document 1, 'a' occurs 2 times, 'about' 27 times, 'apple' 0 times, 'anticipate' 1 time and 'an' 3 times.


Given the nested dictionary, how do I find the distance between a pair of files/documents?

For example, the distance between file/document ids 1 and 2 would be calculated from these two vectors:

doc1 =  {'a':2, 'about':27, 'apple':0, 'anticipate':1, 'an':3} # (i.e. inner key `1`)
doc2 =  {'a':163, 'about':45, 'apple':5, 'anticipate':5, 'an':15} # (i.e. inner key `2`)

I want to know how different/similar the documents are, so how do I get a single floating number as a distance score for the two documents?

How do I calculate the dot product across these two documents?

I've tried calculating a single value for each document by considering:

((2*0) + (27*0) + (3*1) + (1*1) + (0*1)) / (magnitude of file vector * magnitude of search phrase vector)

Using my code as such:

vecDist = {}
for word in search:
    for fileNum in myDict.iteritems():
        vecDist[fileNum] = "dotproduct" / magnitudeFileVec[fileNum] * magnitudeSearchVec

Answer 1:


The first bit is easy enough. You want to build up a dictionary that maps each file number to the sum of the squares of its values. Something like this (untested) should do it:

fileVectors = {}

for wordDict in myDict.itervalues():
    for fileNumber, wordCount in wordDict.iteritems():
        fileVectors[fileNumber] = fileVectors.get(fileNumber, 0) + (wordCount ** 2)
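
The same dictionary-only approach can be carried through to a single similarity score. The following is a minimal sketch, assuming cosine similarity is an acceptable measure; the helper names `doc_vector`, `dot` and `cosine_similarity` are made up for illustration and are not part of the original answer. It uses `.items()` and single-argument `print(...)` so it should run on both Python 2.7 and Python 3.

```python
import math

myDict = {'a': {1: 2, 2: 163, 3: 12, 4: 67, 5: 84},
          'about': {1: 27, 2: 45, 3: 21, 4: 10, 5: 15},
          'apple': {1: 0, 2: 5, 3: 0, 4: 10, 5: 0},
          'anticipate': {1: 1, 2: 5, 3: 0, 4: 8, 5: 7},
          'an': {1: 3, 2: 15, 3: 1, 4: 312, 5: 100}}


def doc_vector(nested, doc_id):
    """Collect the counts for one document id as {word: count}."""
    return {word: counts.get(doc_id, 0) for word, counts in nested.items()}


def dot(u, v):
    """Dot product of two {word: count} vectors."""
    return sum(u[word] * v.get(word, 0) for word in u)


def cosine_similarity(u, v):
    """Dot product divided by the product of the two magnitudes."""
    mag_u = math.sqrt(dot(u, u))
    mag_v = math.sqrt(dot(v, v))
    if mag_u == 0 or mag_v == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot(u, v) / (mag_u * mag_v)


doc1 = doc_vector(myDict, 1)
doc2 = doc_vector(myDict, 2)

print(dot(doc1, doc1))                # sum of squares for document 1: 743
print(dot(doc1, doc2))                # dot product of documents 1 and 2: 1591
print(cosine_similarity(doc1, doc2))  # a float between 0 and 1 for count vectors
```

A score near 1 means the two documents use the words in similar proportions; a score near 0 means they share almost no vocabulary.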



Answer 2:


Firstly, your dictionary of dictionaries is a nice start for what you're doing, but it's too convoluted. Try using a numpy array instead:

import numpy as np

vocabulary = ['a', 'about', 'apple', 'anticipate', 'an']
matrix = [[2,27, 0, 1, 3], [163, 45, 5, 5, 15], [12, 21, 0, 0, 1], [67, 10, 10, 8, 312], [84, 15, 0, 7, 100]]

matrix = np.array(matrix)

print matrix 

[out]:

[[  2  27   0   1   3]
 [163  45   5   5  15]
 [ 12  21   0   0   1]
 [ 67  10  10   8 312]
 [ 84  15   0   7 100]]

Now you can clearly see that your rows are documents and your columns are words, with each cell holding a word count.

To access the term/word vector (i.e. the column):

for i, term in enumerate(vocabulary):
    vector = matrix[:,i]
    print term, vector, vector.sum()

[out]:

a [  2 163  12  67  84] 328
about [27 45 21 10 15] 118
apple [ 0  5  0 10  0] 15
anticipate [1 5 0 8 7] 21
an [  3  15   1 312 100] 431

To access the document vector (i.e. the row):

for i, document in enumerate(matrix):
    print i, document

[out]:

0 [ 2 27  0  1  3]
1 [163  45   5   5  15]
2 [12 21  0  0  1]
3 [ 67  10  10   8 312]
4 [ 84  15   0   7 100]

To access an individual row:

doc1 = matrix[0,:]
doc2 = matrix[1,:]

print doc1
print doc2

[out]:

[ 2 27  0  1  3]
[163  45   5   5  15]

To calculate sum of square values in a vector:

print np.sum(doc1**2)

[out]:

743
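
The same sum of squares can be taken for every document in one call by summing along the rows. This is a small sketch, assuming the `matrix` layout shown earlier (rows are documents, columns are words); the names `sum_of_squares` and `magnitudes` are illustrative.

```python
import numpy as np

matrix = np.array([[2, 27, 0, 1, 3],
                   [163, 45, 5, 5, 15],
                   [12, 21, 0, 0, 1],
                   [67, 10, 10, 8, 312],
                   [84, 15, 0, 7, 100]])

# axis=1 sums across the columns of each row, i.e. one value per document
sum_of_squares = np.sum(matrix ** 2, axis=1)

# the vector magnitude (Euclidean norm) is the square root of that sum
magnitudes = np.sqrt(sum_of_squares)

print(sum_of_squares)  # values: 743, 28869, 586, 102097, 17330
print(magnitudes)
```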

To calculate the dot product between two vectors, simply:

print np.dot(doc1, doc2)

[out]:

1591
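
To turn the dot product into the single floating-point score the question asks for, one common choice is cosine similarity: the dot product divided by the product of the two magnitudes. The sketch below also shows the plain Euclidean distance for comparison. Which measure suits the data is a judgment call that the original answer does not settle, and the variable names are illustrative.

```python
import numpy as np

doc1 = np.array([2, 27, 0, 1, 3])
doc2 = np.array([163, 45, 5, 5, 15])

# np.linalg.norm returns the Euclidean length, sqrt(sum of squares)
cosine_similarity = np.dot(doc1, doc2) / (np.linalg.norm(doc1) * np.linalg.norm(doc2))

# 0 means the same direction, 1 means no words in common (for count vectors)
cosine_distance = 1.0 - cosine_similarity

# straight-line distance between the two count vectors
euclidean_distance = np.linalg.norm(doc1 - doc2)

print(cosine_similarity)
print(cosine_distance)
print(euclidean_distance)
```

Cosine similarity ignores document length, so a long document and a short one with the same word proportions score as similar. Euclidean distance does not, so it tends to be dominated by raw count differences such as the 163 versus 2 occurrences of 'a' here.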

If you're totally stuck with the nested dictionaries, here's how to convert them into a numpy array:

import numpy as np

myDict = {'a': {1:2, 2:163, 3:12, 4:67, 5:84}, 
          'about': {1:27, 2:45, 3:21, 4:10, 5:15}, 
          'apple': {1:0, 2: 5, 3:0, 4:10, 5:0}, 
          'anticipate': {1:1, 2:5, 3:0, 4:8, 5:7}, 
          'an': {1:3, 2:15, 3:1, 4:312, 5:100}}

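# Fix the column (word) and row (document) order explicitly,
# because plain dict ordering is not guaranteed on older Pythons.
vocabulary = ['a', 'about', 'apple', 'anticipate', 'an']
doc_ids = sorted(myDict[vocabulary[0]])  # [1, 2, 3, 4, 5]

# One row per document, one column per word
matrix = np.array([[myDict[word][doc] for word in vocabulary] for doc in doc_ids])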

print matrix

[out]:

[[  2  27   0   1   3]
 [163  45   5   5  15]
 [ 12  21   0   0   1]
 [ 67  10  10   8 312]
 [ 84  15   0   7 100]]
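
Once the counts are in a matrix, the similarity of every pair of documents can be computed at once. This is a sketch, assuming cosine similarity and assuming no document row is all zeros (that would divide by zero); `unit_rows` and `similarity` are illustrative names.

```python
import numpy as np

matrix = np.array([[2, 27, 0, 1, 3],
                   [163, 45, 5, 5, 15],
                   [12, 21, 0, 0, 1],
                   [67, 10, 10, 8, 312],
                   [84, 15, 0, 7, 100]], dtype=float)

# Length of each document vector, kept as a column so it broadcasts per row
norms = np.linalg.norm(matrix, axis=1, keepdims=True)

# Scale every row to unit length
unit_rows = matrix / norms

# Dot product of every unit row with every other unit row:
# similarity[i, j] is the cosine similarity of document i and document j
similarity = unit_rows.dot(unit_rows.T)

print(similarity)
```

The result is symmetric with ones on the diagonal, and `similarity[0, 1]` is the score for the first two documents (ids 1 and 2 in the original dictionary).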


Source: https://stackoverflow.com/questions/27027680/calculating-distance-between-word-document-vectors-from-a-nested-dictionary
