Question
I have a nested dictionary as such:
myDict = {'a': {1:2, 2:163, 3:12, 4:67, 5:84},
          'about': {1:27, 2:45, 3:21, 4:10, 5:15},
          'apple': {1:0, 2:5, 3:0, 4:10, 5:0},
          'anticipate': {1:1, 2:5, 3:0, 4:8, 5:7},
          'an': {1:3, 2:15, 3:1, 4:312, 5:100}}
- The outer key is a word,
- the inner keys are file/document ids
- the values are the number of times the word (the outer key) occurs in that file/document.
How do I calculate the sum of the squared values for each inner key? For example, for the inner key 1, I should get:
2^2 + 27^2 + 0^2 + 1^2 + 3^2
because in file/document 1, 'a' appears 2 times, 'about' 27 times, 'apple' 0 times, 'anticipate' 1 time and 'an' 3 times.
Given the nested dictionary, how do I find the distance between a pair of files/documents?
For example, the distance between file/document ids 1 and 2 would be calculated from:
doc1 = {'a':2, 'about':27, 'apple':0, 'anticipate':1, 'an':3} # (i.e. inner key `1`)
doc2 = {'a':163, 'about':45, 'apple':5, 'anticipate':5, 'an':15} # (i.e. inner key `2`)
I want to know how different/similar the documents are, so how do I get a single floating number as a distance score for the two documents?
How do I calculate the dot product across these two documents?
I've tried calculating a single value for each document by considering:
((2*0) + (27*0) + (3*1) + (1*1) + (0*1)) / (magnitude of file vector * magnitude of search phrase vector)
Using my code as such:
vecDist = {}
for word in search:
    for fileNum in myDict.iteritems():
        vecDist[fileNum] = "dotproduct" / magnitudeFileVec[fileNum] * magnitudeSearchVec
Answer 1:
The first bit is easy enough. You want to build a dictionary that maps each file number to the sum of the squares of the word counts for that file. Something like this (untested) should do it:
# Maps file number -> sum of squared word counts for that file.
fileVectors = {}
for wordDict in myDict.values():
    for fileNumber, wordCount in wordDict.items():
        fileVectors[fileNumber] = fileVectors.get(fileNumber, 0) + (wordCount ** 2)
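The same kind of loop can be pushed one step further to get a single score for a pair of documents. Here is a minimal sketch for Python 3, assuming cosine similarity is the measure you want (the formula in the question looks like it). The helper names `doc_vector` and `cosine_similarity` are made up for illustration:

```python
import math

myDict = {'a': {1: 2, 2: 163, 3: 12, 4: 67, 5: 84},
          'about': {1: 27, 2: 45, 3: 21, 4: 10, 5: 15},
          'apple': {1: 0, 2: 5, 3: 0, 4: 10, 5: 0},
          'anticipate': {1: 1, 2: 5, 3: 0, 4: 8, 5: 7},
          'an': {1: 3, 2: 15, 3: 1, 4: 312, 5: 100}}


def doc_vector(nested, file_number):
    """Collect {word: count} for one file number (hypothetical helper)."""
    return {word: counts.get(file_number, 0) for word, counts in nested.items()}


def cosine_similarity(vec_a, vec_b):
    """Dot product divided by the product of the two magnitudes."""
    dot = sum(count * vec_b.get(word, 0) for word, count in vec_a.items())
    mag_a = math.sqrt(sum(c ** 2 for c in vec_a.values()))
    mag_b = math.sqrt(sum(c ** 2 for c in vec_b.values()))
    if mag_a == 0 or mag_b == 0:
        return 0.0  # assumption: an all-zero document matches nothing
    return dot / (mag_a * mag_b)


doc1 = doc_vector(myDict, 1)
doc2 = doc_vector(myDict, 2)
similarity = cosine_similarity(doc1, doc2)
print(similarity)
```

For documents 1 and 2 this comes out at roughly 0.34. A value near 1 means the two documents use the words in similar proportions, and 0 means they share no words at all.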
Answer 2:
Firstly, your dictionary of dictionaries is a nice start for what you're doing, but it's too convoluted. Try using a numpy array:
import numpy as np
vocabulary = ['a', 'about', 'apple', 'anticipate', 'an']
matrix = [[2, 27, 0, 1, 3], [163, 45, 5, 5, 15], [12, 21, 0, 0, 1], [67, 10, 10, 8, 312], [84, 15, 0, 7, 100]]
matrix = np.array(matrix)
print(matrix)
[out]:
[[  2  27   0   1   3]
 [163  45   5   5  15]
 [ 12  21   0   0   1]
 [ 67  10  10   8 312]
 [ 84  15   0   7 100]]
Now you can clearly see that your rows are documents and your columns are words, with each cell holding a word count.
To access the term/word vector (i.e. the column):
for i, term in enumerate(vocabulary):
    vector = matrix[:, i]
    print(term, vector, vector.sum())
[out]:
a [  2 163  12  67  84] 328
about [27 45 21 10 15] 118
apple [ 0  5  0 10  0] 15
anticipate [1 5 0 8 7] 21
an [  3  15   1 312 100] 431
To access the document vector (i.e. the row):
for i, document in enumerate(matrix):
    print(i, document)
[out]:
0 [ 2 27  0  1  3]
1 [163  45   5   5  15]
2 [12 21  0  0  1]
3 [ 67  10  10   8 312]
4 [ 84  15   0   7 100]
To access an individual row:
doc1 = matrix[0, :]
doc2 = matrix[1, :]
print(doc1)
print(doc2)
[out]:
[ 2 27  0  1  3]
[163  45   5   5  15]
To calculate the sum of the squared values in a vector:
print(np.sum(doc1**2))
[out]:
743
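If you need that sum for every document rather than just one, the whole matrix can be squared and summed row by row. A small sketch for Python 3, where the names `sum_of_squares` and `magnitudes` are only illustrative:

```python
import numpy as np

# Same document-by-word matrix as above: one row per document.
matrix = np.array([[2, 27, 0, 1, 3],
                   [163, 45, 5, 5, 15],
                   [12, 21, 0, 0, 1],
                   [67, 10, 10, 8, 312],
                   [84, 15, 0, 7, 100]])

# Square every count, then add along each row (axis=1 collapses the columns).
sum_of_squares = (matrix ** 2).sum(axis=1)
# The magnitude (vector length) of each document is the square root of that sum.
magnitudes = np.sqrt(sum_of_squares)

print(sum_of_squares)
print(magnitudes)
```

The first entry of `sum_of_squares` is the 743 shown above for `doc1`. `np.linalg.norm(matrix, axis=1)` should give the same magnitudes in one call.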
To calculate the dot product between two vectors, simply:
print(np.dot(doc1, doc2))
[out]:
1591
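The dot product and the sum of squares are the two ingredients needed for a single distance score. A sketch for Python 3, assuming either cosine similarity or Euclidean distance is what you are after (which one suits your documents better is your call):

```python
import numpy as np

doc1 = np.array([2, 27, 0, 1, 3])
doc2 = np.array([163, 45, 5, 5, 15])

# Cosine similarity: dot product over the product of the two vector lengths.
cosine_similarity = np.dot(doc1, doc2) / (np.linalg.norm(doc1) * np.linalg.norm(doc2))
# Cosine distance is commonly taken as 1 - similarity (0 means same direction).
cosine_distance = 1.0 - cosine_similarity
# Euclidean distance: length of the difference vector.
euclidean_distance = np.linalg.norm(doc1 - doc2)

print(cosine_similarity, cosine_distance, euclidean_distance)
```

Cosine similarity ignores document length and only compares the proportions of the words, whereas Euclidean distance grows when one document is simply much longer than the other, as `doc2` is here.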
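If you're totally stuck with the nested dictionaries, here's how to convert them into a numpy array:
import numpy as np
myDict = {'a': {1:2, 2:163, 3:12, 4:67, 5:84},
          'about': {1:27, 2:45, 3:21, 4:10, 5:15},
          'apple': {1:0, 2:5, 3:0, 4:10, 5:0},
          'anticipate': {1:1, 2:5, 3:0, 4:8, 5:7},
          'an': {1:3, 2:15, 3:1, 4:312, 5:100}}
# Fix the word (column) and document (row) order explicitly, so the result
# does not depend on dictionary iteration order. On Python 3.7+ the words
# keep the order in which they were written above.
vocabulary = list(myDict)
doc_ids = sorted(myDict[vocabulary[0]])  # assumes every word lists the same ids
# One row per word, one column per document ...
matrix = [[myDict[word][doc] for doc in doc_ids] for word in vocabulary]
matrix = np.array(matrix)
# ... then transpose so that rows are documents and columns are words.
matrix = np.transpose(matrix)
print(matrix)
[out]:
[[  2  27   0   1   3]
 [163  45   5   5  15]
 [ 12  21   0   0   1]
 [ 67  10  10   8 312]
 [ 84  15   0   7 100]]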
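Once the counts are in a matrix, the per-file score from the question (dot product divided by both magnitudes) can be computed for all documents at once. A sketch for Python 3, assuming the search phrase consists of the words 'an', 'anticipate' and 'apple', which is what the hand calculation in the question implies. `search` and `vecDist` are the question's names; `doc_ids`, `query` and `scores` are made up here:

```python
import numpy as np

myDict = {'a': {1: 2, 2: 163, 3: 12, 4: 67, 5: 84},
          'about': {1: 27, 2: 45, 3: 21, 4: 10, 5: 15},
          'apple': {1: 0, 2: 5, 3: 0, 4: 10, 5: 0},
          'anticipate': {1: 1, 2: 5, 3: 0, 4: 8, 5: 7},
          'an': {1: 3, 2: 15, 3: 1, 4: 312, 5: 100}}

vocabulary = list(myDict)                      # fixed word order for the columns
doc_ids = sorted(next(iter(myDict.values())))  # assumes every word lists the same ids
matrix = np.array([[myDict[word][doc] for word in vocabulary] for doc in doc_ids])

search = ['an', 'anticipate', 'apple']         # assumed search phrase
# 1 where the vocabulary word is part of the search phrase, 0 elsewhere.
query = np.array([1 if word in search else 0 for word in vocabulary])

# One dot product per document, divided by both magnitudes.
# Assumes no document (and not the query) is all zeros, else this divides by 0.
scores = matrix.dot(query) / (np.linalg.norm(matrix, axis=1) * np.linalg.norm(query))
vecDist = dict(zip(doc_ids, scores))
print(vecDist)
```

Each value in `vecDist` is the cosine similarity between one file and the search phrase, so the file with the highest value is the closest match.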
Source: https://stackoverflow.com/questions/27027680/calculating-distance-between-word-document-vectors-from-a-nested-dictionary