Question
I have a dictionary that maps IDs to lists of text values (ID: [text values]). Below is an excerpt.
data_dictionary = {
    52384: ['text2015', 'webnet'],
    18720: ['datascience', 'bigdata', 'links'],
    82465: ['biological', 'biomedics', 'datamining', 'datamodel', 'semantics'],
    73120: ['links', 'scientometrics'],
    22276: ['text2015', 'webnet'],
    97376: ['text2015', 'webnet'],
    43424: ['biological', 'biomedics', 'datamining', 'datamodel', 'semantics'],
    23297: ['links', 'scientometrics'],
    45233: ['webnet', 'conference', 'links']
}
I created a defaultdict that maps each unique list of text values to the IDs that have exactly that list.
from collections import defaultdict

dd = defaultdict(list)
for k, v in data_dictionary.items():
    dd[tuple(v)].append(k)
This gave each unique tuple of text values together with the IDs that have it:
{('text2015', 'webnet'): [52384, 22276, 97376], ('datascience', 'bigdata', 'links'): [18720], ('biological', 'biomedics', 'datamining', 'datamodel', 'semantics'): [82465, 43424], ('links', 'scientometrics'): [73120, 23297]}
Each of these IDs has a sum, which I extract from sum_dictionary.
def extract_sum(key_id, sum_dictionary):
    # Scan the dictionary for the matching ID and return its sum
    for k, v in sum_dictionary.items():
        if key_id == k:
            return v
    return None  # ID not present in sum_dictionary
The sum dictionary is shown here.
sum_dict = {52384: 1444856137000, 18720: 1444859841000, 82465: 1444856, 73120: 144481000, 22276: 1674856137000, 97376: 1812856137000, 43424: 5183856, 23297: 1614481000, 45233: 1276781300}
I want to write out every pair of IDs that share one or more text values, including pairs where one ID has more or fewer text values than the other. The result should be in the form of:
ID_1 ; ID_2 ; Sum_for_ID_1 ; Sum_for_ID_2 ; [one or more shared text values between ID_1 and ID_2]
where Sum_for_ID_1 < Sum_for_ID_2
45233 ; 52384 ; 1276781300 ; 1444856137000 ; ['webnet']
52384 ; 97376 ; 1444856137000 ; 1812856137000 ; ['text2015', 'webnet']
18720 ; 18720 ; 1444859841000 ; 1444859841000 ; ['datascience','bigdata', 'links']
73120 ; 23297 ; 144481000 ; 1614481000 ; ['links', 'scientometrics']
(per line)
I tried using itertools to build all combinations of all the words in the dictionary values, but iterating over them takes too long.
I also thought about applying a set operation across the keys to find shared values. Any ideas would really help.
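One possible direction for the speed concern is an inverted index (text value to IDs), so that only IDs that actually share a value are ever paired. The following is a minimal sketch under that assumption, not a method taken from the thread; the function name shared_values and the reduced sample data are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

# Reduced sample taken from the excerpt above
data_dictionary = {
    52384: ['text2015', 'webnet'],
    73120: ['links', 'scientometrics'],
    23297: ['links', 'scientometrics'],
    45233: ['webnet', 'conference', 'links'],
}

def shared_values(data):
    """Map each ID pair (a, b) with a < b to the sorted text values they share."""
    # Inverted index: text value -> set of IDs whose list contains it
    index = defaultdict(set)
    for key, values in data.items():
        for value in values:
            index[value].add(key)

    # Pair up only the IDs that appear under the same text value
    shared = defaultdict(set)
    for value, keys in index.items():
        for a, b in combinations(sorted(keys), 2):
            shared[(a, b)].add(value)
    return {pair: sorted(values) for pair, values in shared.items()}

pairs = shared_values(data_dictionary)
print(pairs[(23297, 73120)])  # ['links', 'scientometrics']
```

This helps most when each text value occurs under only a few IDs. If one value is shared by nearly every ID, the number of pairs is still quadratic.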
Answer 1:
This is not a full solution, but I believe it solves most of the problem:
In [1]: data_dictionary = {
   ...:     52384: ['text2015', 'webnet'],
   ...:     18720: ['datascience', 'bigdata', 'links'],
   ...:     82465: ['biological', 'biomedics', 'datamining', 'datamodel', 'semantics'],
   ...:     73120: ['links', 'scientometrics'],
   ...:     22276: ['text2015', 'webnet'],
   ...:     97376: ['text2015', 'webnet'],
   ...:     43424: ['biological', 'biomedics', 'datamining', 'datamodel', 'semantics'],
   ...:     23297: ['links', 'scientometrics'],
   ...:     45233: ['webnet', 'conference', 'links']
   ...: }
In [2]: from itertools import combinations
   ...:
   ...: intersections = []
   ...:
   ...: for first, second in combinations(data_dictionary.items(), r=2):
   ...:     intersection = set(first[1]) & set(second[1])
   ...:     if intersection:
   ...:         intersections.append((first[0], second[0], list(intersection)))
   ...:
In [3]: intersections
Out[3]:
[(52384, 22276, ['webnet', 'text2015']),
 (52384, 97376, ['webnet', 'text2015']),
 (52384, 45233, ['webnet']),
 (18720, 73120, ['links']),
 (18720, 23297, ['links']),
 (18720, 45233, ['links']),
 (82465,
  43424,
  ['semantics', 'datamodel', 'biological', 'biomedics', 'datamining']),
 (73120, 23297, ['links', 'scientometrics']),
 (73120, 45233, ['links']),
 (22276, 97376, ['webnet', 'text2015']),
 (22276, 45233, ['webnet']),
 (97376, 45233, ['webnet']),
 (23297, 45233, ['links'])]
This builds every pair of entries in your data_dictionary and checks whether the intersection of their values is non-empty. If it is, the pair is appended to the intersections list in the form (key1, key2, intersection).
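The same pairing step can be wrapped in a function for reuse. This is only a sketch: the name shared_pairs and the three-entry sample are hypothetical, and sorted() is used instead of list() so the order of the shared values is stable between runs.

```python
from itertools import combinations

def shared_pairs(data):
    """Return (key1, key2, shared_values) for every pair of entries that overlap."""
    result = []
    for (key1, values1), (key2, values2) in combinations(data.items(), r=2):
        common = set(values1) & set(values2)
        if common:
            # sorted() gives a stable order; list(set) has no guaranteed order
            result.append((key1, key2, sorted(common)))
    return result

sample = {
    52384: ['text2015', 'webnet'],
    18720: ['datascience', 'bigdata', 'links'],
    45233: ['webnet', 'conference', 'links'],
}
print(shared_pairs(sample))
# [(52384, 45233, ['webnet']), (18720, 45233, ['links'])]
```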
I hope this gives you a starting point from which you can finish your task.
Answer 2:
Using the example from vishes_shell's answer above, I managed to get most of the desired output. To add the individual sums to each line, I considered rerunning the extract_sum function, which seems non-optimal, so I left the sums out of the output while I think of a different approach.
from itertools import combinations

# `filename` is assumed to be an already-open, writable file object
for first, second in combinations(data_dictionary.items(), r=2):
    intersection = set(first[1]) & set(second[1])
    if intersection:
        sum1 = extract_sum(first[0], sum_dict)
        sum2 = extract_sum(second[0], sum_dict)
        # Put the ID with the smaller sum first
        if sum1 < sum2:
            early = first[0]
            late = second[0]
        else:
            early = second[0]
            late = first[0]
        filename.write('%d , %d , %s' % (early, late, list(intersection)))
        filename.write('\n')
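If the sums are also wanted on each line, one option is to look them up directly in the dictionary by key rather than looping over its items. The sketch below assumes that approach; write_pairs is a hypothetical name, io.StringIO stands in for an open file, and pairs with a missing sum are skipped, which is an assumption rather than something the question specifies.

```python
import io
from itertools import combinations

def write_pairs(data, sums, out):
    """Write 'ID_1 ; ID_2 ; Sum_1 ; Sum_2 ; [shared]' lines, smaller sum first."""
    for (key1, values1), (key2, values2) in combinations(data.items(), r=2):
        common = sorted(set(values1) & set(values2))
        # Skip pairs with nothing in common or with a missing sum (assumption)
        if not common or key1 not in sums or key2 not in sums:
            continue
        # Order the two IDs by their sums using direct dictionary lookups
        early, late = sorted((key1, key2), key=sums.__getitem__)
        out.write('%d ; %d ; %d ; %d ; %s\n'
                  % (early, late, sums[early], sums[late], common))

data = {
    52384: ['text2015', 'webnet'],
    97376: ['text2015', 'webnet'],
    45233: ['webnet', 'conference', 'links'],
}
sums = {52384: 1444856137000, 97376: 1812856137000, 45233: 1276781300}

buffer = io.StringIO()  # stands in for an open output file
write_pairs(data, sums, buffer)
print(buffer.getvalue())
```

With a real file, the same call would be made inside a `with open(...) as out:` block.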
Source: https://stackoverflow.com/questions/53857382/how-can-i-run-a-set-method-over-lists-in-terms-of-dictionary-keys-values-to-fi