问题
I have a dictionary with strings as keys and sets as values. These sets contain integers, which may be common for different items of the dictionary. Indeed, my goal is to compare each item against the rest and find those common numbers in the set values. Once located a pair of items that at least share a number in their value sets (lets say, (key1, val1)
and (key2, val2)
), I want to add an extra item to the dictionary whose key is the concatenation of both keys key1
and key2
(I DO NOT CARE ABOUT THE ORDER), and whose value is a set with the common numbers in val1
and val2
. Additionally, the common numbers in val1
and val2
should be removed from that item, and in the eventual case those sets are left empty, that item should be removed.
Long description, example needed:
I have a dictionary like this one:
db = {"a": {1}, "b":{1}, "c":{2,3,4}, "d":{2,3}}
I am looking for a procedure that returns something like this:
{"ab": {1}, "c":{4}, "cd": {2,3}}
I actually have a solution, but I suspect is quite inefficient, especially because I start creating two lists for the keys and the values of the dictionary, in order to be able to do a nested loop. That is,
db = {"a":{1}, "b":{1}, "c":{2,3,4}, "d":{2,3}}
keys = list(db.keys())
values = list(db.values())
blackList = set() #I plan to add here the indexes to be removed
for i, key in enumerate(keys):
if i in blackList: continue
keysToAdd = []
valuesToAdd = []
for j in range(i+1,len(keys)):
if j in blackList: continue
intersect = values[i].intersection(values[j])
if len(intersect)==0: continue
keysToAdd.append(key+keys[j]) #I don't care about the order
valuesToAdd.append(intersect)
values[i] -= intersect
values[j] -= intersect
if len(values[j])==0:
blackList.add(j)
if len(values[i])==0:
blackList.add(i)
break
#out of j loop
keys.extend(keysToAdd)
values.extend(valuesToAdd)
#finally delete elements in the blackList
for i in sorted(blackList, reverse=True):
del keys[i]
del values[i]
print(dict(zip(keys, values))) #{"ab": {1}, "c":{4}, "cd":{2,3}} #I don't care about the order
I would like to know if my approach can be improved, in order to have a more efficient solution (because I plan to have big and complex input dicts). Is it actually a way of using itertools.combinations
? - Or perhaps a completely different approach to solve this problem. Thanks in advance
回答1:
I think your approach is not overly inefficient, but there is at least one slightly faster way to achieve this. I think this is a good job for defaultdict.
from collections import defaultdict
def deduplicate(db: dict) -> dict:
# Collect all keys which each number appears in by inverting the mapping:
inverse = defaultdict(set) # number -> keys
for key, value in db.items():
for number in value:
inverse[number].add(key)
# Invert the mapping again - this time joining the keys:
result = defaultdict(set) # joined key -> numbers
for number, keys in inverse.items():
new_key = "".join(sorted(keys))
result[new_key].add(number)
return dict(result)
When comparing using your example - this is roughly 6 times faster. I don't doubt there are faster ways to achieve it!
Using itertools.combinations
would be a reasonable approach if for example you knew that each number could show up in at most 2 keys. As this can presumably be larger - you would need to iterate combinations of varying sizes, ending up with more iterations overall.
Note 1: I have sorted the keys before joining into a new key. You mention that you don't mind about the order - but in this solution its important that keys are consistent, to avoid both "abc"
and "bac"
existing with different numbers. The consistent order will also make it easier for you to test.
Note 2: the step where you remove values from the blacklist has a side effect on the input db
- and so running your code twice with the same input db
would produce different results. You could avoid that in your original implementation by performing a deepcopy
:
values = deepcopy(list(db.values())
回答2:
first create an inverted dictionary where each value maps to the keys that contain it. Then, for each value, use combinations to create key pairs that contain it and build the resulting dictionary from those:
from itertools import combinations
db = {"a":{1}, "b":{1}, "c":{2,3,4}, "d":{2,3}}
matches = dict() # { value:set of keys }
for k,vSet in db.items():
for v in vSet: matches.setdefault(v,[]).append(k)
result = dict() # { keypair:set of values }
for v,keys in matches.items():
if len(keys)<2:
result.setdefault(keys[0],set()).add(v)
continue
for a,b in combinations(keys,2):
result.setdefault(a+b,set()).add(v)
print(result) # {'ab': {1}, 'cd': {2, 3}, 'c': {4}}
来源:https://stackoverflow.com/questions/65664562/efficient-way-to-modify-a-dictionary-while-comparing-its-items