I have some data that looks something like this:
ID1 ID2 ID3
ID1 ID4 ID5
ID3 ID5 ID7 ID6
...
...
where each row is a group.
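For context, the three options the answers below compare amount to roughly the following (a sketch; myset, mylist, and x are illustrative names):

# Option 1: keep a list per ID and check membership before appending
if x not in mylist:
    mylist.append(x)

# Option 2: keep a set per ID and add directly
myset.add(x)

# Option 3: append everything to a list, then convert to a set at the end
mylist.append(x)
# ... after all rows are processed:
myset = set(mylist)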
I agree with the previous analysis that option 2 is best, but a micro-benchmark is often illuminating in these situations:
import time

class Timer(object):
    def __init__(self, desc):
        self.desc = desc
    def __enter__(self):
        self.start = time.time()
    def __exit__(self, exc_type, exc_value, traceback):
        self.finish = time.time()
        print self.desc, 'took', self.finish - self.start

data = list(range(4000000))

with Timer('option 2'):
    # add each element to a set directly
    myset = set()
    for x in data:
        myset.add(x)

with Timer('option 3'):
    # build a list, then convert it to a set at the end
    mylist = list()
    for x in data:
        mylist.append(x)
    myset = set(mylist)
The results were surprising to me:
$ python test.py
option 2 took 0.652163028717
option 3 took 0.748883008957
I would have expected at least a 2x speed difference.
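For what it's worth, the timeit module usually gives steadier numbers than a one-shot wall-clock measurement; a rough sketch of the same comparison (number=10 reports total seconds over ten runs):

import timeit

setup = 'data = list(range(4000000))'
option2 = '''
myset = set()
for x in data:
    myset.add(x)
'''
option3 = '''
mylist = list()
for x in data:
    mylist.append(x)
myset = set(mylist)
'''
print 'option 2', timeit.timeit(option2, setup=setup, number=10)
print 'option 3', timeit.timeit(option3, setup=setup, number=10)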
Option 2 sounds the most logical to me, and with a defaultdict it should be fairly easy to do :)
import pprint
import collections

data = '''ID1 ID2 ID3
ID1 ID4 ID5
ID3 ID5 ID7 ID6'''

groups = collections.defaultdict(set)
for row in data.split('\n'):
    cols = row.split()
    for groupcol in cols:
        for col in cols:
            if col != groupcol:  # compare values; 'is not' relies on object identity
                groups[groupcol].add(col)

pprint.pprint(dict(groups))
Results:
{'ID1': set(['ID2', 'ID3', 'ID4', 'ID5']),
'ID2': set(['ID1', 'ID3']),
'ID3': set(['ID1', 'ID2', 'ID5', 'ID6', 'ID7']),
'ID4': set(['ID1', 'ID5']),
'ID5': set(['ID1', 'ID3', 'ID4', 'ID6', 'ID7']),
'ID6': set(['ID3', 'ID5', 'ID7']),
'ID7': set(['ID3', 'ID5', 'ID6'])}
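If your real data is in a file rather than an inline string, the same loop works line by line (groups.txt is a hypothetical filename):

groups = collections.defaultdict(set)
with open('groups.txt') as f:  # hypothetical filename
    for row in f:
        cols = row.split()
        for groupcol in cols:
            for col in cols:
                if col != groupcol:
                    groups[groupcol].add(col)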
TL;DR: Go with option 2. Just use sets from the start.
In Python, sets are hash sets and lists are dynamic arrays. Inserting is O(1) for both, but checking whether an element exists is O(n) for the list and O(1) for the set.
So option 1 is immediately out. If you are inserting n items and need to check the list every time, the overall complexity becomes O(n^2).
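Concretely, option 1 amounts to something like this sketch (items and seen are illustrative names), and it is the membership test that rescans the list every time:

seen = []
for x in items:        # items: the IDs being collected for one key
    if x not in seen:  # O(n) scan of the list on every iteration
        seen.append(x)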
Options 2 and 3 are both optimal at O(n) overall. Option 2 might be faster in micro-benchmarks because you don't need to move objects between collections. In practice, choose the option that is easier to read and maintain in your specific circumstance.
So, I timed a few different options, and after a few iterations, came up with the following strategies. I thought that sets2 would be the winner, but listToSet2 was faster for every single type of group.
All of the functions except listFilter were in the same ballpark; listFilter was much slower.
import random
import collections

# Test data: varying group size, value range, and number of groups
small = [[random.randint(1, 25) for _ in range(5)] for i in range(100)]
medium = [[random.randint(1, 250) for _ in range(5)] for i in range(1000)]
mediumLotsReps = [[random.randint(1, 25) for _ in range(5)] for i in range(1000)]
bigGroups = [[random.randint(1, 250) for _ in range(75)] for i in range(100)]
huge = [[random.randint(1, 2500) for _ in range(5)] for i in range(10000)]

def sets(groups):
    # Option 2: add to sets directly, skipping the current element
    results = collections.defaultdict(set)
    for group in groups:
        for i in group:
            for j in group:
                if i != j:
                    results[i].add(j)
    return results

def listToSet(groups):
    # Option 3: append everything except the current element, convert at the end
    results = collections.defaultdict(list)
    for group in groups:
        for i, j in enumerate(group):
            results[j] += group[:i] + group[i+1:]
    return {k: set(v) for k, v in results.iteritems()}

def listToSet2(groups):
    # Option 3 variant: append the whole group, subtract the key at the end
    results = collections.defaultdict(list)
    for group in groups:
        for j in group:
            results[j] += group
    return {k: set(v) - set([k]) for k, v in results.iteritems()}

def sets2(groups):
    # Option 2 variant: union the whole group in, subtract the key at the end
    results = collections.defaultdict(set)
    for group in groups:
        for i in group:
            results[i] |= set(group)
    return {k: v - set([k]) for k, v in results.iteritems()}

def listFilter(groups):
    # Option 1: filter against the list built so far -- O(n) membership tests
    results = collections.defaultdict(list)
    for group in groups:
        for i, j in enumerate(group):
            filteredGroup = group[:i] + group[i+1:]
            results[j] += [k for k in filteredGroup if k not in results[j]]
    return results
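The timing harness isn't shown above; a minimal sketch of one, assuming timeit and the datasets defined earlier:

import timeit

datasets = [('small', small), ('medium', medium),
            ('mediumLotsReps', mediumLotsReps),
            ('bigGroups', bigGroups), ('huge', huge)]
for fn in (sets, listToSet, listToSet2, sets2, listFilter):
    for name, dataset in datasets:
        # timeit accepts a callable; the result is total seconds over 10 runs
        t = timeit.timeit(lambda: fn(dataset), number=10)
        print fn.__name__, name, t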