I have some data that looks something like this:
ID1 ID2 ID3
ID1 ID4 ID5
ID3 ID5 ID7 ID6
...
...
where each row is a group.
I agree with the previous analysis that option B is best, but a micro-benchmark is often illuminating in these situations:
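import time

class Timer(object):
    """Context manager that prints the wall-clock time spent in its block."""
    def __init__(self, desc):
        self.desc = desc

    def __enter__(self):
        self.start = time.time()

    def __exit__(self, exc_type, exc_value, traceback):
        self.finish = time.time()
        # %-formatting prints the same text on Python 2 and Python 3
        print('%s took %s' % (self.desc, self.finish - self.start))

data = list(range(4000000))

# option 2: add each element directly to a set
with Timer('option 2'):
    myset = set()
    for x in data:
        myset.add(x)

# option 3: append to a list first, then convert the list to a set
with Timer('option 3'):
    mylist = list()
    for x in data:
        mylist.append(x)
    myset = set(mylist)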
The results were surprising to me:
$ python test.py
option 2 took 0.652163028717
option 3 took 0.748883008957
I would have expected at least a 2x speed difference.
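A single timed run like this can be noisy, so it may be worth repeating the measurement before reading much into the gap. Here is a minimal sketch using the standard-library timeit module. The function names (build_set_directly, build_set_via_list, best_time), the smaller input size, and the repeat count are my own choices for illustration and are not part of the test above.

```python
import timeit

# Smaller than the 4,000,000 elements used above so repeated runs stay quick
# (an assumption for illustration; raise it to match the original test).
data = list(range(200000))

def build_set_directly():
    """Option 2: add each element straight into a set."""
    myset = set()
    for x in data:
        myset.add(x)
    return myset

def build_set_via_list():
    """Option 3: append to a list first, then convert the list to a set."""
    mylist = []
    for x in data:
        mylist.append(x)
    return set(mylist)

def best_time(func, repeat=5):
    """Return the fastest of `repeat` single runs of func, in seconds."""
    return min(timeit.repeat(func, repeat=repeat, number=1))

if __name__ == '__main__':
    for name, func in [('option 2', build_set_directly),
                       ('option 3', build_set_via_list)]:
        print('%s took %s' % (name, best_time(func)))
```

Taking the minimum over several runs is a common way to reduce the effect of other activity on the machine. The absolute numbers will still depend on the interpreter version and hardware, so treat them as a relative comparison only.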