Better to add item to a set, or convert final list to a set?

I have some data that looks something like this:

ID1 ID2 ID3  
ID1 ID4 ID5  
ID3 ID5 ID7 ID6  
...  
...

where each row is a group.

4 answers

一向

2021-02-20 17:11

I agree with the previous analysis that option 2 is best, but a micro-benchmark is often illuminating in these situations:

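import time

class Timer(object):
  """Context manager that prints how long its block took."""
  def __init__(self, desc):
    self.desc = desc
  def __enter__(self):
    self.start = time.perf_counter()
    return self
  def __exit__(self, exc_type, exc_value, traceback):
    self.finish = time.perf_counter()
    print('{} took {}'.format(self.desc, self.finish - self.start))

data = list(range(4000000))

# Option 2: add each item straight to a set.
with Timer('option 2'):
  myset = set()
  for x in data:
    myset.add(x)

# Option 3: append to a list, then convert the finished list to a set.
with Timer('option 3'):
  mylist = list()
  for x in data:
    mylist.append(x)
  myset = set(mylist)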

The results surprised me:

$ python test.py 
option 2 took 0.652163028717
option 3 took 0.748883008957

I would have expected at least a 2x speed difference.
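
Single wall-clock runs like the one above can be noisy. As a rough sketch, the same comparison can be repeated with the standard library's `timeit` module, keeping the best of several runs. The function names `build_set_directly`, `build_list_then_set` and `best_time`, and the smaller input size, are assumptions made for illustration and are not part of the original test.

```python
import timeit

# Smaller than the original 4,000,000 items so the sketch finishes quickly.
data = list(range(100000))

def build_set_directly(items):
    """Option 2: add every item to a set as it arrives."""
    result = set()
    for x in items:
        result.add(x)
    return result

def build_list_then_set(items):
    """Option 3: append every item to a list, then convert once at the end."""
    collected = []
    for x in items:
        collected.append(x)
    return set(collected)

def best_time(func, items, repeat=3, number=3):
    """Return the fastest per-call time in seconds over several repeats."""
    timer = timeit.Timer(lambda: func(items))
    return min(timer.repeat(repeat=repeat, number=number)) / number

for func in (build_set_directly, build_list_then_set):
    print(func.__name__, 'took', best_time(func, data))
```

Taking the minimum over repeats filters out most interference from other processes; the absolute numbers will still vary by machine and Python version.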

庸人自扰

2021-02-20 17:24

Option 2 sounds the most logical to me, and with a defaultdict it should be fairly easy to do :)

import pprint
import collections

data = '''ID1 ID2 ID3
ID1 ID4 ID5
ID3 ID5 ID7 ID6'''

groups = collections.defaultdict(set)

for row in data.splitlines():
    cols = row.split()
    for groupcol in cols:
        for col in cols:
            # Compare by value; "is not" tests object identity, which is unreliable for strings.
            if col != groupcol:
                groups[groupcol].add(col)

pprint.pprint(dict(groups))

Results:

{'ID1': set(['ID2', 'ID3', 'ID4', 'ID5']),
 'ID2': set(['ID1', 'ID3']),
 'ID3': set(['ID1', 'ID2', 'ID5', 'ID6', 'ID7']),
 'ID4': set(['ID1', 'ID5']),
 'ID5': set(['ID1', 'ID3', 'ID4', 'ID6', 'ID7']),
 'ID6': set(['ID3', 'ID5', 'ID7']),
 'ID7': set(['ID3', 'ID5', 'ID6'])}

忘掉有多难

2021-02-20 17:29

TL;DR: Go with option 2. Just use sets from the start.

In Python, sets are hash sets and lists are dynamic arrays. Appending to a list and adding to a set are both O(1) on average, but checking whether an element exists is O(n) for a list and O(1) on average for a set.

So option 1 is immediately out. If you are inserting n items and need to check the list every time, then the overall complexity becomes O(n^2).

Options 2 and 3 are both optimal at O(n) overall. Option 2 might be faster in micro-benchmarks because you don't need to move objects between collections. In practice, choose the option that is easier to read and maintain in your specific circumstance.
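
To make the complexity argument concrete, here is a minimal sketch of the three approaches as this answer describes them. The function names and the sample input are hypothetical, and option 1 is assumed to mean "append to a list only if the item is not already in it".

```python
def option1_checked_list(items):
    """Assumed option 1: keep a list and check membership before every append."""
    unique = []
    for x in items:
        if x not in unique:  # linear scan on every item, so O(n^2) overall
            unique.append(x)
    return unique

def option2_set(items):
    """Option 2: use a set from the start."""
    unique = set()
    for x in items:
        unique.add(x)  # hash lookup and insert, O(1) on average
    return unique

def option3_list_then_set(items):
    """Option 3: collect everything in a list, deduplicate once at the end."""
    collected = []
    for x in items:
        collected.append(x)  # amortized O(1); duplicates are kept for now
    return set(collected)  # a single O(n) pass removes the duplicates

ids = ['ID1', 'ID2', 'ID3', 'ID1', 'ID4', 'ID5', 'ID3']
print(option1_checked_list(ids))
print(sorted(option2_set(ids)))
print(sorted(option3_list_then_set(ids)))
```

All three produce the same unique IDs. Only the list in option 1 preserves first-seen order, and it pays for that with quadratic time.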

情话喂你

2021-02-20 17:35

So, I timed a few different options, and after a few iterations, came up with the following strategies. I thought that sets2 would be the winner, but listToSet2 was faster for every single type of group.

All of the functions except for listFilter were in the same ballpark - listFilter was much slower.

import random
import collections

small = [[random.randint(1,25) for _ in range(5)] for i in range(100)]
medium = [[random.randint(1,250) for _ in range(5)] for i in range(1000)]
mediumLotsReps = [[random.randint(1,25) for _ in range(5)] for i in range(1000)]
bigGroups = [[random.randint(1,250) for _ in range(75)] for i in range(100)]
huge = [[random.randint(1,2500) for _ in range(5)] for i in range(10000)]

def sets(groups):
    results = collections.defaultdict(set)
    for group in groups:
        for i in group:
            for j in group:
                # Compare by value; "is not" only works by accident for small cached ints.
                if i != j:
                    results[i].add(j)
    return results

def listToSet(groups):
    results = collections.defaultdict(list)
    for group in groups:
        for i,j in enumerate(group):
            # Everyone in the group except position i (the original slice kept i as well).
            results[j] += group[:i] + group[i+1:]
    return {k:set(v) for k, v in results.items()}

def listToSet2(groups):
    results = collections.defaultdict(list)
    for group in groups:
        for j in group:
            # Append the whole group; the member itself is removed at the end.
            results[j] += group
    return {k:set(v)-{k} for k, v in results.items()}

def sets2(groups):
    results = collections.defaultdict(set)
    for group in groups:
        for i in group:
            results[i] |= set(group)
    # Remove each key from its own set of neighbours.
    return {k:v - {k} for k, v in results.items()}

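def listFilter(groups):
    results = collections.defaultdict(list)
    for group in groups:
        for i,j in enumerate(group):
            # Everyone in the group except position i.
            filteredGroup = group[:i] + group[i+1:]
            # "not in" rescans the list for every candidate, which is why this one is slow.
            results[j] += ([k for k in filteredGroup if k not in results[j]])
    return results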
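
The timing harness itself is not shown in this answer. The following is only a sketch of how such a comparison could be run with `timeit`, checking first that the strategies agree. `neighbours_via_sets`, `neighbours_via_lists` and `compare` are hypothetical names; the two strategies are stand-ins written in the spirit of sets2 and listToSet2, not the author's exact code.

```python
import collections
import random
import timeit

def neighbours_via_sets(groups):
    """Stand-in strategy: union each group into every member's set."""
    results = collections.defaultdict(set)
    for group in groups:
        members = set(group)
        for member in group:
            results[member] |= members
    return {k: v - {k} for k, v in results.items()}

def neighbours_via_lists(groups):
    """Stand-in strategy: extend lists, then deduplicate once at the end."""
    results = collections.defaultdict(list)
    for group in groups:
        for member in group:
            results[member] += group
    return {k: set(v) - {k} for k, v in results.items()}

def compare(strategies, groups, repeat=3, number=5):
    """Check that all strategies agree, then return {name: best seconds per call}."""
    expected = strategies[0](groups)
    timings = {}
    for strategy in strategies:
        # A faster function is only interesting if it gives the same answer.
        assert strategy(groups) == expected, strategy.__name__
        runs = timeit.repeat(lambda: strategy(groups), repeat=repeat, number=number)
        timings[strategy.__name__] = min(runs) / number
    return timings

random.seed(0)  # fixed seed so every run sees the same input
sample = [[random.randint(1, 250) for _ in range(5)] for _ in range(1000)]
results = compare([neighbours_via_sets, neighbours_via_lists], sample)
for name, seconds in results.items():
    print(name, seconds)
```

Running the equality check before timing guards against comparing functions that return different results, which would make the timings meaningless.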
