Better to add item to a set, or convert final list to a set?

后端 未结 4 2089
北海茫月
北海茫月 2021-02-20 16:47

I have some data that looks something like this:

ID1 ID2 ID3  
ID1 ID4 ID5  
ID3 ID5 ID7 ID6  
...  
...  

where each row is a group.

M

相关标签:
4条回答
  • 2021-02-20 17:11

    i agree with the previous analysis that option B is best, but a micro benchmark is often illuminating in these situations:

    import time
    
    class Timer(object):
      def __init__(self, desc):
        self.desc = desc
      def __enter__(self):
        self.start = time.time()
      def __exit__(self, type, value, traceback):
        self.finish = time.time()
        print self.desc, 'took', self.finish - self.start
    
    data = list(range(4000000))
    
    with Timer('option 2'):
      myset = set()
      for x in data: myset.add(x)
    
    with Timer('option 3'):
      mylist = list()
      for x in data: mylist.append(x)
      myset = set(mylist)
    

    the results were surprising to me:

    $ python test.py 
    option 2 took 0.652163028717
    option 3 took 0.748883008957
    

    i would have expected at least a 2x speed difference.

    0 讨论(0)
  • 2021-02-20 17:24

    Option 2 sounds the most logical to me, especially with a defaultdict it should be fairly easy to do :)

    import pprint
    import collections
    
    data = '''ID1 ID2 ID3
    ID1 ID4 ID5
    ID3 ID5 ID7 ID6'''
    
    groups = collections.defaultdict(set)
    
    for row in data.split('\n'):
        cols = row.split()
        for groupcol in cols:
            for col in cols:
                if col is not groupcol:
                    groups[groupcol].add(col)
    
    pprint.pprint(dict(groups))
    

    Results:

    {'ID1': set(['ID2', 'ID3', 'ID4', 'ID5']),
     'ID2': set(['ID1', 'ID3']),
     'ID3': set(['ID1', 'ID2', 'ID5', 'ID6', 'ID7']),
     'ID4': set(['ID1', 'ID5']),
     'ID5': set(['ID1', 'ID3', 'ID4', 'ID6', 'ID7']),
     'ID6': set(['ID3', 'ID5', 'ID7']),
     'ID7': set(['ID3', 'ID5', 'ID6'])}
    
    0 讨论(0)
  • 2021-02-20 17:29

    TL;DR: Go with option 2. Just use sets from the start.

    In Python, sets are hash-sets, and lists are dynamic arrays. Inserting is O(1) for both, but checking if an element exists is O(n) for the list and O(1) for the set.

    So option 1 is immediately out. If you are inserting n items and need to check the list every time, then the overall complexity becomes O(n^2).

    Options 2 and 3 are both optimal at O(n) overall. Option 2 might be faster in micro-benchnarks because you don't need to move objects between collections. In practice, choose the option that is easier to read and maintain in your specific circumstance.

    0 讨论(0)
  • 2021-02-20 17:35

    So, I timed a few different options, and after a few iterations, came up with the following strategies. I thought that sets2 would be the winner, but listToSet2 was faster for every single type of group.

    All of the functions except for listFilter were in the same ballpark - listFilter was much slower.

    import random
    import collections
    
    small = [[random.randint(1,25) for _ in range(5)] for i in range(100)]
    medium = [[random.randint(1,250) for _ in range(5)] for i in range(1000)]
    mediumLotsReps = [[random.randint(1,25) for _ in range(5)] for i in range(1000)]
    bigGroups = [[random.randint(1,250) for _ in range(75)] for i in range(100)]
    huge = [[random.randint(1,2500) for _ in range(5)] for i in range(10000)]
    
    def sets(groups):
        results = collections.defaultdict(set)
        for group in groups:
            for i in group:
                for j in group:
                    if i is not j:
                        results[i].add(j)
        return results
    
    def listToSet(groups):
        results = collections.defaultdict(list)
        for group in groups:
            for i,j in enumerate(group):
                results[j] += group[:i] + group[i:]
        return {k:set(v) for k, v in results.iteritems()}
    
    def listToSet2(groups):
        results = collections.defaultdict(list)
        for group in groups:
            for i,j in enumerate(group):
                results[j] += group
        return {k:set(v)-set([k]) for k, v in results.iteritems()}
    
    def sets2(groups):
        results = collections.defaultdict(set)
        for group in groups:
            for i in group:
                results[i] |= set(group)
        return {k:v - set([k]) for k, v in results.iteritems()}
    
    def listFilter(groups):
        results = collections.defaultdict(list)
        for group in groups:
            for i,j in enumerate(group):
                filteredGroup = group[:i] + group[i:]
                results[j] += ([k for k in filteredGroup if k not in results[j]])
        return results
    
    0 讨论(0)
提交回复
热议问题