Python: Fast extraction of intersections among all possible 2-combinations in a large number of lists

星月不相逢 2021-02-06 17:10

I have a dataset of ca. 9K lists of variable length (1 to 100K elements). I need to calculate the length of the intersection for every possible pair of lists.

3 answers
  •  太阳男子
    2021-02-06 17:57

    As you need to produce one result per pair of lists, i.e., N*(N-1)/2 results and hence O(N squared) outputs, no approach can be less than O(N squared) -- in any language, of course. (N is "about 9K" in your question, so about 40.5 million pairs.) So, I see nothing intrinsically faster than (a) making the N sets you need, and (b) iterating over them to produce the output -- i.e., the simplest approach. IOW:

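    def lotsofintersections(manylists):
      # build each set once, up front
      manysets = [set(x) for x in manylists]
      # shallow copy that shrinks from the end as we go
      moresets = list(manysets)
      for s in reversed(manysets):
        # drop s itself, leaving only the sets that precede it
        moresets.pop()
        for z in moresets:
          yield s & z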
    

    This code already attempts some minor optimization (e.g., by avoiding slicing or popping off the front of lists, which might add other O(N squared) factors).
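
    Since the question asks for the length of each intersection rather than the intersection itself, the generator can simply be wrapped in `len`. A minimal sketch, assuming the function above; the three-list `sample` is made-up data for illustration only:

```python
def lotsofintersections(manylists):
    # Same generator as above, repeated so this snippet runs on its own.
    manysets = [set(x) for x in manylists]
    moresets = list(manysets)
    for s in reversed(manysets):
        moresets.pop()
        for z in moresets:
            yield s & z

# Hypothetical toy input, not the OP's data.
sample = [[1, 2, 3], [2, 3, 4], [3, 4, 5]]

# One length per pair, in the order the generator produces them
# (last list against all earlier ones first).
lengths = [len(common) for common in lotsofintersections(sample)]
print(lengths)
```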

    If you have many cores and/or nodes available and are looking for parallel algorithms, it's a different case of course -- if that's your case, can you mention the kind of cluster you have, its size, how nodes and cores can best communicate, and so forth?

    Edit: as the OP has casually mentioned in a comment (!) that they actually need the indices of the sets being intersected (really, why omit such crucial parts of the specs?! at least edit the question to clarify them...), this would only require changing the loop to:

      L = len(manysets)
      for i, s in enumerate(reversed(manysets)):
        moresets.pop()
        for j, z in enumerate(moresets):
          yield L - i, j + 1, s & z
    

    (if you need to "count from 1" for the progressive identifiers -- otherwise obvious change).

    But if that's part of the specs you might as well use simpler code -- forget moresets, and:

      L = len(manysets)
      for i in range(L):
        s = manysets[i]
        for j in range(i+1, L):
          yield i, j, s & manysets[j]
    

    this time assuming you want to "count from 0" instead, just for variety;-)
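
    For reference, the same index-carrying loop can be written with `itertools.combinations`. This is a sketch of an equivalent formulation, not a faster algorithm; the function name `pairwise_intersection_sizes` is made up here, and it yields the length directly because that is what the question asks for:

```python
from itertools import combinations

def pairwise_intersection_sizes(manylists):
    # Hypothetical helper name; builds the sets once, then visits each
    # unordered pair exactly once, counting from 0.
    manysets = [set(x) for x in manylists]
    for (i, s), (j, z) in combinations(enumerate(manysets), 2):
        yield i, j, len(s & z)

# Made-up toy input for illustration.
result = list(pairwise_intersection_sizes([[1, 2, 3], [2, 3, 4], [3, 4, 5]]))
print(result)
```

    The work is still O(N squared) set intersections, so for ~9K lists the running time is dominated by the `s & z` operations themselves.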
