Python - Removing overlapping lists

后端 未结 5 1043
旧时难觅i
旧时难觅i 2020-12-03 14:54

Say I have a list of lists that has indexes [[start, end], [start1, end1], [start2, end2]].

Like for example :

[[0, 133], [78, 100], [25,

相关标签:
5条回答
  • 2020-12-03 14:57

    In general, two intervals are overlapping if:

    min([upperBoundOfA, upperBoundOfB]) >= max([lowerBoundOfA, lowerBoundOfB])
    

    If this is the case, the union of those intervals is:

    (min([lowerBoundOfA, lowerBoundOfB]), max([upperBoundOfA, upperBoundOfB])
    

    Similarly, the intersection of those intervals will be:

    (min([upperBoundOfA, upperBoundOfB]), max([lowerBoundOfA, lowerBoundOfB]))
    
    0 讨论(0)
  • 2020-12-03 14:58

    I think one problem in your code is that it does not handle with the situation where one list contains another. For example, [0,100] contains [9,10]. When you loop n in [0,100] and n enters [9,10], the condition statement if n in range(x[i+1][0], x[i+1][1]) is triggered. Then the builtin function max will compare [0, 100] and [9, 10], and unfortuantely max will return [9,10] because it compare the first number in the list. Thus you remove the wrong element.

    I'm trying another way to achieve the effect you want. Rather than manipulating the list itself, I create a new list. Conditionally appending new element to it if it meets our requirements.

    def cleanup_list(lists):
        ranges = []
        for l in lists:
            to_insert = True
            for i in ranges:
                r = range(i[0],i[1])
                # if l overlaps with i, but l does not contain i
                if l[0] in r or l[1] in r:
                    if (l[1]-l[0]) < len(r):
                        ranges.remove(i)
                    else:
                        to_insert = False
                # l contains i
                if l[0]<i[0] and l[1]>i[1]:
                    to_insert = False
            if to_insert:
                ranges.append(l)
        return ranges
    
    0 讨论(0)
  • 2020-12-03 15:07

    This may not be the fastest solution, but really verbose and clear - I think.

    a = [[2,100], [4,10], [77,99], [38,39], [44,80], [69,70], [88, 90]]
    
    # build ranges first
    def expand(list):
        newList = []
        for r in list:
            newList.append(range(r[0], r[1] + 1))
        return newList
    
    
    def compare(list):
        toBeDeleted = []
        for index1 in range(len(list)):
            for index2 in range(len(list)):
                if index1 == index2:
                    # we dont want to compare ourselfs
                    continue
                matches = [x for x in list[index1] if x in list[index2]]
                if len(matches) != 0: # do we have overlap?
                    ## compare lengths and get rid of the longer one
                    if   len(list[index1]) > len(list[index2]):
                        toBeDeleted.append(index1)
                        break
                    elif len(list[index1]) < len(list[index2]):
                        toBeDeleted.append(index2)
        # distinct
        toBeDeleted = [ toBeDeleted[i] for i,x in enumerate(toBeDeleted) if x not in toBeDeleted[i+1:]] 
        print len(list)
        # remove items
        for i in toBeDeleted[::-1]:
            del list[i] 
        return list
    
    
    print(compare(expand(a)))
    
    0 讨论(0)
  • 2020-12-03 15:17
    1. ascending sort all items by length.

    2. add them to a segment tree and ignore overlapped items.

    0 讨论(0)
  • 2020-12-03 15:21

    To remove a minimal number of intervals from the list such that the intervals that are left do not overlap, O(n*log n) algorithm exists:

    def maximize_nonoverlapping_count(intervals):
        # sort by the end-point
        L = sorted(intervals, key=lambda (start, end): (end, (end - start)),
                   reverse=True) # O(n*logn)
        iv = build_interval_tree(intervals) # O(n*log n)
        result = []
        while L: # until there are intervals left to consider
            # pop the interval with the smallest end-point, keep it in the result
            result.append(L.pop()) # O(1)
            # remove intervals that overlap with the popped interval
            overlapping_intervals = iv.pop(result[-1]) # O(log n + m)
            remove(overlapping_intervals, from_=L) 
        return result
    

    It should produce the following results:

    f = maximize_nonoverlapping_count
    assert f([[0, 133], [78, 100], [25, 30]]) == [[25, 30], [78, 100]]
    assert f([[0,100],[9,10],[12,90]]) == [[9,10], [12, 90]]
    assert f([[0, 100], [4, 20], [30, 35], [30, 78]]) == [[4, 20], [30, 35]]
    assert f([[30, 70], [25, 40]]) == [[25, 40]]
    

    It requires the data structure that can find in O(log n + m) time all intervals that overlap with the given interval e.g., IntervalTree. There are implementations that can be used from Python e.g., quicksect.py, see Fast interval intersection methodologies for the example usage.


    Here's a quicksect-based O(n**2) implementation of the above algorithm:

    from quicksect import IntervalNode
    
    class Interval(object):
        def __init__(self, start, end):
            self.start = start
            self.end = end
            self.removed = False
    
    def maximize_nonoverlapping_count(intervals):
        intervals = [Interval(start, end) for start, end in intervals]
        # sort by the end-point
        intervals.sort(key=lambda x: (x.end, (x.end - x.start)))   # O(n*log n)
        tree = build_interval_tree(intervals) # O(n*log n)
        result = []
        for smallest in intervals: # O(n) (without the loop body)
            # pop the interval with the smallest end-point, keep it in the result
            if smallest.removed:
                continue # skip removed nodes
            smallest.removed = True
            result.append([smallest.start, smallest.end]) # O(1)
    
            # remove (mark) intervals that overlap with the popped interval
            tree.intersect(smallest.start, smallest.end, # O(log n + m)
                           lambda x: setattr(x.other, 'removed', True))
        return result
    
    def build_interval_tree(intervals):
        root = IntervalNode(intervals[0].start, intervals[0].end,
                            other=intervals[0])
        return reduce(lambda tree, x: tree.insert(x.start, x.end, other=x),
                      intervals[1:], root)
    

    Note: the time complexity in the worst case is O(n**2) for this implementation because the intervals are only marked as removed e.g., imagine such input intervals that len(result) == len(intervals) / 3 and there were len(intervals) / 2 intervals that span the whole range then tree.intersect() would be called n/3 times and each call would execute x.other.removed = True at least n/2 times i.e., n*n/6 operations in total:

    n = 6
    intervals = [[0, 100], [0, 100], [0, 100], [0, 10], [10, 20], [15, 40]])
    result = [[0, 10], [10, 20]]
    

    Here's a banyan-based O(n log n) implementation:

    from banyan import SortedSet, OverlappingIntervalsUpdator # pip install banyan
    
    def maximize_nonoverlapping_count(intervals):
        # sort by the end-point O(n log n)
        sorted_intervals = SortedSet(intervals,
                                     key=lambda (start, end): (end, (end - start)))
        # build "interval" tree O(n log n)
        tree = SortedSet(intervals, updator=OverlappingIntervalsUpdator)
        result = []
        while sorted_intervals: # until there are intervals left to consider
            # pop the interval with the smallest end-point, keep it in the result
            result.append(sorted_intervals.pop()) # O(log n)
    
            # remove intervals that overlap with the popped interval
            overlapping_intervals = tree.overlap(result[-1]) # O(m log n)
            tree -= overlapping_intervals # O(m log n)
            sorted_intervals -= overlapping_intervals # O(m log n)
        return result
    

    Note: this implementation considers [0, 10] and [10, 20] intervals to be overlapping:

    f = maximize_nonoverlapping_count
    assert f([[0, 100], [0, 10], [11, 20], [15, 40]]) == [[0, 10] ,[11, 20]]
    assert f([[0, 100], [0, 10], [10, 20], [15, 40]]) == [[0, 10] ,[15, 40]]
    

    sorted_intervals and tree can be merged:

    from banyan import SortedSet, OverlappingIntervalsUpdator # pip install banyan
    
    def maximize_nonoverlapping_count(intervals):
        # build "interval" tree sorted by the end-point O(n log n)
        tree = SortedSet(intervals, key=lambda (start, end): (end, (end - start)),
                         updator=OverlappingIntervalsUpdator)
        result = []
        while tree: # until there are intervals left to consider
            # pop the interval with the smallest end-point, keep it in the result
            result.append(tree.pop()) # O(log n)
    
            # remove intervals that overlap with the popped interval
            overlapping_intervals = tree.overlap(result[-1]) # O(m log n)
            tree -= overlapping_intervals # O(m log n)
        return result
    
    0 讨论(0)
提交回复
热议问题