Pythonic way to merge two overlapping lists, preserving order

鱼传尺愫 2021-02-03 22:40

Alright, so I have two lists, as such:

  • They can and will have overlapping items, for example, [1, 2, 3, 4, 5], [4, 5, 6, 7].
  • The
8 answers
  •  再見小時候
    2021-02-03 22:56

    This actually isn't too terribly difficult. After all, essentially all you're doing is checking which slice at the end of a lines up with which slice at the start of b.

    def merge(a, b):
        max_offset = len(b)  # can't overlap with greater size than len(b)
        for i in reversed(range(max_offset+1)):
            # checks for equivalence of decreasing sized slices
            if a[-i:] == b[:i]:
                break
        return a + b[i:]
    

    We can test with your test data by doing:

    test_data = [{'a': [1,3,9,8,3,4,5], 'b': [3,4,5,7,8], 'result': [1,3,9,8,3,4,5,7,8]},
                 {'a': [9, 1, 1, 8, 7], 'b': [8, 6, 7], 'result': [9, 1, 1, 8, 7, 8, 6, 7]}]
    
    all(merge(test['a'], test['b']) == test['result'] for test in test_data)
    

    This tries every slice length that could form an overlap, from longest to shortest, and stops at the first match. If nothing matches, the loop finishes with i equal to 0. Either way, it returns all of a plus b[i:]: in the overlap case that is the non-overlapping portion of b, and in the non-overlap case it is all of b.
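
    To make the slice comparison concrete, here is a small tracing sketch. The helper name overlap_trace is hypothetical, and unlike the function above it caps the overlap size at min(len(a), len(b)) and stops at 1, since larger or empty slices cannot produce a useful match.

```python
def overlap_trace(a, b):
    """Return (size, tail of a, head of b, matched) for each overlap size tried."""
    steps = []
    # Try the largest possible overlap first and shrink by one each time.
    for i in range(min(len(a), len(b)), 0, -1):
        tail, head = a[-i:], b[:i]
        steps.append((i, tail, head, tail == head))
        if tail == head:
            break  # the first match is the longest overlap
    return steps


for step in overlap_trace([1, 3, 9, 8, 3, 4, 5], [3, 4, 5, 7, 8]):
    print(step)
# (5, [9, 8, 3, 4, 5], [3, 4, 5, 7, 8], False)
# (4, [8, 3, 4, 5], [3, 4, 5, 7], False)
# (3, [3, 4, 5], [3, 4, 5], True)
```

    The last step shows the match at size 3, so the merge appends b[3:] to a.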

    Note that we can make a couple of optimizations in corner cases. For instance, the worst case here is that it runs through every slice length without finding any overlap. You could add a quick check at the beginning that short-circuits that worst case:

    def merge(a, b):
        if not a or a[-1] not in b:  # an empty a has no last element to test
            return a + b
        ...
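
    Spelled out in full, the guarded version might look like the sketch below. It assumes the elided body is the same slice comparison as in the first function; the name merge_with_guard is hypothetical.

```python
def merge_with_guard(a, b):
    # Quick exit: if either list is empty, or the last element of a never
    # occurs in b, no overlap is possible.
    if not a or not b or a[-1] not in b:
        return a + b
    # Otherwise compare slices, largest possible overlap first.
    for i in range(min(len(a), len(b)), 0, -1):
        if a[-i:] == b[:i]:
            return a + b[i:]
    return a + b  # a[-1] occurs in b, but no prefix of b lines up


print(merge_with_guard([1, 3, 9, 8, 3, 4, 5], [3, 4, 5, 7, 8]))
```

    The membership test is itself a linear scan of b, so the guard only saves the slice comparisons, not all of the work.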
    

    In fact, you could take that idea one step further and probably make your algorithm much faster:

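    def merge(a, b):
        if not a:  # nothing in a to overlap with
            return a + b
        start = 0
        while True:
            try:
                idx = b.index(a[-1], start) + 1  # next occurrence of a[-1] in b
            except ValueError:  # no further occurrence of a[-1] in b
                return a + b
            if a[-idx:] == b[:idx]:
                return a + b[idx:]  # append only the part of b after the overlap
            start = idx  # no overlap of this length; keep searching to the right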
    

    However this might not find the longest overlap in cases like:

    a = [1,2,3,4,1,2,3,4]
    b = [3,4,1,2,3,4,5,6]
    # result should be [1,2,3,4,1,2,3,4,5,6], but
    # this algo produces [1,2,3,4,1,2,3,4,1,2,3,4,5,6]
    

    You could fix that by searching b from the right instead of from the left (the equivalent of str.rindex, which lists do not provide), so that the longest slice is matched first instead of the shortest. I'm not sure what that does to your speed. It's probably slower, but the difference might be inconsequential. You could also collect every candidate result and return the shortest one, which might be a better idea.

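    def merge(a, b):
        results = [a + b]  # fallback when nothing overlaps
        start = 0
        while a:  # an empty a cannot overlap anything
            try:
                idx = b.index(a[-1], start) + 1  # next occurrence of a[-1] in b
            except ValueError:  # no further occurrence of a[-1] in b
                break
            if a[-idx:] == b[:idx]:
                results.append(a + b[idx:])  # record this candidate merge
            start = idx  # continue searching to the right
        return min(results, key=len)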
    

    This should work, since merging on the longest overlap always produces the shortest result.
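
    For the right-to-left search mentioned earlier, Python lists have no rindex method, so the scan has to be written by hand. The sketch below is one way to do it, together with a randomized comparison against the plain slice loop. The names merge_slices, merge_longest_first and agree_on_random_inputs are hypothetical, and the comparison is only a spot check, not a proof.

```python
import random


def merge_slices(a, b):
    """Reference version: try every overlap size, largest first."""
    for i in range(min(len(a), len(b)), 0, -1):
        if a[-i:] == b[:i]:
            return a + b[i:]
    return a + b


def merge_longest_first(a, b):
    """Check occurrences of a[-1] in b from right to left."""
    if not a:
        return a + b
    # pos is a candidate index in b for the last element of a; starting from
    # the right means the first match found is the longest overlap.
    for pos in range(min(len(a), len(b)) - 1, -1, -1):
        if b[pos] == a[-1] and a[-(pos + 1):] == b[:pos + 1]:
            return a + b[pos + 1:]
    return a + b


def agree_on_random_inputs(trials=2000, seed=0):
    """Compare both versions on short random lists over a tiny alphabet."""
    rng = random.Random(seed)
    for _ in range(trials):
        a = [rng.randint(0, 2) for _ in range(rng.randint(0, 8))]
        b = [rng.randint(0, 2) for _ in range(rng.randint(0, 8))]
        if merge_slices(a, b) != merge_longest_first(a, b):
            return False
    return True


print(merge_longest_first([1, 2, 3, 4, 1, 2, 3, 4], [3, 4, 1, 2, 3, 4, 5, 6]))
# [1, 2, 3, 4, 1, 2, 3, 4, 5, 6]
print(agree_on_random_inputs())
```

    The small alphabet is deliberate: with only three distinct values, repeated elements and multiple candidate overlaps come up often, which is where the leftmost-match version goes wrong.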
