Pythonic way to merge two overlapping lists, preserving order

鱼传尺愫 2021-02-03 22:40

Alright, so I have two lists, as such:

  • They can and will have overlapping items, for example, [1, 2, 3, 4, 5], [4, 5, 6, 7].
  • The
8 Answers
  • 2021-02-03 22:49

    The other solutions all use a for or while loop to do the merge. I first tried the solutions by @JuniorCompressor and @TankorSmash, but they are far too slow for merging two large lists (for example, lists with millions of elements).

    I found that concatenating large lists with pandas is much more time-efficient:

    import numpy as np
    import pandas as pd
    
    # Two example frames: 80 "train" ids and 20 "test" ids drawn from range(800)
    trainCompIdMaps = pd.DataFrame({
        "compoundId": np.random.permutation(range(800))[0:80],
        "partition": np.repeat("train", 80).tolist(),
    })
    testCompIdMaps = pd.DataFrame({
        "compoundId": np.random.permutation(range(800))[0:20],
        "partition": np.repeat("test", 20).tolist(),
    })
    
    # Row-wise concatenation of the two DataFrames (every row is kept)
    compoundIdMaps = pd.concat([trainCompIdMaps, testCompIdMaps], axis=0)
    
    # Pull the merged id column back out as a NumPy array
    mergedCompIds = np.array(compoundIdMaps["compoundId"])
    
  • 2021-02-03 22:53

    I don't offer optimizations but another way of looking at the problem. To me, this seems like a particular case of http://en.wikipedia.org/wiki/Longest_common_substring_problem where the substring would always be at the end of the list/string. The following algorithm is the dynamic programming version.

    def longest_common_substring(s1, s2):
        # m[x][y] = length of the common run ending at s1[x - 1] and s2[y - 1]
        m = [[0] * (1 + len(s2)) for _ in range(1 + len(s1))]
        longest, x_longest = 0, 0
        for x in range(1, 1 + len(s1)):
            for y in range(1, 1 + len(s2)):
                if s1[x - 1] == s2[y - 1]:
                    m[x][y] = m[x - 1][y - 1] + 1
                    if m[x][y] > longest:
                        longest = m[x][y]
                        x_longest = x
                else:
                    m[x][y] = 0
        # start and end indices (within s1) of the longest common run
        return x_longest - longest, x_longest
    
    master = [1,3,9,8,3,4,5]
    addition = [3,4,5,7,8]
    s, e = longest_common_substring(master, addition)
    if e - s > 1:
        print(master[:s] + addition)
    
    master = [9, 1, 1, 8, 7]
    addition = [8, 6, 7]
    s, e = longest_common_substring(master, addition)
    if e - s > 1:
        print(master[:s] + addition)
    else:
        print(master + addition)
    
    [1, 3, 9, 8, 3, 4, 5, 7, 8]
    [9, 1, 1, 8, 7, 8, 6, 7]
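
    Because the shared run must be a suffix of master and a prefix of addition, the full quadratic table is more than this special case needs. Below is a minimal sketch of a linear-time alternative. It swaps the dynamic-programming table for the prefix function from the Knuth-Morris-Pratt algorithm, and it counts an overlap of length 1 as valid. The names overlap_length and merge_kmp are made up for illustration:

```python
def overlap_length(master, addition):
    """Length of the longest suffix of master that is also a prefix of addition."""
    if not master or not addition:
        return 0
    # Only the last len(addition) items of master can take part in an overlap.
    tail = master[-len(addition):]
    sentinel = object()  # a fresh object, unequal to ordinary list items
    seq = list(addition) + [sentinel] + list(tail)
    # fail[i] = length of the longest proper prefix of seq[:i + 1]
    # that is also a suffix of it (the KMP prefix function).
    fail = [0] * len(seq)
    k = 0
    for i in range(1, len(seq)):
        while k and seq[i] != seq[k]:
            k = fail[k - 1]
        if seq[i] == seq[k]:
            k += 1
        fail[i] = k
    return fail[-1]


def merge_kmp(master, addition):
    # Hypothetical helper: drop the overlapping prefix of addition, then append.
    n = overlap_length(master, addition)
    return master + addition[n:]
```

    On the two examples above, merge_kmp gives the same merged lists as the printed output.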
    
  • 2021-02-03 22:55

    First of all and for clarity, you can replace your while loop with a for loop:

    def merge(master, addition):
        # try overlap lengths n = 1, 2, ... and merge at the first match
        for n in range(1, len(master)):
            if master[-n:] == addition[:n]:
                return master + addition[n:]
        return master + addition
    

    Then, you don't have to compare all possible slices, but only those for which master's slice starts with the first element of addition:

    def merge(master, addition):
        indices = [len(master) - i for i, x in enumerate(master) if x == addition[0]]
        for n in indices:
            if master[-n:] == addition[:n]:
                return master + addition[n:]
        return master + addition
    

    So instead of comparing slices like this:

    1234123141234
                3579
               3579
              3579
             3579
            3579
           3579
          3579
         3579
        3579
       3579
      3579
     3579
    3579
    

    you are only doing these comparisons:

    1234123141234
      |   |    |
      |   |    3579
      |   3579
      3579
    

    How much this will speed up your program depends on the nature of your data: the fewer repeated elements your lists have, the better.

    You could also generate a list of indices for addition so its own slices always end with master's last element, further restricting the number of comparisons.
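
    A rough sketch of that last idea, combining both filters: a candidate length n is kept only if addition[n - 1] equals master's last element and master[-n] equals addition's first element, and only then are the slices compared. The name merge_filtered is hypothetical, and this version tries the longest candidate first:

```python
def merge_filtered(master, addition):
    if not master or not addition:
        return master + addition
    last = master[-1]
    limit = min(len(master), len(addition))
    # Candidate overlap lengths: positions where addition holds master's last item.
    candidates = [n for n in range(limit, 0, -1) if addition[n - 1] == last]
    for n in candidates:
        # The overlap must also start with addition's first element.
        if master[-n] == addition[0] and master[-n:] == addition[:n]:
            return master + addition[n:]
    return master + addition
```

    For the 1234123141234 / 3579 example above, addition never contains master's last element, so no slices are compared at all.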

  • 2021-02-03 22:56

    This actually isn't too difficult. Essentially, all you're doing is checking which slice at the end of a lines up with a slice at the start of b.

    def merge(a, b):
        max_offset = len(b)  # can't overlap with greater size than len(b)
        for i in reversed(range(max_offset+1)):
            # checks for equivalence of decreasing sized slices
            if a[-i:] == b[:i]:
                break
        return a + b[i:]
    

    We can test with your test data by doing:

    test_data = [{'a': [1,3,9,8,3,4,5], 'b': [3,4,5,7,8], 'result': [1,3,9,8,3,4,5,7,8]},
                 {'a': [9, 1, 1, 8, 7], 'b': [8, 6, 7], 'result': [9, 1, 1, 8, 7, 8, 6, 7]}]
    
    all(merge(test['a'], test['b']) == test['result'] for test in test_data)
    

    This runs through every slice length that could produce an overlap, from the longest down, and stops at the first match. If nothing matches, the loop ends with i equal to 0. Either way, it returns all of a plus everything in b from index i onward: in the overlap case that is the non-overlapping portion, and in the non-overlap case it is all of b.

    Note that we can make a couple of optimizations in corner cases. For instance, the worst case here is that it runs through the whole list without finding any overlap. You could add a quick check at the beginning that short-circuits that worst case:

    def merge(a, b):
        if a[-1] not in b:
            return a + b
        ...
    

    In fact, you could take that one step further and probably make your algorithm much faster:

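    def merge(a, b):
        start = 0  # where to resume searching b
        while True:
            try:
                idx = b.index(a[-1], start) + 1  # next occurrence of a[-1] in b
            except ValueError:  # a[-1] not in the rest of b
                return a + b
            if a[-idx:] == b[:idx]:
                return a + b[idx:]  # append only the part after the overlap
            start = idx  # no match here, keep looking further right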
    

    However this might not find the longest overlap in cases like:

    a = [1,2,3,4,1,2,3,4]
    b = [3,4,1,2,3,4,5,6]
    # result should be [1,2,3,4,1,2,3,4,5,6], but
    # this algo produces [1,2,3,4,1,2,3,4,1,2,3,4,5,6]
    

    You could fix that by searching b from the right instead of the left (lists have no rindex method, so that means scanning in reverse) to match the longest slice instead of the shortest, but I'm not sure what that does to your speed. It's certainly slower, but the difference might be inconsequential. You could also collect every valid result and return the shortest one, which might be a better idea:

    def merge(a, b):
        results = [a + b]  # fallback when there is no overlap
        start = 0  # where to resume searching b
        while True:
            try:
                idx = b.index(a[-1], start) + 1  # next occurrence of a[-1] in b
            except ValueError:  # no further occurrence of a[-1] in b
                break
            if a[-idx:] == b[:idx]:
                results.append(a + b[idx:])
            start = idx
        return min(results, key=len)
    

    This should work, since merging at the longest overlap produces the shortest result in all cases.
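
    To illustrate that reasoning, here is a small sketch that lists every valid overlap length for the example lists and the length of the merge each one yields. The helper valid_overlaps is an assumption added for illustration, not part of the answer's code:

```python
def valid_overlaps(a, b):
    """Every n such that the last n items of a equal the first n items of b."""
    limit = min(len(a), len(b))
    return [n for n in range(limit + 1) if n == 0 or a[-n:] == b[:n]]


a = [1, 2, 3, 4, 1, 2, 3, 4]
b = [3, 4, 1, 2, 3, 4, 5, 6]

for n in valid_overlaps(a, b):
    merged = a + b[n:]
    # len(merged) is always len(a) + len(b) - n
    print(n, len(merged))
# prints:
# 0 16
# 2 14
# 6 10
```

    Each merged list has len(a) + len(b) - n items, so the largest valid n always gives the shortest result.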

  • 2021-02-03 23:01

    You can try the following:

    >>> a = [1, 3, 9, 8, 3, 4, 5]
    >>> b = [3, 4, 5, 7, 8]
    
    >>> matches = (i for i in range(len(b), 0, -1) if b[:i] == a[-i:])
    >>> i = next(matches, 0)
    >>> a + b[i:]
    [1, 3, 9, 8, 3, 4, 5, 7, 8]
    

    The idea is to compare the first i elements of b (b[:i]) with the last i elements of a (a[-i:]). We take i in decreasing order, from the length of b down to 1 (range(len(b), 0, -1)), because we want to match as much as possible. We take the first such i with next, and if there is none we fall back to zero (next(..., 0)). Once we have i, we append to a the elements of b from index i onward.
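
    The same idea can be wrapped in a function. This is only a sketch: the name merge_next is hypothetical, and capping i at the length of the shorter list is an added assumption that avoids comparing slices that cannot match:

```python
def merge_next(a, b):
    # An overlap cannot be longer than the shorter of the two lists.
    limit = min(len(a), len(b))
    # Lazy generator: slices are compared only until the first match is found.
    matches = (i for i in range(limit, 0, -1) if b[:i] == a[-i:])
    i = next(matches, 0)  # 0 means no overlap, so all of b is appended
    return a + b[i:]
```

    Because the generator is lazy, next stops at the longest matching i and the shorter candidates are never evaluated.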

  • 2021-02-03 23:06

    One trivial optimization is not iterating over the whole master list. I.e., replace while n < len(master) with for n in range(min(len(addition), len(master))) (and don't increment n in the loop). If there is no match, your current code will iterate over the entire master list, even if the slices being compared aren't even of the same length.

    Another concern is that you're taking slices of master and addition in order to compare them, which creates two new lists every time, and isn't really necessary. This solution (inspired by Boyer-Moore) doesn't use slicing:

    def merge(master, addition):
        overlap_lens = (i + 1 for i, e in enumerate(addition) if e == master[-1])
        for overlap_len in overlap_lens:
            for i in range(overlap_len):
                if master[-overlap_len + i] != addition[i]:
                    break
            else:
                return master + addition[overlap_len:]
        return master + addition
    

    The idea here is to generate all the indices of the last element of master in addition, and add 1 to each. Since a valid overlap must end with the last element of master, only those values are lengths of possible overlaps. Then we can check for each of them if the elements before it also line up.

    The function currently assumes that master is longer than addition (you'll probably get an IndexError at master[-overlap_len + i] if it isn't). Add a condition to the overlap_lens generator if you can't guarantee it.

    It's also non-greedy, i.e. it looks for the smallest non-empty overlap (merge([1, 2, 2], [2, 2, 3]) will return [1, 2, 2, 2, 3]). I think that's what you meant by "to merge at the last possible valid position". If you want a greedy version, iterate over the overlap_lens values in reverse order; a generator cannot be reversed directly, so build them as a list first.
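
    A possible greedy variant, sketched under two assumptions not in the code above: candidate lengths are capped at len(master) so the index never runs off the front of master, and an empty master is handled separately. The name merge_greedy is illustrative only:

```python
def merge_greedy(master, addition):
    if not master:
        return master + addition
    last = master[-1]
    # Candidate overlap lengths in increasing order, capped at len(master).
    overlap_lens = [i + 1 for i, e in enumerate(addition)
                    if e == last and i < len(master)]
    # Walk the candidates from longest to shortest.
    for overlap_len in reversed(overlap_lens):
        # Compare element by element, without building slices.
        if all(master[-overlap_len + i] == addition[i]
               for i in range(overlap_len)):
            return master + addition[overlap_len:]
    return master + addition
```

    With this version, merge_greedy([1, 2, 2], [2, 2, 3]) gives [1, 2, 2, 3] rather than [1, 2, 2, 2, 3].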
