Pythonic way to merge two overlapping lists, preserving order


鱼传尺愫

Alright, so I have two lists, as such:

They can and will have overlapping items, for example, [1, 2, 3, 4, 5], [4, 5, 6, 7].
The

8 answers

离开以前

2021-02-03 22:49

All of the above solutions use a for / while loop for the merging task. I first tried the solutions by @JuniorCompressor and @TankorSmash, but they are far too slow for merging two large lists (e.g. lists with millions of elements).

I found that concatenating large lists with pandas is much faster:

```
import numpy as np
import pandas as pd

trainCompIdMaps = pd.DataFrame({
    "compoundId": np.random.permutation(range(800))[0:80],
    "partition": np.repeat("train", 80).tolist(),
})

testCompIdMaps = pd.DataFrame({
    "compoundId": np.random.permutation(range(800))[0:20],
    "partition": np.repeat("test", 20).tolist(),
})

# row-wise concatenation of the two DataFrames
compoundIdMaps = pd.concat([trainCompIdMaps, testCompIdMaps], axis=0)

mergedCompIds = np.array(compoundIdMaps["compoundId"])
```


一向

2021-02-03 22:53

I don't offer optimizations but another way of looking at the problem. To me, this seems like a particular case of http://en.wikipedia.org/wiki/Longest_common_substring_problem where the substring would always be at the end of the list/string. The following algorithm is the dynamic programming version.

```
def longest_common_substring(s1, s2):
    # m[x][y] = length of the common substring ending at s1[x - 1] and s2[y - 1]
    m = [[0] * (1 + len(s2)) for _ in range(1 + len(s1))]
    longest, x_longest = 0, 0
    for x in range(1, 1 + len(s1)):
        for y in range(1, 1 + len(s2)):
            if s1[x - 1] == s2[y - 1]:
                m[x][y] = m[x - 1][y - 1] + 1
                if m[x][y] > longest:
                    longest = m[x][y]
                    x_longest = x
            else:
                m[x][y] = 0
    # start and end indices of the longest common substring within s1
    return x_longest - longest, x_longest

master = [1, 3, 9, 8, 3, 4, 5]
addition = [3, 4, 5, 7, 8]
s, e = longest_common_substring(master, addition)
if e - s > 1:
    print(master[:s] + addition)

master = [9, 1, 1, 8, 7]
addition = [8, 6, 7]
s, e = longest_common_substring(master, addition)
if e - s > 1:
    print(master[:s] + addition)
else:
    print(master + addition)
```
Output:
```
[1, 3, 9, 8, 3, 4, 5, 7, 8]
[9, 1, 1, 8, 7, 8, 6, 7]
```
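Since the common part has to be a suffix of master and a prefix of addition, the quadratic table is more than strictly needed. Below is a minimal sketch of a different technique, the Knuth-Morris-Pratt prefix function, which finds the longest such overlap in linear time. It is not the method used in this answer, and the names `overlap_length` and `failure` are my own, chosen for illustration.

```python
def overlap_length(master, addition):
    """Length of the longest suffix of master that is also a prefix of addition."""
    if not addition:
        return 0
    # failure[k]: length of the longest proper prefix of addition[:k + 1]
    # that is also a suffix of it (the KMP prefix function).
    failure = [0] * len(addition)
    k = 0
    for i in range(1, len(addition)):
        while k and addition[i] != addition[k]:
            k = failure[k - 1]
        if addition[i] == addition[k]:
            k += 1
        failure[i] = k

    # Scan master once, tracking how much of addition's prefix currently matches.
    matched = 0
    for item in master:
        if matched == len(addition):
            # All of addition matched earlier; fall back so scanning can continue.
            matched = failure[matched - 1]
        while matched and item != addition[matched]:
            matched = failure[matched - 1]
        if item == addition[matched]:
            matched += 1
    return matched


def merge(master, addition):
    # Append only the part of addition that master does not already end with.
    return master + addition[overlap_length(master, addition):]
```

Unlike the `e - s > 1` check above, this sketch also merges on an overlap of a single element; add a length threshold if that is not wanted.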


独厮守ぢ

2021-02-03 22:55
First of all and for clarity, you can replace your while loop with a for loop:
```
def merge(master, addition):
    for n in range(1, len(master) + 1):  # n = candidate overlap length, including a full overlap
        if master[-n:] == addition[:n]:
            return master + addition[n:]
    return master + addition
```
Then, you don't have to compare all possible slices, but only those for which master's slice starts with the first element of addition:
```
def merge(master, addition):
    indices = [len(master) - i for i, x in enumerate(master) if x == addition[0]]
    for n in indices:
        if master[-n:] == addition[:n]:
            return master + addition[n:]
    return master + addition
```
So instead of comparing slices like this:
```
1234123141234
            3579
           3579
          3579
         3579
        3579
       3579
      3579
     3579
    3579
   3579
  3579
 3579
3579
```
you are only doing these comparisons:
```
1234123141234
  |   |    |
  |   |    3579
  |   3579
  3579
```
How much this will speed up your program depends on the nature of your data: the fewer repeated elements your lists have, the better.

You could also generate a list of indices for addition so its own slices always end with master's last element, further restricting the number of comparisons.
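A minimal sketch of that last idea, filtering on both ends at once. A candidate length n is kept only if master[-n] equals addition's first element and addition[n - 1] equals master's last element. The guards for empty lists and the cap at the shorter length are my own additions, not part of the code above.

```python
def merge(master, addition):
    if not master or not addition:
        return master + addition
    first, last = addition[0], master[-1]
    limit = min(len(master), len(addition))
    # Candidate overlap lengths, longest first: the overlap must start on
    # addition's first element and end on master's last element.
    candidates = [
        n for n in range(limit, 0, -1)
        if master[-n] == first and addition[n - 1] == last
    ]
    for n in candidates:
        if master[-n:] == addition[:n]:
            return master + addition[n:]
    return master + addition
```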
再見小時候

2021-02-03 22:56
This actually isn't too terribly difficult. After all, essentially all you're doing is checking what substring at the end of A lines up with what substring of B.
```
def merge(a, b):
    max_offset = len(b)  # can't overlap with greater size than len(b)
    for i in reversed(range(max_offset+1)):
        # checks for equivalence of decreasing sized slices
        if a[-i:] == b[:i]:
            break
    return a + b[i:]
```
We can test with your test data by doing:
```
test_data = [{'a': [1,3,9,8,3,4,5], 'b': [3,4,5,7,8], 'result': [1,3,9,8,3,4,5,7,8]},
             {'a': [9, 1, 1, 8, 7], 'b': [8, 6, 7], 'result': [9, 1, 1, 8, 7, 8, 6, 7]}]

all(merge(test['a'], test['b']) == test['result'] for test in test_data)
```
This runs through every slice length that could produce an overlap, from longest to shortest, and stops at the first match. If nothing matches, the loop ends with i equal to 0. Either way, it returns all of a plus everything from b[i] onward (in the overlap case, that's the non-overlapping portion; in the non-overlap case, it's everything).

Note that we can make a couple of optimizations for corner cases. For instance, the worst case here is that it runs through the whole list without finding any overlap. You could add a quick check at the beginning that might short-circuit that worst case:
```
def merge(a, b):
    if a[-1] not in b:
        return a + b
    ...
```
In fact you could take that solution one step further and probably make your algorithm much faster
```
def merge(a, b):
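    start = 0
    while True:
        try:
            idx = b.index(a[-1], start) + 1  # next occurrence of a[-1] in b, left to right
        except ValueError:  # no further occurrence of a[-1] in b
            return a + b
        if a[-idx:] == b[:idx]:
            return a + b[idx:]  # append only the part of b after the overlap
        start = idx  # no match here, keep searching after this occurrence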
```
However this might not find the longest overlap in cases like:
```
a = [1,2,3,4,1,2,3,4]
b = [3,4,1,2,3,4,5,6]
# result should be [1,2,3,4,1,2,3,4,5,6], but
# this algo produces [1,2,3,4,1,2,3,4,1,2,3,4,5,6]
```
You could fix that by searching from the right instead of the left, to match the longest slice instead of the shortest. Note that Python lists have no rindex method (only strings do), so you would have to scan b in reverse yourself, and I'm not sure what that does to your speed. It's certainly slower, but it might be inconsequential. You could also collect every matching result and return the shortest one, which might be a better idea.
```
def merge(a, b):
    results = []
    start = 0
    while True:
        try:
            idx = b.index(a[-1], start) + 1  # next occurrence of a[-1] in b
        except ValueError:  # no further occurrence of a[-1] in b
            results.append(a + b)
            break
        if a[-idx:] == b[:idx]:
            results.append(a + b[idx:])
        start = idx  # continue searching after this occurrence
    return min(results, key=len)
```
This should work, since merging the longest overlap produces the shortest result in all cases.
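For comparison, here is a rough sketch of the search-from-the-right variant. Python lists do not have an rindex method, so this version simply walks the candidate lengths from longest to shortest and only compares slices where b holds a[-1]. The empty-list guard and the cap at the shorter length are assumptions I added.

```python
def merge(a, b):
    if not a:
        return a + b
    last = a[-1]
    # Walk candidate overlap lengths from longest to shortest, so the first
    # match found is the longest overlap.
    for idx in range(min(len(a), len(b)), 0, -1):
        if b[idx - 1] == last and a[-idx:] == b[:idx]:
            return a + b[idx:]
    return a + b
```

With the a and b from the example above, this returns [1, 2, 3, 4, 1, 2, 3, 4, 5, 6] without building any intermediate results.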
星月不相逢

2021-02-03 23:01
You can try the following:
```
>>> a = [1, 3, 9, 8, 3, 4, 5]
>>> b = [3, 4, 5, 7, 8]

>>> matches = (i for i in range(len(b), 0, -1) if b[:i] == a[-i:])
>>> i = next(matches, 0)
>>> a + b[i:]
[1, 3, 9, 8, 3, 4, 5, 7, 8]
```
The idea is to compare the first i elements of b (b[:i]) with the last i elements of a (a[-i:]). We take i in decreasing order, from the length of b down to 1 (range(len(b), 0, -1)), because we want to match as much as possible. next returns the first such i, or the default 0 if there is none (next(..., 0)). Once we have i, we append to a the elements of b from index i onward.
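The same idea wrapped in a function, as a small sketch for Python 3. Capping i at the length of the shorter list is my own tweak (a longer slice can never match), not something the session above does.

```python
def merge(a, b):
    # An overlap can be no longer than the shorter of the two lists.
    longest = min(len(a), len(b))
    matches = (i for i in range(longest, 0, -1) if b[:i] == a[-i:])
    i = next(matches, 0)  # 0 means no overlap was found
    return a + b[i:]
```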
爱一瞬间的悲伤

2021-02-03 23:06
One trivial optimization is not iterating over the whole master list. I.e., replace while n < len(master) with for n in range(min(len(addition), len(master))) (and don't increment n in the loop). If there is no match, your current code will iterate over the entire master list, even if the slices being compared aren't even of the same length.

Another concern is that you're taking slices of master and addition in order to compare them, which creates two new lists every time, and isn't really necessary. This solution (inspired by Boyer-Moore) doesn't use slicing:
```
def merge(master, addition):
    overlap_lens = (i + 1 for i, e in enumerate(addition) if e == master[-1])
    for overlap_len in overlap_lens:
        for i in range(overlap_len):
            if master[-overlap_len + i] != addition[i]:
                break
        else:
            return master + addition[overlap_len:]
    return master + addition
```
The idea here is to generate all the indices of the last element of master in addition, and add 1 to each. Since a valid overlap must end with the last element of master, only those values are lengths of possible overlaps. Then we can check for each of them if the elements before it also line up.

The function currently assumes that master is at least as long as addition (otherwise master[-overlap_len + i] can raise an IndexError). Add a condition to the overlap_lens generator if you can't guarantee it.

It's also non-greedy, i.e. it looks for the smallest non-empty overlap (merge([1, 2, 2], [2, 2, 3]) will return [1, 2, 2, 2, 3]). I think that's what you meant by "to merge at the last possible valid position". If you want a greedy version, reverse the overlap_lens generator.
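A possible greedy variant with the length guard added, as a sketch. Since reversed() needs a sequence, the candidates are built as a list rather than a generator; the `i < len(master)` condition and the empty-master check are the guards mentioned above, written the way I would assume they fit.

```python
def merge(master, addition):
    if not master:
        return master + addition
    # Lengths of candidate overlaps: positions in addition holding master's
    # last element, capped at len(master) to avoid the IndexError.
    overlap_lens = [
        i + 1
        for i, e in enumerate(addition)
        if i < len(master) and e == master[-1]
    ]
    # Longest candidate first, so the largest valid overlap wins.
    for overlap_len in reversed(overlap_lens):
        for i in range(overlap_len):
            if master[-overlap_len + i] != addition[i]:
                break
        else:
            return master + addition[overlap_len:]
    return master + addition
```

With this version, merge([1, 2, 2], [2, 2, 3]) returns [1, 2, 2, 3] instead of [1, 2, 2, 2, 3].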