Algorithm for linear pattern matching?

后端未结

关注

 7  1974

I have a linear list of zeros and ones and I need to match multiple simple patterns and find the first occurrence. For example, I might need to find 0001101101,

相关标签:

7条回答

日久生厌

2021-02-03 16:16

If it's just alternating 0's and 1's, then encode your text as runs. A run of n 0's is -n and a run of n 1's is n. Then encode your search strings. Then create a search function that uses the encoded strings.

The code looks like this:

try:
    import psyco
    psyco.full()
except ImportError:
    pass

def encode(s):
    def calc_count(count, c):
        return count * (-1 if c == '0' else 1)
    result = []
    c = s[0]
    count = 1
    for i in range(1, len(s)):
        d = s[i]
        if d == c:
            count += 1
        else:
            result.append(calc_count(count, c))
            count = 1
            c = d
    result.append(calc_count(count, c))
    return result

def search(encoded_source, targets):
    def match(encoded_source, t, max_search_len, len_source):
        x = len(t)-1
        # Get the indexes of the longest segments and search them first
        most_restrictive = [bb[0] for bb in sorted(((i, abs(t[i])) for i in range(1,x)), key=lambda x: x[1], reverse=True)]
        # Align the signs of the source and target
        index = (0 if encoded_source[0] * t[0] > 0 else 1)
        unencoded_pos = sum(abs(c) for c in encoded_source[:index])
        start_t, end_t = abs(t[0]), abs(t[x])
        for i in range(index, len(encoded_source)-x, 2):
            if all(t[j] == encoded_source[j+i] for j in most_restrictive):
                encoded_start, encoded_end = abs(encoded_source[i]), abs(encoded_source[i+x])
                if start_t <= encoded_start and end_t <= encoded_end:
                    return unencoded_pos + (abs(encoded_source[i]) - start_t)
            unencoded_pos += abs(encoded_source[i]) + abs(encoded_source[i+1])
            if unencoded_pos > max_search_len:
                return len_source
        return len_source
    len_source = sum(abs(c) for c in encoded_source)
    i, found, target_index = len_source, None, -1
    for j, t in enumerate(targets):
        x = match(encoded_source, t, i, len_source)
        print "Match at: ", x
        if x < i:
            i, found, target_index = x, t, j
    return (i, found, target_index)


if __name__ == "__main__":
    import datetime
    def make_source_text(len):
        from random import randint
        item_len = 8
        item_count = 2**item_len
        table = ["".join("1" if (j & (1 << i)) else "0" for i in reversed(range(item_len))) for j in range(item_count)]
        return "".join(table[randint(0,item_count-1)] for _ in range(len//item_len))
    targets = ['0001101101'*2, '01010100100'*2, '10100100010'*2]
    encoded_targets = [encode(t) for t in targets]
    data_len = 10*1000*1000
    s = datetime.datetime.now()
    source_text = make_source_text(data_len)
    e = datetime.datetime.now()
    print "Make source text(length %d): " % data_len,  (e - s)
    s = datetime.datetime.now()
    encoded_source = encode(source_text)
    e = datetime.datetime.now()
    print "Encode source text: ", (e - s)

    s = datetime.datetime.now()
    (i, found, target_index) = search(encoded_source, encoded_targets)
    print (i, found, target_index)
    print "Target was: ", targets[target_index]
    print "Source matched here: ", source_text[i:i+len(targets[target_index])]
    e = datetime.datetime.now()
    print "Search time: ", (e - s)

On a string twice as long as you offered, it takes about seven seconds to find the earliest match of three targets in 10 million characters. Of course, since I am using random text, that varies a bit with each run.

psyco is a python module for optimizing the code at run-time. Using it, you get great performance, and you might estimate that as an upper bound on the C/C++ performance. Here is recent performance:

Make source text(length 10000000):  0:00:02.277000
Encode source text:  0:00:00.329000
Match at:  2517905
Match at:  494990
Match at:  450986
(450986, [1, -1, 1, -2, 1, -3, 1, -1, 1, -1, 1, -2, 1, -3, 1, -1], 2)
Target was:  1010010001010100100010
Source matched here:  1010010001010100100010
Search time:  0:00:04.325000

It takes about 300 milliseconds to encode 10 million characters and about 4 seconds to search three encoded strings against it. I don't think the encoding time would be high in C/C++.

0 讨论(0)

闹比i

2021-02-03 16:29

If it's a bit array, I suppose doing a rolling sum would be an improvement. If pattern is length n, sum the first n bits and see if it matches the pattern's sum. Store the first bit of the sum always. Then, for every next bit, subtract the first bit from the sum and add the next bit, and see if the sum matches the pattern's sum. That would save the linear loop over the pattern.

It seems like the BM algorithm isn't as awesome for this as it looks, because here I only have two possible values, zero and one, so the first table doesn't help a whole lot. Second table might help, but that means BMH is mostly worthless.

Edit: In my sleep-deprived state I couldn't understand BM, so I just implemented this rolling sum (it was really easy) and it made my search 3 times faster. Thanks to whoever mentioned "rolling hashes". I can now search through 321,750,000 bits for two 30-bit patterns in 5.45 seconds (and that's single-threaded), versus 17.3 seconds before.

0 讨论(0)
发布评论:

提交评论
- 加载中...
说谎

2021-02-03 16:30

The key for Googling is "multi-pattern" string matching.

Back in 1975, Aho and Corasick published a (linear-time) algorithm, which was used in the original version of fgrep. The algorithm subsequently got refined by many researchers. For example, Commentz-Walter (1979) combined Aho&Corasick with Boyer&Moore matching. Baeza-Yates (1989) combined AC with the Boyer-Moore-Horspool variant. Wu and Manber (1994) did similar work.

An alternative to the AC line of multi-pattern matching algorithms is Rabin and Karp's algorithm.

I suggest to start with reading the Aho-Corasick and Rabin-Karp Wikipedia pages and then decide whether that would make sense in your case. If so, maybe there already is an implementation for your language/runtime available.

0 讨论(0)
发布评论:

提交评论
- 加载中...
名媛妹妹

2021-02-03 16:30
A solution that could be efficient:
1. store the patterns in a trie data structure
2. start searching the list
3. check if the next pattern_length chars are in the trie, stop on success ( O(1) operation )
4. step one char and repeat #3
If the list isn't mutable you can store the offset of matching patterns to avoid repeating calculations the next time.
0 讨论(0)
发布评论:

提交评论
- 加载中...
旧巷少年郎

2021-02-03 16:33

If your strings need to be flexible, I would also recommend a modified "The Boyer–Moore string search algorithm" as per Mitch Wheat. If your strings do not need to be flexible, you should be able to collapse your pattern matching even more. The model of Boyer-Moore is incredibly efficient for searching a large amount of data for one of multiple strings to match against.

Jacob

0 讨论(0)
发布评论:

提交评论
- 加载中...
别那么骄傲

2021-02-03 16:34

Yes.

The Boyer–Moore string search algorithm

See also: Knuth–Morris–Pratt algorithm

0 讨论(0)
发布评论:

提交评论
- 加载中...

1 2 下一页