Algorithm for linear pattern matching?

后端 未结 7 1974
失恋的感觉
失恋的感觉 2021-02-03 15:49

I have a linear list of zeros and ones and I need to match multiple simple patterns and find the first occurrence. For example, I might need to find 0001101101,

相关标签:
7条回答
  • 2021-02-03 16:16

    If it's just alternating 0's and 1's, then encode your text as runs. A run of n 0's is -n and a run of n 1's is n. Then encode your search strings. Then create a search function that uses the encoded strings.

    The code looks like this:

    try:
        import psyco
        psyco.full()
    except ImportError:
        pass
    
    def encode(s):
        def calc_count(count, c):
            return count * (-1 if c == '0' else 1)
        result = []
        c = s[0]
        count = 1
        for i in range(1, len(s)):
            d = s[i]
            if d == c:
                count += 1
            else:
                result.append(calc_count(count, c))
                count = 1
                c = d
        result.append(calc_count(count, c))
        return result
    
    def search(encoded_source, targets):
        def match(encoded_source, t, max_search_len, len_source):
            x = len(t)-1
            # Get the indexes of the longest segments and search them first
            most_restrictive = [bb[0] for bb in sorted(((i, abs(t[i])) for i in range(1,x)), key=lambda x: x[1], reverse=True)]
            # Align the signs of the source and target
            index = (0 if encoded_source[0] * t[0] > 0 else 1)
            unencoded_pos = sum(abs(c) for c in encoded_source[:index])
            start_t, end_t = abs(t[0]), abs(t[x])
            for i in range(index, len(encoded_source)-x, 2):
                if all(t[j] == encoded_source[j+i] for j in most_restrictive):
                    encoded_start, encoded_end = abs(encoded_source[i]), abs(encoded_source[i+x])
                    if start_t <= encoded_start and end_t <= encoded_end:
                        return unencoded_pos + (abs(encoded_source[i]) - start_t)
                unencoded_pos += abs(encoded_source[i]) + abs(encoded_source[i+1])
                if unencoded_pos > max_search_len:
                    return len_source
            return len_source
        len_source = sum(abs(c) for c in encoded_source)
        i, found, target_index = len_source, None, -1
        for j, t in enumerate(targets):
            x = match(encoded_source, t, i, len_source)
            print "Match at: ", x
            if x < i:
                i, found, target_index = x, t, j
        return (i, found, target_index)
    
    
    if __name__ == "__main__":
        import datetime
        def make_source_text(len):
            from random import randint
            item_len = 8
            item_count = 2**item_len
            table = ["".join("1" if (j & (1 << i)) else "0" for i in reversed(range(item_len))) for j in range(item_count)]
            return "".join(table[randint(0,item_count-1)] for _ in range(len//item_len))
        targets = ['0001101101'*2, '01010100100'*2, '10100100010'*2]
        encoded_targets = [encode(t) for t in targets]
        data_len = 10*1000*1000
        s = datetime.datetime.now()
        source_text = make_source_text(data_len)
        e = datetime.datetime.now()
        print "Make source text(length %d): " % data_len,  (e - s)
        s = datetime.datetime.now()
        encoded_source = encode(source_text)
        e = datetime.datetime.now()
        print "Encode source text: ", (e - s)
    
        s = datetime.datetime.now()
        (i, found, target_index) = search(encoded_source, encoded_targets)
        print (i, found, target_index)
        print "Target was: ", targets[target_index]
        print "Source matched here: ", source_text[i:i+len(targets[target_index])]
        e = datetime.datetime.now()
        print "Search time: ", (e - s)
    

    On a string twice as long as you offered, it takes about seven seconds to find the earliest match of three targets in 10 million characters. Of course, since I am using random text, that varies a bit with each run.

    psyco is a python module for optimizing the code at run-time. Using it, you get great performance, and you might estimate that as an upper bound on the C/C++ performance. Here is recent performance:

    Make source text(length 10000000):  0:00:02.277000
    Encode source text:  0:00:00.329000
    Match at:  2517905
    Match at:  494990
    Match at:  450986
    (450986, [1, -1, 1, -2, 1, -3, 1, -1, 1, -1, 1, -2, 1, -3, 1, -1], 2)
    Target was:  1010010001010100100010
    Source matched here:  1010010001010100100010
    Search time:  0:00:04.325000
    

    It takes about 300 milliseconds to encode 10 million characters and about 4 seconds to search three encoded strings against it. I don't think the encoding time would be high in C/C++.

    0 讨论(0)
  • 2021-02-03 16:29

    If it's a bit array, I suppose doing a rolling sum would be an improvement. If pattern is length n, sum the first n bits and see if it matches the pattern's sum. Store the first bit of the sum always. Then, for every next bit, subtract the first bit from the sum and add the next bit, and see if the sum matches the pattern's sum. That would save the linear loop over the pattern.

    It seems like the BM algorithm isn't as awesome for this as it looks, because here I only have two possible values, zero and one, so the first table doesn't help a whole lot. Second table might help, but that means BMH is mostly worthless.

    Edit: In my sleep-deprived state I couldn't understand BM, so I just implemented this rolling sum (it was really easy) and it made my search 3 times faster. Thanks to whoever mentioned "rolling hashes". I can now search through 321,750,000 bits for two 30-bit patterns in 5.45 seconds (and that's single-threaded), versus 17.3 seconds before.

    0 讨论(0)
  • 2021-02-03 16:30

    The key for Googling is "multi-pattern" string matching.

    Back in 1975, Aho and Corasick published a (linear-time) algorithm, which was used in the original version of fgrep. The algorithm subsequently got refined by many researchers. For example, Commentz-Walter (1979) combined Aho&Corasick with Boyer&Moore matching. Baeza-Yates (1989) combined AC with the Boyer-Moore-Horspool variant. Wu and Manber (1994) did similar work.

    An alternative to the AC line of multi-pattern matching algorithms is Rabin and Karp's algorithm.

    I suggest to start with reading the Aho-Corasick and Rabin-Karp Wikipedia pages and then decide whether that would make sense in your case. If so, maybe there already is an implementation for your language/runtime available.

    0 讨论(0)
  • 2021-02-03 16:30

    A solution that could be efficient:

    1. store the patterns in a trie data structure
    2. start searching the list
    3. check if the next pattern_length chars are in the trie, stop on success ( O(1) operation )
    4. step one char and repeat #3

    If the list isn't mutable you can store the offset of matching patterns to avoid repeating calculations the next time.

    0 讨论(0)
  • 2021-02-03 16:33

    If your strings need to be flexible, I would also recommend a modified "The Boyer–Moore string search algorithm" as per Mitch Wheat. If your strings do not need to be flexible, you should be able to collapse your pattern matching even more. The model of Boyer-Moore is incredibly efficient for searching a large amount of data for one of multiple strings to match against.

    Jacob

    0 讨论(0)
  • 2021-02-03 16:34

    Yes.

    The Boyer–Moore string search algorithm

    See also: Knuth–Morris–Pratt algorithm

    0 讨论(0)
提交回复
热议问题