Algorithm for linear pattern matching?

失恋的感觉 2021-02-03 15:49

I have a linear list of zeros and ones and I need to match multiple simple patterns and find the first occurrence. For example, I might need to find 0001101101,

7 answers
  •  日久生厌
    2021-02-03 16:16

    If the data is just alternating runs of 0's and 1's, encode your text as runs: a run of n 0's becomes -n and a run of n 1's becomes n. Encode your search strings the same way, then write a search function that works on the encoded lists instead of the raw characters.
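
    Before the full program, here is a tiny worked example of the encoding on the pattern from the question. It is only a sketch: `encode_runs` is a hypothetical helper built on `itertools.groupby`, not the `encode` function used in the code that follows, though it should produce the same lists for non-empty input.

```python
from itertools import groupby


def encode_runs(bits):
    """Run-length encode a string of '0'/'1' characters.

    A run of n zeros becomes -n and a run of n ones becomes +n.
    (Hypothetical helper name, chosen for this illustration.)
    """
    return [len(list(group)) * (1 if ch == "1" else -1)
            for ch, group in groupby(bits)]


print(encode_runs("0001101101"))  # [-3, 2, -1, 2, -1, 1]
```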

    The code looks like this:

    try:
        import psyco
        psyco.full()
    except ImportError:
        pass
    
    def encode(s):
        def calc_count(count, c):
            return count * (-1 if c == '0' else 1)
        result = []
        c = s[0]
        count = 1
        for i in range(1, len(s)):
            d = s[i]
            if d == c:
                count += 1
            else:
                result.append(calc_count(count, c))
                count = 1
                c = d
        result.append(calc_count(count, c))
        return result
    
    def search(encoded_source, targets):
        def match(encoded_source, t, max_search_len, len_source):
            x = len(t)-1
            # Get the indexes of the longest segments and search them first
            most_restrictive = [idx for idx, _ in sorted(((i, abs(t[i])) for i in range(1, x)), key=lambda pair: pair[1], reverse=True)]
            # Align the signs of the source and target
            index = (0 if encoded_source[0] * t[0] > 0 else 1)
            unencoded_pos = sum(abs(c) for c in encoded_source[:index])
            start_t, end_t = abs(t[0]), abs(t[x])
            for i in range(index, len(encoded_source)-x, 2):
                if all(t[j] == encoded_source[j+i] for j in most_restrictive):
                    encoded_start, encoded_end = abs(encoded_source[i]), abs(encoded_source[i+x])
                    if start_t <= encoded_start and end_t <= encoded_end:
                        return unencoded_pos + (abs(encoded_source[i]) - start_t)
                unencoded_pos += abs(encoded_source[i]) + abs(encoded_source[i+1])
                if unencoded_pos > max_search_len:
                    return len_source
            return len_source
        len_source = sum(abs(c) for c in encoded_source)
        i, found, target_index = len_source, None, -1
        for j, t in enumerate(targets):
            x = match(encoded_source, t, i, len_source)
            print("Match at:  %s" % x)
            if x < i:
                i, found, target_index = x, t, j
        return (i, found, target_index)
    
    
    if __name__ == "__main__":
        import datetime
        def make_source_text(length):
            # 'length' avoids shadowing the built-in len()
            from random import randint
            item_len = 8
            item_count = 2**item_len
            # All 8-bit strings, most significant bit first
            table = ["".join("1" if (j & (1 << i)) else "0" for i in reversed(range(item_len))) for j in range(item_count)]
            return "".join(table[randint(0,item_count-1)] for _ in range(length//item_len))
        targets = ['0001101101'*2, '01010100100'*2, '10100100010'*2]
        encoded_targets = [encode(t) for t in targets]
        data_len = 10*1000*1000
        s = datetime.datetime.now()
        source_text = make_source_text(data_len)
        e = datetime.datetime.now()
        print("Make source text(length %d):  %s" % (data_len, e - s))
        s = datetime.datetime.now()
        encoded_source = encode(source_text)
        e = datetime.datetime.now()
        print("Encode source text:  %s" % (e - s))
    
        s = datetime.datetime.now()
        (i, found, target_index) = search(encoded_source, encoded_targets)
        print((i, found, target_index))
        print("Target was:  %s" % targets[target_index])
        print("Source matched here:  %s" % source_text[i:i+len(targets[target_index])])
        e = datetime.datetime.now()
        print("Search time:  %s" % (e - s))
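
    The matching rule buried in `match` above may be easier to see in isolation: the interior runs of the target must equal the source runs exactly, while the first and last source runs only need to be at least as long as the target's. The sketch below is a simplified illustration of that rule, not the answer's optimized search. `encode_runs` and `find_in_runs` are hypothetical names, and the sketch assumes a target with at least two runs.

```python
from itertools import groupby


def encode_runs(bits):
    """Run of n zeros -> -n, run of n ones -> +n (hypothetical helper)."""
    return [len(list(g)) * (1 if ch == "1" else -1) for ch, g in groupby(bits)]


def find_in_runs(source_runs, target_runs):
    """Return the character offset of the first match, or -1 if none."""
    last = len(target_runs) - 1
    if last < 1:
        raise ValueError("this sketch assumes a target with at least two runs")
    pos = 0  # character offset at which source run i starts
    for i in range(len(source_runs) - last):
        first, final = source_runs[i], source_runs[i + last]
        if (first * target_runs[0] > 0                    # same bit value
                and abs(first) >= abs(target_runs[0])     # first run long enough
                and abs(final) >= abs(target_runs[last])  # last run long enough
                and source_runs[i + 1:i + last] == target_runs[1:last]):
            # The match is right-aligned inside the first source run.
            return pos + abs(first) - abs(target_runs[0])
        pos += abs(first)
    return -1


source, target = "110001101101100", "0001101101"
print(find_in_runs(encode_runs(source), encode_runs(target)))  # 2
print(source.find(target))                                     # 2
```

    Comparing against `str.find` on small inputs like this is a cheap way to check the offsets before running on millions of characters.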
    

    On a string twice as long as the one you offered, it takes about seven seconds to find the earliest match of three targets in 10 million characters. Since the source text is random, the time varies somewhat from run to run.

    psyco is a Python module that optimizes code at run time. With it you get much better performance, and you can treat the resulting times as a rough upper bound on what a C/C++ version would need. Here are the timings from a recent run:

    Make source text(length 10000000):  0:00:02.277000
    Encode source text:  0:00:00.329000
    Match at:  2517905
    Match at:  494990
    Match at:  450986
    (450986, [1, -1, 1, -2, 1, -3, 1, -1, 1, -1, 1, -2, 1, -3, 1, -1], 2)
    Target was:  1010010001010100100010
    Source matched here:  1010010001010100100010
    Search time:  0:00:04.325000
    

    It takes about 300 milliseconds to encode 10 million characters and about 4 seconds to search three encoded strings against it. I don't think the encoding time would be high in C/C++.
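
    To sanity-check the reported offsets, or to get a timing baseline on your own machine, one option is a plain search with Python's built-in `str.find` on the unencoded text. This is a different, simpler technique than the run-length search above, shown only as a reference point. `first_match` is a hypothetical name, and the sketch assumes non-empty targets.

```python
def first_match(source, targets):
    """Return (offset, target_index) of the earliest match, or (-1, -1)."""
    best_offset, best_index = -1, -1
    for index, target in enumerate(targets):
        # Once some match is known, a later target only matters if it
        # starts strictly earlier, so the search window can be cut short.
        if best_offset == -1:
            end = len(source)
        else:
            end = best_offset + len(target) - 1
        offset = source.find(target, 0, end)
        if offset != -1:
            best_offset, best_index = offset, index
    return best_offset, best_index


print(first_match("1100011011011010", ["1011010", "0001101101"]))  # (2, 1)
```

    Shrinking the window plays the same role as `max_search_len` in the encoded search: later targets are not scanned past the best match found so far.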
