I can do basic regex alright, but this is slightly different, namely I don't know what the pattern is going to be.
For example, I have a list of similar strings:

['asometxt0moretxt', 'bsometxt1moretxt', 'aasometxt10moretxt', 'zzsometxt999moretxt']
I guess you should start by identifying substrings (patterns) that frequently occur in the strings. Since naively counting substrings in a set of strings is rather computationally expensive, you'll need to come up with something smart.
I've done substring counting on a large amount of data using generalized suffix trees (example here). Once you know the most frequent substrings/patterns in the data, you can take it from there.
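For small inputs like this one, the naive counting is actually affordable; here is a rough sketch (a brute-force tally of my own, not a suffix tree, and the length 7 is just picked to suit the sample data):

from collections import Counter

def frequent_substrings(strings, length, top=5):
    # Brute-force tally of every substring of the given length.
    # Fine for a handful of short strings, far too slow for large data.
    counts = Counter()
    for s in strings:
        for i in range(len(s) - length + 1):
            counts[s[i:i + length]] += 1
    return counts.most_common(top)

lst = ['asometxt0moretxt', 'bsometxt1moretxt', 'aasometxt10moretxt', 'zzsometxt999moretxt']
print(frequent_substrings(lst, 7))
# [('sometxt', 4), ('moretxt', 4), ...] -- the common chunks float to the top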
This looks much like the LZW algorithm for data (text) compression. There should be Python implementations out there, which you may be able to adapt to your needs.
I assume you have no a priori knowledge of these substrings that repeat often.
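Not a full compressor, but here is a minimal sketch of the LZW dictionary-building pass (my own illustration); with only four strings the phrases stay short, yet they grow exactly where the text repeats:

def lzw_phrases(text):
    # Standard LZW dictionary construction: start with single characters,
    # then record each previously unseen phrase (longest known prefix + next char).
    phrases = set(text)
    current = ''
    for ch in text:
        candidate = current + ch
        if candidate in phrases:
            current = candidate
        else:
            phrases.add(candidate)
            current = ch
    return phrases

lst = ['asometxt0moretxt', 'bsometxt1moretxt', 'aasometxt10moretxt', 'zzsometxt999moretxt']
print(sorted(lzw_phrases(''.join(lst)), key=len, reverse=True)[:8])
# the longest entries are fragments of the repeated chunks, e.g. 'some', 'more', 'etxt'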
This seems to be an example of the longest common subsequence problem. One way could be to look at how diffs are generated. The Hunt-McIlroy algorithm seems to have been the first, and as such the simplest, especially since it apparently is non-heuristic.
The first link contains a detailed discussion and (pseudo)code examples. Assuming, of course, I'm not completely off track here.
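Python's standard difflib is not Hunt-McIlroy (it uses a Ratcliff/Obershelp-style heuristic), but it is a ready-made way to see which blocks two of these strings share; a minimal sketch:

import difflib

a, b = 'asometxt0moretxt', 'bsometxt1moretxt'
matcher = difflib.SequenceMatcher(None, a, b)
# get_matching_blocks() returns (a_start, b_start, size) triples for the common runs
common = [a[m.a:m.a + m.size] for m in matcher.get_matching_blocks() if m.size]
print(common)  # ['sometxt', 'moretxt']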
Here is a scary one to get the ball rolling.
>>> import re
>>> makere = lambda n: ''.join(['(.*?)(.+)(.*?)(.+)(.*?)'] + ['(.*)(\\2)(.*)(\\4)(.*)'] * (n - 1))
>>> inp = ['asometxt0moretxt', 'bsometxt1moretxt', 'aasometxt10moretxt', 'zzsometxt999moretxt']
>>> re.match(makere(len(inp)), ''.join(inp)).groups()
('a', 'sometxt', '0', 'moretxt', '', 'b', 'sometxt', '1', 'moretxt', 'aa', '', 'sometxt', '10', 'moretxt', 'zz', '', 'sometxt', '999', 'moretxt', '')
I hope its sheer ugliness will inspire better solutions :)
This solution finds the two longest common substrings and uses them to delimit the input strings:
def an_answer_to_stackoverflow_question_1914394(lst):
    """
    >>> lst = ['asometxt0moretxt', 'bsometxt1moretxt', 'aasometxt10moretxt', 'zzsometxt999moretxt']
    >>> an_answer_to_stackoverflow_question_1914394(lst)
    (['sometxt', 'moretxt'], [('a', '0'), ('b', '1'), ('aa', '10'), ('zz', '999')])
    """
    delimiters = find_delimiters(lst)
    return delimiters, list(split_strings(lst, delimiters))
find_delimiters and friends find the delimiters:
import itertools

def find_delimiters(lst):
    """
    >>> lst = ['asometxt0moretxt', 'bsometxt1moretxt', 'aasometxt10moretxt', 'zzsometxt999moretxt']
    >>> find_delimiters(lst)
    ['sometxt', 'moretxt']
    """
    candidates = list(itertools.islice(find_longest_common_substrings(lst), 3))
    if len(candidates) == 3 and len(candidates[1]) == len(candidates[2]):
        raise ValueError("Unable to find useful delimiters")
    if candidates[1] in candidates[0]:
        raise ValueError("Unable to find useful delimiters")
    return candidates[0:2]

def find_longest_common_substrings(lst):
    """
    >>> lst = ['asometxt0moretxt', 'bsometxt1moretxt', 'aasometxt10moretxt', 'zzsometxt999moretxt']
    >>> list(itertools.islice(find_longest_common_substrings(lst), 3))
    ['sometxt', 'moretxt', 'sometx']
    """
    # Yield common substrings from longest to shortest.
    for i in range(min_length(lst), 0, -1):
        for substring in common_substrings(lst, i):
            yield substring

def min_length(lst):
    return min(len(item) for item in lst)

def common_substrings(lst, length):
    """
    >>> list(common_substrings(["hello", "world"], 2))
    []
    >>> list(common_substrings(["aabbcc", "dbbrra"], 2))
    ['bb']
    """
    assert length <= min_length(lst)
    returned = set()
    for i, item in enumerate(lst):
        for substring in all_substrings(item, length):
            # Keep the substring only if every other string also contains it.
            in_all_others = True
            for j, other_item in enumerate(lst):
                if j == i:
                    continue
                if substring not in other_item:
                    in_all_others = False
            if in_all_others:
                if substring not in returned:
                    returned.add(substring)
                    yield substring

def all_substrings(item, length):
    """
    >>> list(all_substrings("hello", 2))
    ['he', 'el', 'll', 'lo']
    """
    for i in range(len(item) - length + 1):
        yield item[i:i+length]
split_strings splits the strings using the delimiters:
import re

def split_strings(lst, delimiters):
    """
    >>> lst = ['asometxt0moretxt', 'bsometxt1moretxt', 'aasometxt10moretxt', 'zzsometxt999moretxt']
    >>> list(split_strings(lst, find_delimiters(lst)))
    [('a', '0'), ('b', '1'), ('aa', '10'), ('zz', '999')]
    """
    for item in lst:
        parts = re.split("|".join(delimiters), item)
        yield tuple(part for part in parts if part != '')
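All of the docstrings above are doctests, so the whole solution can be verified in one go; a quick sketch, assuming the functions are saved in a (hypothetical) file called delimiters.py:

# from the shell:
#   python -m doctest delimiters.py -v
# or programmatically:
import doctest
import delimiters  # hypothetical module containing the functions above
doctest.testmod(delimiters, verbose=True)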
How about subbing out the known text, and then splitting?
import re
[re.sub('(sometxt|moretxt)', ',', x).split(',') for x in lst]
# results in
[['a', '0', ''], ['b', '1', ''], ['aa', '10', ''], ['zz', '999', '']]
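If the literal delimiters are not known up front, the same one-liner can reuse find_delimiters from the answer above; a sketch that also drops the empty fields re.split leaves behind (re.escape guards against delimiters containing regex metacharacters):

import re

delims = find_delimiters(lst)  # ['sometxt', 'moretxt'] for the sample data
[[part for part in re.split('|'.join(map(re.escape, delims)), x) if part] for x in lst]
# [['a', '0'], ['b', '1'], ['aa', '10'], ['zz', '999']]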