How to find the longest common substring of multiple strings?

说谎
说谎 2020-12-10 19:22

I am writing a Python script in which I have multiple strings.

For example:

x = "brownasdfoersjumps"
y = "foxsxzxasis12sa[[#brown"
z = "thissasbrow


        
1 answer
  • 2020-12-10 19:51

    This is a relatively optimised naïve algorithm. You first transform each sequence into a set of all its ngrams. Then you intersect all sets and find the longest ngram in the intersection.

    from functools import partial, reduce
    from itertools import chain
    from typing import Iterator
    
    
    def ngram(seq: str, n: int) -> Iterator[str]:
        return (seq[i: i+n] for i in range(0, len(seq)-n+1))
    
    
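    def allngram(seq: str) -> set:
        # Use lengths 1..len(seq) inclusive. range(len(seq)) would start at 0
        # (yielding empty strings) and would skip the full string itself.
        lengths = range(1, len(seq) + 1)
        ngrams = map(partial(ngram, seq), lengths)
        return set(chain.from_iterable(ngrams))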
    
    
    sequences = ["brownasdfoersjumps",
                 "foxsxzxasis12sa[[#brown",
                 "thissasbrownxc-34a@s;"]
    
    seqs_ngrams = map(allngram, sequences)
    intersection = reduce(set.intersection, seqs_ngrams)
    # default="" avoids a ValueError when the strings share no substring
    longest = max(intersection, key=len, default="")  # -> brown
    

    While this might get you through short sequences, the algorithm is extremely inefficient on long ones, because each sequence of length m produces on the order of m² ngrams. If your sequences are long, you can cap the largest ngram length that gets generated. A safe upper bound is the length of the shortest sequence, since no common substring can be longer than that.

    def allngram(seq: str, minn=1, maxn=None) -> set:
        # Generate ngrams of lengths minn..maxn inclusive;
        # maxn defaults to the full length of seq.
        upper = len(seq) if maxn is None else maxn
        lengths = range(minn, upper + 1)
        ngrams = map(partial(ngram, seq), lengths)
        return set(chain.from_iterable(ngrams))
    
    
    sequences = ["brownasdfoersjumps",
                 "foxsxzxasis12sa[[#brown",
                 "thissasbrownxc-34a@s;"]
    
    maxn = min(map(len, sequences))
    seqs_ngrams = map(partial(allngram, maxn=maxn), sequences)
    intersection = reduce(set.intersection, seqs_ngrams)
    # default="" avoids a ValueError when the strings share no substring
    longest = max(intersection, key=len, default="")  # -> brown
    

    This may still take too long (or make your machine run out of RAM), so you might want to read about some optimal algorithms (see the link I left in my comment to your question).
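One way to reduce memory use, sketched here under assumptions and not taken from the link mentioned above, is to binary-search on the length of the answer. If a common substring of length k exists, every shorter piece of it is common too, so the feasible lengths form a contiguous range starting at 1. Only the ngrams of a single length are held in memory at any time. The helper names `kgrams`, `common_of_length` and `longest_common_substring` are hypothetical and introduced only for this illustration.

```python
from typing import Sequence, Set


def kgrams(seq: str, k: int) -> Set[str]:
    """All contiguous substrings of seq with length exactly k."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def common_of_length(sequences: Sequence[str], k: int) -> Set[str]:
    """Substrings of length k that occur in every sequence."""
    common = kgrams(sequences[0], k)
    for seq in sequences[1:]:
        common &= kgrams(seq, k)
        if not common:  # nothing left to intersect, stop early
            break
    return common


def longest_common_substring(sequences: Sequence[str]) -> str:
    """Binary search on the length of the longest common substring."""
    if not sequences:
        return ""
    best = ""
    lo, hi = 1, min(map(len, sequences))
    while lo <= hi:
        mid = (lo + hi) // 2
        common = common_of_length(sequences, mid)
        if common:
            best = min(common)  # deterministic pick if several tie
            lo = mid + 1        # a longer common substring may exist
        else:
            hi = mid - 1        # length mid is too long
    return best


sequences = ["brownasdfoersjumps",
             "foxsxzxasis12sa[[#brown",
             "thissasbrownxc-34a@s;"]

print(longest_common_substring(sequences))  # -> brown
```

This is still not an optimal algorithm; approaches based on suffix trees or suffix automata scale better on very long inputs.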

    Update

    To count the number of strings in which each ngram occurs (allngram returns a set, so each string contributes a given ngram at most once):

    from collections import Counter
    sequences = ["brownasdfoersjumps",
                 "foxsxzxasis12sa[[#brown",
                 "thissasbrownxc-34a@s;"]
    
    seqs_ngrams = map(allngram, sequences)
    counts = Counter(chain.from_iterable(seqs_ngrams))
    

    Counter is a subclass of dict, so its instances have similar interfaces:

    print(counts)
    Counter({'#': 1,
             '#b': 1,
             '#br': 1,
             '#bro': 1,
             '#brow': 1,
             '#brown': 1,
             '-': 1,
             '-3': 1,
             '-34': 1,
             '-34a': 1,
             '-34a@': 1,
             '-34a@s': 1,
             '-34a@s;': 1,
             ...
    

    You can filter the counts to leave substrings occurring in at least n strings: {string: count for string, count in counts.items() if count >= n}
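The filtering step can be sketched end to end as follows. This is a minimal, self-contained illustration: the helper `substrings` is a hypothetical stand-in for the ngram functions above, and the threshold `n = 2` is an assumed value chosen only for the example.

```python
from collections import Counter
from itertools import chain


def substrings(seq: str) -> set:
    """Every non-empty contiguous substring of seq, as a set."""
    return {seq[i:j]
            for i in range(len(seq))
            for j in range(i + 1, len(seq) + 1)}


sequences = ["brownasdfoersjumps",
             "foxsxzxasis12sa[[#brown",
             "thissasbrownxc-34a@s;"]

# Each string contributes a set, so a substring is counted
# at most once per string, however often it repeats inside it.
counts = Counter(chain.from_iterable(map(substrings, sequences)))

n = 2  # assumed threshold: keep substrings found in at least n strings
frequent = {string: count for string, count in counts.items() if count >= n}

# Longest substring that meets the threshold (empty string if none does)
longest = max(frequent, key=len, default="")
print(longest, frequent.get(longest))
```

Setting `n = len(sequences)` keeps only the substrings common to every string, which reduces to the original problem.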
