Find *most* common prefix of strings - a better way?

故事扮演 提交于 2019-12-08 03:24:10

问题


I have a list of keys ['foo_a','foo_b','foo_c','fnord']

All of the similar solutions here assume that you have no fnord's in your text.

I have this code that does the job:

def detect_prefix(keys):
    PCT = 0.70 # cutof 
    pre = ''
    l = len(keys)
    for i in range(0, len(max(keys, key=len))):
        keys = filter(lambda k: k.startswith(pre), keys)
        cnt = dict()
        for k in map(lambda k: k[i], keys):
            cnt.setdefault(k,0)
            cnt[k] +=1
        if cnt[max(cnt)] / float(l) >= PCT:
            pre += max(cnt)
        else:
            break
    return pre

I have a strong suspicion that this could be done much more elegantly, but my python-fu is not strong enough right now.

I would love to hear some suggestions.

Edit. Additional background and clarification.
These are keys that other developers have put in an application to use for translation. They should have a common prefix, but people forget, and they cut and paste from other code. "_" as a prefix separator is just a convention. It would be best not to assume a separator is even used. 70% is a totally arbitrary threshold. "most prevalent" or "predominant" would have worked too.
And yes, this is python 2.7, and the space inside the quotes is just a visual artefact.


回答1:


If you know the desired threshold frequency for the common prefix:

#!/usr/bin/env python
from collections import Counter
from itertools import izip_longest

strings = ['foo_a','foo_b','foo_c','fnord']
threshold = .7 * len(strings)
prefix = []
for chars in izip_longest(*strings, fillvalue=''):
    char, count = Counter(chars).most_common(1)[0]
    if count < threshold:
        break
    prefix.append(char)
print(''.join(prefix))
# -> foo_

Or you could collect all common prefixes and their frequencies and decide later:

#!/usr/bin/env python
from collections import Counter
from itertools import izip_longest

strings = ['foo_a', 'foo_b','foo_c','fnord']
assert len(strings) > 1
threshold = len(strings)
prefix = []
prefixes = []
for chars in izip_longest(*strings, fillvalue=''):
    char, count = Counter(chars).most_common(1)[0]
    if count == 1:
        break
    elif count < threshold:
        if prefix:
            prefixes.append((''.join(prefix), threshold))
        threshold = count
    prefix.append(char)
if prefix:
    prefixes.append((''.join(prefix), threshold))
print(prefixes)
# -> [('f', 4), ('foo_', 3)]

Both code examples assume that the predominant prefix exists i.e., the most common character at each position belongs to the most common prefix.




回答2:


A nice way to find which things have a particular prefix is a trie. I used an implementation called pytrie, but they all work fairly much the same way. The only fun bit is you still need to generate all the prefixes another way, since asking the trie for "all the prefixes of foo_a" only gives you "foo_a" and all the prefix strings of it that are part of your data, but you seem to care about "foo_" even though it isn't a key of its own. However, it can do it the other way around, telling you all the keys that start with a given prefix even if it isn't explicitly stored.

Other than that, its all fairly straightforward. Including the import, it comes in at five lines:

from pytrie import StringTrie as trie

data = trie.fromkeys(['foo_a','foo_b','foo_c','fnord'])
PCT = 0.70 
prefixes = (k[:i] for k in data for i,_ in enumerate(k, start=1))
print(max(filter(lambda x: len(data.keys(x)) >= PCT * len(data), prefixes), key=len))

prints foo_.




回答3:


def det_pref(words):
    cnt =  {'':len(words)}
    for w_pfx in itertools.chain.from_iterable([[w[:i] for i in range(1,len(w)+1)] for  w in words]):
         cnt[w_pfx] = 1 + cnt.get(w_pfx,0)
    return max([w_pfx for (w_pfx,n) in cnt.items() if n/len(words)>0.7])

Caveat: as this solution does no early-out and input-reduction during the loop, it is less efficient than the original code.

Here is a more efficient approach, which is IMHO still pythonic, but harder to understand and longer than my first one:

def det_pref(words):
    cnt = dict()
    pfx_len = [len(w) for w in words]
    while any(pfx_len):
        for i,w_pfx in [(i,words[i][:l]) for i,l in enumerate(pfx_len)]:
            pfx_len[i] -= pfx_len[i] and 1 or 0
            n = 1 + cnt.get(w_pfx,0)
            if n/len(words)> 0.7:
                return w_pfx
            cnt[w_pfx] = n
    return ''


来源:https://stackoverflow.com/questions/22628428/find-most-common-prefix-of-strings-a-better-way

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!