Finding whether a string starts with one of a list's variable-length prefixes

瘦欲@ 提交于 2019-11-30 11:02:43

A bit hard to read, but this works:

name=name[len(filter(name.startswith,prefixes+[''])[0]):]

str.startswith(prefix[, start[, end]])¶

Return True if string starts with the prefix, otherwise return False. prefix can also be a tuple of prefixes to look for. With optional start, test string beginning at that position. With optional end, stop comparing string at that position.

$ ipython
Python 3.5.2 (default, Nov 23 2017, 16:37:01)
Type 'copyright', 'credits' or 'license' for more information
IPython 6.4.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: prefixes = ("i_", "c_", "m_", "l_", "d_", "t_", "e_", "b_")

In [2]: 'test'.startswith(prefixes)
Out[2]: False

In [3]: 'i_'.startswith(prefixes)
Out[3]: True

In [4]: 'd_a'.startswith(prefixes)
Out[4]: True
for prefix in prefixes:
    if name.startswith(prefix):
        name=name[len(prefix):]
        break

If you define prefix to be the characters before an underscore, then you can check for

if name.partition("_")[0] in ["i", "c", "m", "l", "d", "t", "e", "b", "foo"] and name.partition("_")[1] == "_":
    name = name.partition("_")[2]

Regexes will likely give you the best speed:

prefixes = ["i_", "c_", "m_", "l_", "d_", "t_", "e_", "b_", "also_longer_"]
re_prefixes = "|".join(re.escape(p) for p in prefixes)

m = re.match(re_prefixes, my_string)
if m:
    my_string = my_string[m.end()-m.start():]

What about using filter?

prefs = ["i_", "c_", "m_", "l_", "d_", "t_", "e_", "b_"]
name = list(filter(lambda item: not any(item.startswith(prefix) for prefix in prefs), name))

Note that the comparison of each list item against the prefixes efficiently halts on the first match. This behaviour is guaranteed by the any function that returns as soon as it finds a True value, eg:

def gen():
    print("yielding False")
    yield False
    print("yielding True")
    yield True
    print("yielding False again")
    yield False

>>> any(gen()) # last two lines of gen() are not performed
yielding False
yielding True
True

Or, using re.match instead of startswith:

import re
patt = '|'.join(["i_", "c_", "m_", "l_", "d_", "t_", "e_", "b_"])
name = list(filter(lambda item: not re.match(patt, item), name))

When it comes to search and efficiency always thinks of indexing techniques to improve your algorithms. If you have a long list of prefixes you can use an in-memory index by simple indexing the prefixes by the first character into a dict.

This solution is only worth if you had a long list of prefixes and performance becomes an issue.

pref = ["i_", "c_", "m_", "l_", "d_", "t_", "e_", "b_"]

#indexing prefixes in a dict. Do this only once.
d = dict()
for x in pref:
        if not x[0] in d:
                d[x[0]] = list()
        d[x[0]].append(x)


name = "c_abcdf"

#lookup in d to only check elements with the same first character.
result = filter(lambda x: name.startswith(x),\
                        [] if name[0] not in d else d[name[0]])
print result

Regex, tested:

import re

def make_multi_prefix_matcher(prefixes):
    regex_text = "|".join(re.escape(p) for p in prefixes)
    print repr(regex_text)
    return re.compile(regex_text).match

pfxs = "x ya foobar foo a|b z.".split()
names = "xenon yadda yeti food foob foobarre foo a|b a b z.yx zebra".split()

matcher = make_multi_prefix_matcher(pfxs)
for name in names:
    m = matcher(name)
    if not m:
        print repr(name), "no match"
        continue
    n = m.end()
    print repr(name), n, repr(name[n:])

Output:

'x|ya|foobar|foo|a\\|b|z\\.'
'xenon' 1 'enon'
'yadda' 2 'dda'
'yeti' no match
'food' 3 'd'
'foob' 3 'b'
'foobarre' 6 're'
'foo' 3 ''
'a|b' 3 ''
'a' no match
'b' no match
'z.yx' 2 'yx'
'zebra' no match

This edits the list on the fly, removing prefixes. The break skips the rest of the prefixes once one is found for a particular item.

items = ['this', 'that', 'i_blah', 'joe_cool', 'what_this']
prefixes = ['i_', 'c_', 'a_', 'joe_', 'mark_']

for i,item in enumerate(items):
    for p in prefixes:
        if item.startswith(p):
            items[i] = item[len(p):]
            break

print items

Output

['this', 'that', 'blah', 'cool', 'what_this']

Could use a simple regex.

import re
prefixes = ("i_", "c_", "longer_")
re.sub(r'^(%s)' % '|'.join(prefixes), '', name)

Or if anything preceding an underscore is a valid prefix:

name.split('_', 1)[-1]

This removes any number of characters before the first underscore.

import re

def make_multi_prefix_replacer(prefixes):
    if isinstance(prefixes,str):
        prefixes = prefixes.split()
    prefixes.sort(key = len, reverse=True)
    pat = r'\b(%s)' % "|".join(map(re.escape, prefixes))
    print 'regex patern :',repr(pat),'\n'
    def suber(x, reg = re.compile(pat)):
        return reg.sub('',x)
    return suber



pfxs = "x ya foobar yaku foo a|b z."
replacer = make_multi_prefix_replacer(pfxs)               

names = "xenon yadda yeti yakute food foob foobarre foo a|b a b z.yx zebra".split()
for name in names:
    print repr(name),'\n',repr(replacer(name)),'\n'

ss = 'the yakute xenon is a|bcdf in the barfoobaratu foobarii'
print '\n',repr(ss),'\n',repr(replacer(ss)),'\n'
易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!