"I've been working on a project that manages big lists of words and passes them through a lot of tests to validate or reject each word of the list. The funny thing is that each ti…"
Since your true question is answered, I'll take a shot at the implied question:
You can get a free speed boost just by doing unallowed_combinations = sorted(set(unallowed_combinations)), since it contains duplicates.
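For reference, the question's checkers aren't reproduced above; here is a minimal sketch of what a loop-based combination_is_valid might look like, assuming unallowed_combinations is a list of forbidden substrings (the sample data and the function body are my guesses, not the original code):

# Hypothetical data and reconstruction of the question's loop-based checker;
# the real combination_is_valid may differ.
unallowed_combinations = ["ab", "cd", "ab", "pq"]             # note the duplicate "ab"
unallowed_combinations = sorted(set(unallowed_combinations))  # dedupe once, up front

def combination_is_valid(string):
    # A word is valid only if it contains none of the forbidden substrings.
    return not any(combination in string for combination in unallowed_combinations)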
Given that, the fastest way I know of doing this is
import re

valid3_re = re.compile("|".join(map(re.escape, unallowed_combinations)))

def combination_is_valid3(string):
    return not valid3_re.search(string)
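As a usage sketch (the word list here is made up, not from the question), the compiled pattern can filter a whole list directly:

words = ["abacus", "python", "squid"]  # hypothetical input words
valid_words = [word for word in words if combination_is_valid3(word)]
print(valid_words)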
With CPython 3.5 I get, for some test data with a line length of 60 characters,
combination_is_valid ended in 3.3051061630249023 seconds
combination_is_valid2 ended in 2.216959238052368 seconds
combination_is_valid3 ended in 1.4767844676971436 seconds
where the third is the regex version, and on PyPy3 I get
combination_is_valid ended in 2.2926249504089355 seconds
combination_is_valid2 ended in 2.0935239791870117 seconds
combination_is_valid3 ended in 0.14300894737243652 seconds
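The timings come from a simple wall-clock pass over generated test data; the exact harness isn't shown here, but a sketch along these lines (the data generation and sizes are my assumptions) produces the "ended in ... seconds" lines above:

import random
import time

def benchmark(checker, lines):
    # Run the checker over every line once and report elapsed wall-clock time.
    start = time.time()
    for line in lines:
        checker(line)
    print("{} ended in {} seconds".format(checker.__name__, time.time() - start))

# Hypothetical test data: 100,000 random lines of 60 lowercase letters each.
alphabet = "abcdefghijklmnopqrstuvwxyz"
lines = ["".join(random.choice(alphabet) for _ in range(60)) for _ in range(100000)]
benchmark(combination_is_valid3, lines)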
FWIW, this is competitive with Rust (a low-level language, like C++) and actually noticeably wins out on the regex side. Shorter strings favour PyPy over CPython a lot more (e.g. about 4x faster than CPython for a line length of 10), since overhead matters more there.
Since only about a third of CPython's regex runtime is loop overhead, we can conclude that PyPy's regex implementation is better optimized for this use case. I'd recommend checking whether there is an alternative regex implementation for CPython that makes it competitive with PyPy.
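One candidate worth trying is the third-party regex package from PyPI, which exposes a re-compatible API; I haven't measured whether it actually closes the gap here:

# pip install regex  (drop-in replacement for re; performance would need benchmarking)
import regex

valid4_re = regex.compile("|".join(map(regex.escape, unallowed_combinations)))

def combination_is_valid4(string):
    return not valid4_re.search(string)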