why is “any()” running slower than using loops?

再見小時候 2021-01-04 10:25

I've been working on a project that manages big lists of words and passes them through a lot of tests to validate or reject each word of the list. The funny thing is that each ti

2 Answers
  •  野趣味 (OP)
     2021-01-04 10:56

    Since your true question is answered, I'll take a shot at the implied question:

    You can get a free speed boost by just doing unallowed_combinations = sorted(set(unallowed_combinations)), since it contains duplicates.
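
    A minimal sketch of that first step, with made-up sample values (the real unallowed_combinations comes from the asker's project):

```python
# Assumed sample blacklist with duplicates; the real list comes from the project.
unallowed_combinations = ["qx", "aa", "qx", "zz", "aa"]

# Deduplicate and sort once, up front, before building any matcher.
unallowed_combinations = sorted(set(unallowed_combinations))
print(unallowed_combinations)  # ['aa', 'qx', 'zz']
```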

    Given that, the fastest way I know of doing this is

    import re

    valid3_re = re.compile("|".join(map(re.escape, unallowed_combinations)))

    def combination_is_valid3(string):
        return not valid3_re.search(string)
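
    Put together as a self-contained sketch (the blacklist values here are assumptions for illustration, not the asker's data):

```python
import re

# Assumed sample blacklist; in the original project this is a large list.
unallowed_combinations = sorted(set(["aa", "qx", "zz"]))

# One alternation regex that matches any banned substring.
valid3_re = re.compile("|".join(map(re.escape, unallowed_combinations)))

def combination_is_valid3(string):
    # A word is valid iff no banned substring occurs anywhere in it.
    return not valid3_re.search(string)

print(combination_is_valid3("hello"))   # True: no banned substring
print(combination_is_valid3("piqxel"))  # False: contains "qx"
```

    re.escape matters here: it keeps blacklist entries containing regex metacharacters (like "." or "+") from being misread as patterns.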
    

    With CPython 3.5 I get, for some test data with a line length of 60 characters,

    combination_is_valid ended in 3.3051061630249023 seconds
    combination_is_valid2 ended in 2.216959238052368 seconds
    combination_is_valid3 ended in 1.4767844676971436 seconds
    

    where the third is the regex version, and on PyPy3 I get

    combination_is_valid ended in 2.2926249504089355 seconds
    combination_is_valid2 ended in 2.0935239791870117 seconds
    combination_is_valid3 ended in 0.14300894737243652 seconds
    

    FWIW, this is competitive with Rust (a low-level language, like C++), and the regex version actually wins out noticeably. Shorter strings favour PyPy over CPython a lot more (e.g. 4x faster than CPython at a line length of 10), since per-call overhead matters more there.

    Since only about a third of CPython's regex runtime is loop overhead, we conclude that PyPy's regex implementation is better optimized for this use-case than CPython's. It may be worth checking whether an alternative regex implementation for CPython can make it competitive with PyPy.
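
    A minimal timing harness in the spirit of the numbers above (the any()-based checker, the blacklist, and the test data are all assumptions standing in for the asker's originals):

```python
import re
import time

# Assumed sample blacklist; the real one is much larger.
unallowed_combinations = sorted(set(["qx", "zz", "vv"]))
valid3_re = re.compile("|".join(map(re.escape, unallowed_combinations)))

def combination_is_valid(string):
    # Hypothetical any()-based checker, one substring test per blacklist entry.
    return not any(c in string for c in unallowed_combinations)

def combination_is_valid3(string):
    # Single compiled-regex pass over the string.
    return not valid3_re.search(string)

# Assumed test data: lines of length 60, some containing a banned substring.
words = ["a" * 60, "qx" + "a" * 58] * 10000

for fn in (combination_is_valid, combination_is_valid3):
    start = time.time()
    results = [fn(w) for w in words]
    print(fn.__name__, "ended in", time.time() - start, "seconds")
```

    Both checkers must agree on every word; only the timing differs, and the gap grows with the size of the blacklist.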
