list comprehension filtering - “the set() trap”

野性不改 2020-11-30 04:16

A reasonably common operation is to filter one list based on another list. People quickly find that this:

[x for x in list_1 if x          


        
5 Answers
  • 2020-11-30 04:50

    From What’s New In Python 3.2:

    Python’s peephole optimizer now recognizes patterns such as x in {1, 2, 3} as being a test for membership in a set of constants. The optimizer recasts the set as a frozenset and stores the pre-built constant.
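
    One way to observe this, as a minimal sketch that assumes CPython (the constant table is an implementation detail, and the helper names in_literal, in_call and frozenset_constants are made up for this illustration), is to look for a pre-built frozenset among a function's constants:

```python
def in_literal(x):
    # Membership test against a set literal made only of constants
    return x in {1, 2, 3}


def in_call(x, items):
    # Membership test against a set produced by calling set()
    return x in set(items)


def frozenset_constants(func):
    # Collect any frozenset objects stored in the function's constant table
    return [c for c in func.__code__.co_consts if isinstance(c, frozenset)]


literal_consts = frozenset_constants(in_literal)
call_consts = frozenset_constants(in_call)

print(literal_consts)  # the literal was stored as a pre-built frozenset
print(call_consts)     # nothing pre-built: set(items) runs on every call
```

    The literal version carries its set around as a constant, while the set(items) version has to call set each time the expression is evaluated.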

  • 2020-11-30 04:50

    The basic reason is that a literal really can't change, whereas if it's an expression like set(list_2), it's possible that evaluating the target expression or the iterable of the comprehension could change the value of set(list_2). For instance, if you have

    [f(x) for x in list_1 if x in set(list_2)]
    

    It is possible that f modifies list_2.

    Even for a simple [x for x in blah ...] expression, it's theoretically possible that the __iter__ method of blah could modify list_2.

    I would imagine there is some scope for optimizations, but the current behavior keeps things simpler. If you start adding optimizations for things like "it is only evaluated once if the target expression is a single bare name and the iterable is a builtin list or dict..." you make it much more complicated to figure out what will happen in any given situation.
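
    To make the f case concrete, here is a small sketch with made-up data and a made-up f that pops from list_2 as a side effect. Rebuilding the set on every iteration and building it once give different answers, which is why the interpreter cannot hoist the call on its own:

```python
def make_case():
    # Fresh data for each run, plus an f that mutates list_2 as a side effect
    list_1 = [1, 2, 3, 4]
    list_2 = [1, 2, 3, 4]

    def f(x):
        list_2.pop()  # removes the last element of list_2
        return x

    return list_1, list_2, f


# set(list_2) is re-evaluated for every x, so the filter sees each mutation
list_1, list_2, f = make_case()
inline = [f(x) for x in list_1 if x in set(list_2)]

# The set is built once up front, so later mutations do not affect the filter
list_1, list_2, f = make_case()
lookup = set(list_2)
hoisted = [f(x) for x in list_1 if x in lookup]

print(inline)   # [1, 2]
print(hoisted)  # [1, 2, 3, 4]
```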

  • 2020-11-30 04:51

    "So now I'm wondering - why can python 3.x optimize away the set literal to only build once, but not set(list_2)?"

    No one's mentioned this issue yet: how do you know set([1,2,3]) and {1, 2, 3} are the same thing?

    >>> import random
    >>> def set(arg):
    ...     return [random.choice(range(5))]
    ... 
    >>> list1 = list(range(5))
    >>> [x for x in list1 if x in set(list1)]
    [0, 4]
    >>> [x for x in list1 if x in set(list1)]
    [0]
    

    You can't shadow a literal; you can shadow set. So before you can consider hoisting, you need to know not just that list1 isn't being affected, you need to be sure that set is what you think it is. Sometimes you can do that, either under restrictive conditions at compile time or more conveniently at runtime, but it's definitely nontrivial.

    It's kind of funny: often when the suggestion of doing optimizations like this comes up, one pushback is that as nice as they are, it makes it harder to reason about what Python performance is going to be like, even algorithmically. Your question provides some evidence for this objection.

  • 2020-11-30 04:53

    In order to optimize set(list_2), the interpreter needs to prove that list_2 (and all of its elements) do not change between iterations. This is a hard problem in the general case, and it would not surprise me if the interpreter does not even attempt to tackle it.

    On the other hand a set literal cannot change its value between iterations, so the optimization is known to be safe.
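
    Since the interpreter does not do this for you, the usual fix is to build the set once yourself. The sketch below uses a hypothetical counting_set wrapper (not part of any library) and sample lists to count how often the set is constructed in each form:

```python
calls = []


def counting_set(iterable):
    # Behaves like set(), but records each call so we can count them
    calls.append(1)
    return set(iterable)


list_1 = list(range(5))
list_2 = [1, 3]

# The condition is evaluated once per element of list_1
inline = [x for x in list_1 if x in counting_set(list_2)]
inline_calls = len(calls)

calls.clear()

# Build the set once, then reuse it inside the comprehension
lookup = counting_set(list_2)
hoisted = [x for x in list_1 if x in lookup]
hoisted_calls = len(calls)

print(inline_calls, hoisted_calls)  # 5 1
```

    Both forms return the same list here, because nothing mutates list_2 along the way. Only the number of set constructions differs.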

  • 2020-11-30 05:17

    Too long for a comment

    This won't speak to the optimization details or the Python 2 vs. Python 3 differences. But when I run into this situation, I sometimes find it useful to make a context manager out of the data object:

    class context_set(set):
        def __enter__(self):
            return self
        def __exit__(self, *args):
            pass
    
    def context_version():
        with context_set(list_2) as s:
            return [x for x in list_1 if x in s]
    

    Using this I see:

    In [180]: %timeit context_version()
    100 loops, best of 3: 17.8 ms per loop
    

    and in some cases, it provides a nice stop-gap between creating the object before the comprehension vs. creating it within the comprehension, and allows custom tear-down code if you want it.

    A more generic version can be made using contextlib.contextmanager. Here's a quick-and-dirty version of what I mean.

    from contextlib import contextmanager
    def context(some_type):
        # Wrap a constructor so that its result can be bound by a with-statement.
        @contextmanager
        def build(x):
            yield some_type(x)
        return build
    

    Then one can do:

    with context(set)(list_2) as s:
        # ...
    

    or just as easily

    with context(tuple)(list_2) as t:
        # ...
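
    For completeness, a runnable sketch of the pattern above. The sample lists are made up, and context is redefined here in an equivalent decorator form so that the block stands on its own:

```python
from contextlib import contextmanager


def context(some_type):
    # Wrap a constructor so that its result can be bound by a with-statement
    @contextmanager
    def build(x):
        yield some_type(x)
    return build


list_1 = [1, 2, 3, 4, 5]
list_2 = [2, 4, 6]

# The set is built once on entry and reused for every membership test
with context(set)(list_2) as s:
    filtered = [x for x in list_1 if x in s]

# The same wrapper works for any one-argument constructor
with context(tuple)(list_2) as t:
    as_tuple = t

print(filtered)  # [2, 4]
print(as_tuple)  # (2, 4, 6)
```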
    