sorted() using generator expressions rather than lists

梦如初夏 asked 2020-11-30 05:31

After seeing the discussion here: Python - generate the time difference I got curious. I also initially thought that a generator is faster than a list, but when it comes to sorted() I don't know. Is there any benefit to sending a generator expression to sorted() rather than a list?

8 Answers
  • 2020-11-30 05:54

    There's a huge benefit. Because sorted doesn't modify the passed-in sequence, it has to make a copy of it. If it builds that list from the generator expression, then only one list gets made. If a list comprehension is passed in, that list gets built first, and then sorted makes a copy of it to sort.

    This is reflected in the line

    newlist = PySequence_List(seq);
    

    quoted in Sven Marnach's answer. Essentially, this will unconditionally make a copy of whatever sequence is passed to it.
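    The extra intermediate list shows up in peak memory use. Here is a small sketch (Python 3.9+, for the `tracemalloc.reset_peak` call) comparing the two calling styles; the exact numbers depend on your interpreter, but the comprehension version should peak higher because two full lists exist at once:

    ```python
    import tracemalloc

    data = range(100_000)

    tracemalloc.start()
    sorted([x * 2 for x in data])   # comprehension: list built, then copied by sorted()
    _, peak_list = tracemalloc.get_traced_memory()
    tracemalloc.reset_peak()

    sorted(x * 2 for x in data)     # generator: sorted() builds the only list
    _, peak_gen = tracemalloc.get_traced_memory()
    tracemalloc.stop()

    print(peak_list > peak_gen)     # the comprehension version peaks higher
    ```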

  • 2020-11-30 06:01

    Python uses Timsort. Timsort needs to know the total number of elements up front, to compute the minrun parameter. Thus, as Sven reports, the first thing that sorted does when given a generator is to turn it into a list.

    That said, it would be possible to write an incremental version of Timsort, which consumed values from the generator more slowly - you'd just have to fix minrun before starting, and accept the pain of having some unbalanced merges at the end. Timsort works in two phases. The first phase involves a pass through the whole array, identifying runs and doing insertion sort to make runs where the data is unordered. Both run-finding and insertion sort are inherently incremental. The second phase involves a merge of the sorted runs; that would happen exactly as now.

    I don't think there would be a lot of point in this, though. Perhaps it would make memory management easier, because rather than having to read from the generator into a constantly growing array (as I assume the current implementation does, without having checked), you could read each run into a small buffer, then allocate a final-sized buffer only once, at the end. However, this would involve having 2N array slots in memory at once, whereas a growing array can be done with 1.5N if it doubles when it grows. So, probably not a good idea.
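    To make the "phase one is inherently incremental" point concrete, here is an illustrative sketch of run-finding that consumes an iterator lazily, yielding one maximal non-descending run at a time. This is a toy, not CPython's actual Timsort (which also reverses descending runs and extends short runs with insertion sort):

    ```python
    def runs(iterable):
        """Yield maximal non-descending runs from an iterable, one at a time."""
        run = []
        for x in iterable:
            if not run or x >= run[-1]:
                run.append(x)       # still non-descending: extend the current run
            else:
                yield run           # run ended: hand it off without buffering the rest
                run = [x]
        if run:
            yield run

    print(list(runs([1, 2, 5, 3, 4, 0])))  # [[1, 2, 5], [3, 4], [0]]
    ```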

  • 2020-11-30 06:01

    I also initially thought that a list comprehension is faster than a list

    What do you mean by faster than a list? Faster than an explicit for loop? For that I will say it depends: a list comprehension is mostly syntactic sugar, but it's very handy for simple loops.

    but when it comes to sorted() I don't know. Is there any benefit to sending a generator expression to sorted() rather than a list?

    The main difference between list comprehensions and generator expressions is that generator expressions avoid the overhead of building the entire list at once. Instead, they return a generator object whose elements are produced one by one, so generator expressions are typically used to save memory.

    But you have to understand one thing about Python: it's very hard to tell whether one way is faster than another just by looking at it. If you want to know, use timeit to benchmark (and benchmarking is more involved than running timeit once on a single machine).

    Read this for more info about some optimization techniques.
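    As a starting point, a minimal timeit comparison for this exact question might look like the following (labels and sizes are my own choices; absolute numbers will vary by machine, and `repeat` with `min` is the usual way to reduce noise):

    ```python
    import timeit

    setup = "data = list(range(1000, 1, -1))"
    cases = {
        "list comprehension": "sorted([x for x in data])",
        "generator expression": "sorted(x for x in data)",
    }
    # take the best of several repeats to damp scheduling noise
    timings = {label: min(timeit.repeat(stmt, setup=setup, number=1000, repeat=3))
               for label, stmt in cases.items()}
    for label, t in timings.items():
        print(f"{label}: {t:.4f} s")
    ```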

  • 2020-11-30 06:03

    To add to Dave Webb's timing answer: when you pass an optimized iterable to sorted() directly, it can be much faster; much of the overhead may come from building a list or generator of your own first:

    >>> timeit.timeit("sorted(xrange(1000, 1, -1))", number=10000)
    0.34192609786987305
    >>> timeit.timeit("sorted(range(1000, 1, -1))", number=10000)
    0.4096639156341553
    >>> timeit.timeit("sorted([el for el in xrange(1000, 1, -1)])", number=10000)
    0.6886589527130127
    >>> timeit.timeit("sorted(el for el in xrange(1000, 1, -1))", number=10000)
    0.9492318630218506
    
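    Those timings are from Python 2. In Python 3, xrange is gone and range is itself lazy, so the equivalent comparison collapses to the following (absolute numbers will differ from the ones above and vary by machine):

    ```python
    import timeit

    # Python 3: range behaves like Python 2's xrange (lazy, no list built up front)
    times = {}
    for stmt in (
        "sorted(range(1000, 1, -1))",
        "sorted([el for el in range(1000, 1, -1)])",
        "sorted(el for el in range(1000, 1, -1))",
    ):
        times[stmt] = timeit.timeit(stmt, number=1000)
        print(stmt, "->", round(times[stmt], 4))
    ```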
  • 2020-11-30 06:09

    There's no way to sort a sequence without knowing all of its elements, so any generator passed to sorted() is fully consumed.
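    A quick demonstration of that exhaustion:

    ```python
    gen = (x * x for x in range(5))
    result = sorted(gen, reverse=True)
    print(result)     # [16, 9, 4, 1, 0]
    print(list(gen))  # [] -- sorted() already consumed every element
    ```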

  • 2020-11-30 06:10

    If performance is important, why not process the data as the generator yields it, and apply the ordering to the results afterwards? Of course this only works if there is no causal dependency between iterations (i.e. the data of sorted iteration #[i] is not needed for any calculation in sorted iteration #[i + 1]). What I am trying to say is that sorting a set of potentially large structures yielded by the generator may add a lot of unnecessary complexity to an ordering that could take place after processing all elements.
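    A sketch of that idea, with a hypothetical `records()` generator standing in for a source of large structures: each record is reduced to the small tuple we actually need as it's yielded, and only those tuples are sorted at the end:

    ```python
    def records():
        # hypothetical generator yielding large structures
        yield {"name": "b", "payload": list(range(100))}
        yield {"name": "a", "payload": list(range(200))}

    # process each record as it's yielded, keeping only what's needed...
    results = [(rec["name"], len(rec["payload"])) for rec in records()]
    # ...then apply the ordering to the small processed results
    results.sort(key=lambda r: r[0])
    print(results)  # [('a', 200), ('b', 100)]
    ```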
