Why is my MergeSort so slow in Python?

前端 未结 4 1902
春和景丽
春和景丽 2021-01-14 15:20

I\'m having some troubles understanding this behaviour. I\'m measuring the execution time with the timeit-module and get the following results for 10000 cyc

相关标签:
4条回答
  • 2021-01-14 15:54

    For starters : I cannot reproduce your timing results, on 100 cycles and lists of size 10000. The exhaustive benchmark with timeit of all implementations discussed in this answer (including bubblesort and your original snippet) is posted as a gist here. I find the following results for the average duration of a single run :

    • Python's native (Tim)sort : 0.0144600081444
    • Bubblesort : 26.9620819092
    • (Your) Original Mergesort : 0.224888720512

    Now, to make your function faster, you can do a few things.

    • Edit : Well, apparently, I was wrong on that one (thanks cwillu). Length computation takes O(1) in python. But removing useless computation everywhere still improves things a bit (Original Mergesort: 0.224888720512, no-length Mergesort: 0.195795390606):

      def nolenmerge(array1,array2):
          merged_array=[]
          while array1 or array2:
              if not array1:
                  merged_array.append(array2.pop(0))
              elif (not array2) or array1[0] < array2[0]:
                  merged_array.append(array1.pop(0))
              else:
                  merged_array.append(array2.pop(0))
          return merged_array
      
      def nolenmergeSort(array):
          n  = len(array)
          if n <= 1:
              return array
          left = array[:n/2]
          right = array[n/2:]
          return nolenmerge(nolenmergeSort(left),nolenmergeSort(right))
      
    • Second, as suggested in this answer, pop(0) is linear. Rewrite your merge to pop() at the end:

      def fastmerge(array1,array2):
          merged_array=[]
          while array1 or array2:
              if not array1:
                  merged_array.append(array2.pop())
              elif (not array2) or array1[-1] > array2[-1]:
                  merged_array.append(array1.pop())
              else:
                  merged_array.append(array2.pop())
          merged_array.reverse()
          return merged_array
      

      This is again faster: no-len Mergesort: 0.195795390606, no-len Mergesort+fastmerge: 0.126505711079

    • Third - and this would only be useful as-is if you were using a language that does tail call optimization, without it , it's a bad idea - your call to merge to merge is not tail-recursive; it calls both (mergeSort left) and (mergeSort right) recursively while there is remaining work in the call (merge).

      But you can make the merge tail-recursive by using CPS (this will run out of stack size for even modest lists if you don't do tco):

      def cps_merge_sort(array):
          return cpsmergeSort(array,lambda x:x)
      
      def cpsmergeSort(array,continuation):
          n  = len(array)
          if n <= 1:
              return continuation(array)
          left = array[:n/2]
          right = array[n/2:]
          return cpsmergeSort (left, lambda leftR:
                               cpsmergeSort(right, lambda rightR:
                                            continuation(fastmerge(leftR,rightR))))
      

      Once this is done, you can do TCO by hand to defer the call stack management done by recursion to the while loop of a normal function (trampolining, explained e.g. here, trick originally due to Guy Steele). Trampolining and CPS work great together.

      You write a thunking function, that "records" and delays application: it takes a function and its arguments, and returns a function that returns (that original function applied to those arguments).

      thunk = lambda name, *args: lambda: name(*args)
      

      You then write a trampoline that manages calls to thunks: it applies a thunk until the thunk returns a result (as opposed to another thunk)

      def trampoline(bouncer):
          while callable(bouncer):
              bouncer = bouncer()
          return bouncer
      

      Then all that's left is to "freeze" (thunk) all your recursive calls from the original CPS function, to let the trampoline unwrap them in proper sequence. Your function now returns a thunk, without recursion (and discarding its own frame), at every call:

      def tco_cpsmergeSort(array,continuation):
          n  = len(array)
          if n <= 1:
              return continuation(array)
          left = array[:n/2]
          right = array[n/2:]
          return thunk (tco_cpsmergeSort, left, lambda leftR:
                        thunk (tco_cpsmergeSort, right, lambda rightR:
                               (continuation(fastmerge(leftR,rightR)))))
      
      mycpomergesort = lambda l: trampoline(tco_cpsmergeSort(l,lambda x:x))
      

    Sadly this does not go that fast (recursive mergesort:0.126505711079, this trampolined version : 0.170638551712). OK, I guess the stack blowup of the recursive merge sort algorithm is in fact modest : as soon as you get out of the leftmost path in the array-slicing recursion pattern, the algorithm starts returning (& removing frames). So for 10K-sized lists, you get a function stack of at most log_2(10 000) = 14 ... pretty modest.

    You can do slightly more involved stack-based TCO elimination in the guise of this SO answer gives:

        def leftcomb(l):
            maxn,leftcomb = len(l),[]
            n = maxn/2
            while maxn > 1:
                leftcomb.append((l[n:maxn],False))
                maxn,n = n,n/2
            return l[:maxn],leftcomb
    
        def tcomergesort(l):
            l,stack = leftcomb(l)
            while stack: # l sorted, stack contains tagged slices
                i,ordered = stack.pop()
                if ordered:
                    l = fastmerge(l,i)
                else:
                    stack.append((l,True)) # store return call
                    rsub,ssub = leftcomb(i)
                    stack.extend(ssub) #recurse
                    l = rsub
            return l
    

    But this goes only a tad faster (trampolined mergesort: 0.170638551712, this stack-based version:0.144994809628). Apparently, the stack-building python does at the recursive calls of our original merge sort is pretty inexpensive.

    The final results ? on my machine (Ubuntu natty's stock Python 2.7.1+), the average run timings (out of of 100 runs -except for Bubblesort-, list of size 10000, containing random integers of size 0-10000000) are:

    • Python's native (Tim)sort : 0.0144600081444
    • Bubblesort : 26.9620819092
    • Original Mergesort : 0.224888720512
    • no-len Mergesort : 0.195795390606
    • no-len Mergesort + fastmerge : 0.126505711079
    • trampolined CPS Mergesort + fastmerge : 0.170638551712
    • stack-based mergesort + fastmerge: 0.144994809628
    0 讨论(0)
  • 2021-01-14 15:55

    list.pop(0) pops the first element and has to shift all remaining ones, this is an additional O(n) operation which must not happen.

    Also, slicing a list object creates a copy:

    left = array[:len(array)/2]
    right = array[len(array)/2:]
    

    Which means you're also using O(n * log(n)) memory instead of O(n).

    I can't see BubbleSort, but I bet it works in-place, no wonder it's faster.

    You need to rewrite it to work in-place. Instead of copying part of original list, pass starting and ending indexes.

    0 讨论(0)
  • 2021-01-14 15:55

    Your merge-sort has a big constant factor, you have to run it on large lists to see the asymptotic complexity benefit.

    0 讨论(0)
  • 2021-01-14 16:15

    Umm.. 1,000 records?? You are still well within the polynomial cooefficient dominance here.. If I have selection-sort: 15 * n ^ 2 (reads) + 5 * n^2 (swaps) insertion-sort: 5 * n ^2 (reads) + 15 * n^2 (swaps) merge-sort: 200 * n * log(n) (reads) 1000 * n * log(n) (merges)

    You're going to be in a close race for a lonng while.. By the way, 2x faster in sorting is NOTHING. Try 100x slower. That's where the real differences are felt. Try "won't finish in my life-time" algorithms (there are known regular expressions that take this long to match simple strings).

    So try 1M or 1G records and let us know if you still thing merge-sort isn't doing too well.

    That being said..

    There are lots of things causing this merge-sort to be expensive. First of all, nobody ever runs quick or merge sort on small scale data-structures.. Where you have if (len <= 1), people generally put: if (len <= 16) : (use inline insertion-sort) else: merge-sort At EACH propagation level.

    Since insertion-sort is has smaller coefficent cost at smaller sizes of n. Note that 50% of your work is done in this last mile.

    Next, you are needlessly running array1.pop(0) instead of maintaining index-counters. If you're lucky, python is efficiently managing start-of-array offsets, but all else being equal, you're mutating input parameters

    Also, you know the size of the target array during merge, why copy-and-double the merged_array repeatedly.. Pre-allocate the size of the target array at the start of the function.. That'll save at least a dozen 'clones' per merge-level.

    In general, merge-sort uses 2x the size of RAM.. Your algorithm is probably using 20x because of all the temporary merge buffers (hopefully python can free structures before recursion). It breaks elegance, but generally the best merge-sort algorithms make an immediate allocation of a merge buffer equal to the size of the source array, and you perform complex address arithmetic (or array-index + span-length) to just keep merging data-structures back and forth. It won't be as elegent as a simple recursive problem like this, but it's somewhat close.

    In C-sorting, cache-coherence is your biggest enemy. You want hot data-structures so you maximize your cache. By allocating transient temp buffers (even if the memory manager is returning pointers to hot memory) you run the risk of making slow DRAM calls (pre-filling cache-lines for data you're about to over-write). This is one advantage insertion-sort,selection-sort and quick-sort have over merge-sort (when implemented as above)

    Speaking of which, something like quick-sort is both naturally-elegant code, naturally efficient-code, and doesn't waste any memory (google it on wikipedia- they have a javascript implementation from which to base your code). Squeezing the last ounce of performance out of quick-sort is hard (especially in scripting languages, which is why they generally just use the C-api to do that part), and you have a worst-case of O(n^2). You can try and be clever by doing a combination bubble-sort/quick-sort to mitigate worst-case.

    Happy coding.

    0 讨论(0)
提交回复
热议问题