Why is insertion into my tree faster on sorted input than random input?

慢半拍i 2021-02-02 12:33

Now I've always heard binary search trees are faster to build from randomly selected data than ordered data, simply because ordered data requires explicit rebalancing to keep t

8 answers
  • 2021-02-02 13:02

    I added a calculation of the standard deviation, and changed your test to run at the highest priority (to reduce noise as much as possible). These are the results (mean time per run in milliseconds, one row per insert count from 50 up to 128000):

    Random                                   Ordered
    0,2835 (stddev 0,9946)                   0,0891 (stddev 0,2372)
    0,1230 (stddev 0,0086)                   0,0780 (stddev 0,0031)
    0,2498 (stddev 0,0662)                   0,1694 (stddev 0,0145)
    0,5136 (stddev 0,0441)                   0,3550 (stddev 0,0658)
    1,1704 (stddev 0,1072)                   0,6632 (stddev 0,0856)
    1,4672 (stddev 0,1090)                   0,8343 (stddev 0,1047)
    3,3330 (stddev 0,2041)                   1,9272 (stddev 0,3456)
    7,9822 (stddev 0,3906)                   3,7871 (stddev 0,1459)
    18,4300 (stddev 0,6112)                  10,3233 (stddev 2,0247)
    44,9500 (stddev 2,2935)                  22,3870 (stddev 1,7157)
    110,5275 (stddev 3,7129)                 49,4085 (stddev 2,9595)
    275,4345 (stddev 10,7154)                107,8442 (stddev 8,6200)
    667,7310 (stddev 20,0729)                242,9779 (stddev 14,4033)
    

    I ran a sampling profiler and here are the results (how often the program was found in each method):

    Method           Random        Ordered
    HeapifyRight()   1.95          5.33
    get_IsEmpty()    3.16          5.49
    Make()           3.28          4.92
    Insert()         16.01         14.45
    HeapifyLeft()    2.20          0.00
    

    Conclusion: random input splits the rebalancing work fairly evenly between HeapifyLeft() and HeapifyRight(), while ordered input never calls HeapifyLeft() at all.

    Here is my improved "benchmark" program:

        // Needs: using System; using System.Collections.Generic; using System.Diagnostics;
        //        using System.Linq; using System.Threading;
        // RandomInsert and OrderedInsert are the methods from the question's code.
        static void Main(string[] args)
        {
            // Raise priorities to reduce scheduling noise in the timings.
            Thread.CurrentThread.Priority = ThreadPriority.Highest;
            Process.GetCurrentProcess().PriorityClass = ProcessPriorityClass.RealTime;
    
            int[] insertCounts = { 50, 100, 200, 400, 800, 1000, 2000, 4000, 8000, 16000, 32000, 64000, 128000 };
    
            // Same order as before: all random runs first, then all ordered runs.
            List<String> rndTimes = insertCounts.Select(n => TimeIt(n, RandomInsert)).ToList();
            List<String> orderedTimes = insertCounts.Select(n => TimeIt(n, OrderedInsert)).ToList();
    
            // Pair rows by position; joining on IndexOf would mispair rows with identical text.
            var result = string.Join("\n", rndTimes.Zip(orderedTimes,
                            (s, s2) => String.Format("{0} \t\t {1}", s, s2)).ToArray());
            Console.WriteLine(result);
            Console.WriteLine("Done");
            Console.ReadLine();
        }
    
        // Population standard deviation, computed as sqrt(mean of squares - square of mean).
        static double StandardDeviation(List<double> doubleList)
        {
            double average = doubleList.Average();
            double sumOfSquares = 0;
            foreach (double value in doubleList)
            {
                sumOfSquares += value * value;
            }
            double meanOfSquares = sumOfSquares / doubleList.Count;
            // Clamp at zero: rounding can push the difference slightly below zero.
            return Math.Sqrt(Math.Max(0.0, meanOfSquares - (average * average)));
        }
    
        // Runs f(insertCount) ITERATION_COUNT times and reports the mean and standard deviation in ms.
        // ITERATION_COUNT is a constant of the test program; its value is not shown in this post.
        static String TimeIt(int insertCount, Action<int> f)
        {
            Console.WriteLine("TimeIt({0}, {1})", insertCount, f.Method.Name);
    
            List<double> times = new List<double>();
            for (int i = 0; i < ITERATION_COUNT; i++)
            {
                Stopwatch sw = Stopwatch.StartNew();
                f(insertCount);
                sw.Stop();
                times.Add(sw.Elapsed.TotalMilliseconds);
            }
    
            return String.Format("{0:f4} (stddev {1:f4})", times.Average(), StandardDeviation(times));
        }
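    
    As a sanity check on the StandardDeviation helper, the sketch below compares its one-pass formula with the textbook two-pass form on a small data set. The class name StdDevCheck, both method names, and the sample values are made up for illustration; they are not part of the benchmark program.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper class, for illustration only.
static class StdDevCheck
{
    // Same formula as the StandardDeviation helper: sqrt(mean of squares - square of mean).
    public static double PopulationStdDev(IReadOnlyList<double> xs)
    {
        double mean = xs.Average();
        double meanOfSquares = xs.Sum(x => x * x) / xs.Count;
        return Math.Sqrt(Math.Max(0.0, meanOfSquares - mean * mean));
    }

    // Textbook two-pass form, for comparison: sqrt(mean of squared deviations).
    public static double TwoPassStdDev(IReadOnlyList<double> xs)
    {
        double mean = xs.Average();
        return Math.Sqrt(xs.Sum(x => (x - mean) * (x - mean)) / xs.Count);
    }
}
```

    For the values 2, 4, 4, 4, 5, 5, 7, 9 the mean is 5 and the mean of squares is 29, so both forms give sqrt(29 - 25) = 2. Note that this is the population standard deviation (dividing by N), not the sample standard deviation (dividing by N - 1).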
    
  • 2021-02-02 13:06

    You're only seeing a difference of about 2x. Unless you've tuned the daylights out of this code, that's basically in the noise. Most well-written programs, especially those involving data structures, can easily have more room for improvement than that. Here's an example.

    I just ran your code and took a few stackshots. Here's what I saw:

    Random Insert:

    1 Insert:64 -> HeapifyLeft:81 -> RotateRight:150
    1 Insert:64 -> Make:43 -> Treap:35
    1 Insert:68 -> Make:43
    

    Ordered Insert:

    1 Insert:61
    1 OrderedInsert:224
    1 Insert:68 -> Make:43
    1 Insert:68 -> HeapifyRight:90 -> RotateLeft:107
    1 Insert:68
    1 Insert:68 -> Insert:55 -> IsEmpty.get:51
    

    This is a pretty small number of samples, but it suggests in the case of random input that Make (line 43) is consuming a higher fraction of time. That is this code:

        private Treap<T> Make(Treap<T> left, T value, Treap<T> right, int priority)
        {
            return new Treap<T>(Comparer, left, value, right, priority);
        }
    

    I then took 20 stackshots of the Random Insert code to get a better idea of what it was doing:

    1 Insert:61
    4 Insert:64
    3 Insert:68
    2 Insert:68 -> Make:43
    1 Insert:64 -> Make:43
    1 Insert:68 -> Insert:57 -> Make:48 -> Make:43
    2 Insert:68 -> Insert:55
    1 Insert:64 -> Insert:55
    1 Insert:64 -> HeapifyLeft:81 -> RotateRight:150
    1 Insert:64 -> Make:43 -> Treap:35
    1 Insert:68 -> HeapifyRight:90 -> RotateLeft:107 -> IsEmpty.get:51
    1 Insert:68 -> HeapifyRight:88
    1 Insert:61 -> AnonymousMethod:214
    

    This reveals some information.
    25% of time is spent in line Make:43 or its callees.
    15% of time is spent in that line itself, not in a recognized callee; in other words, in new, allocating a new node.
    90% of time is spent in lines Insert:64 and Insert:68 (which call Make and heapify).
    10% of time is spent in RotateLeft and Right.
    15% of time is spent in Heapify or its callees.
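    
    The inclusive percentages above can be re-derived mechanically from the 20 samples. The following is only a sketch: the class StackshotTally and the method InclusivePercent are names invented here, and a frame counts as a match when it starts with one of the given prefixes.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper class, for illustration only.
static class StackshotTally
{
    // The 20 random-insert samples listed above: (count, call stack).
    static readonly (int Count, string Stack)[] Samples =
    {
        (1, "Insert:61"),
        (4, "Insert:64"),
        (3, "Insert:68"),
        (2, "Insert:68 -> Make:43"),
        (1, "Insert:64 -> Make:43"),
        (1, "Insert:68 -> Insert:57 -> Make:48 -> Make:43"),
        (2, "Insert:68 -> Insert:55"),
        (1, "Insert:64 -> Insert:55"),
        (1, "Insert:64 -> HeapifyLeft:81 -> RotateRight:150"),
        (1, "Insert:64 -> Make:43 -> Treap:35"),
        (1, "Insert:68 -> HeapifyRight:90 -> RotateLeft:107 -> IsEmpty.get:51"),
        (1, "Insert:68 -> HeapifyRight:88"),
        (1, "Insert:61 -> AnonymousMethod:214"),
    };

    // Percentage of samples whose stack contains a frame starting with any given prefix.
    public static double InclusivePercent(params string[] framePrefixes)
    {
        int total = Samples.Sum(s => s.Count);
        int hits = Samples
            .Where(s => s.Stack
                .Split(new[] { " -> " }, StringSplitOptions.None)
                .Any(frame => framePrefixes.Any(p => frame.StartsWith(p, StringComparison.Ordinal))))
            .Sum(s => s.Count);
        return 100.0 * hits / total;
    }
}
```

    InclusivePercent("Make:43") gives 25, InclusivePercent("Insert:64", "Insert:68") gives 90, InclusivePercent("Rotate") gives 10, and InclusivePercent("Heapify") gives 15. With only 20 samples these figures are rough; they are good for spotting the big costs, not for measuring small ones.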

    I also did a fair amount of single-stepping (at the source level), and came to the suspicion that, since the tree is immutable, it spends a lot of time making new nodes because it doesn't want to change old ones. Then the old ones are garbage collected because nobody refers to them anymore.

    This has got to be inefficient.

    I'm still not answering your question of why inserting ordered numbers is faster than randomly generated numbers, but it doesn't really surprise me, because the tree is immutable.

    I don't think you can expect any performance reasoning about tree algorithms to carry over easily to immutable trees, because the slightest change deep in the tree causes it to be rebuilt on the way back out, at a high cost in new-ing and garbage collection.
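    
    To make the rebuilding cost concrete, here is a minimal sketch of insertion into an immutable binary search tree. It is not the Treap from the question: the Node class, its Allocations counter, and the lack of priorities and rotations are all simplifications assumed for illustration. The point is only that every insert allocates a fresh copy of each node on the path from the root to the insertion point.

```csharp
using System;

// Hypothetical minimal immutable BST (no priorities, no rotations),
// used only to count how many nodes an insert allocates.
sealed class Node
{
    public static int Allocations;   // counts every node ever constructed
    public readonly int Value;
    public readonly Node Left, Right;

    public Node(int value, Node left, Node right)
    {
        Value = value;
        Left = left;
        Right = right;
        Allocations++;
    }

    // Returns a new tree containing value; the original tree is left untouched.
    public static Node Insert(Node root, int value)
    {
        if (root == null) return new Node(value, null, null);
        if (value < root.Value)
            return new Node(root.Value, Insert(root.Left, value), root.Right);
        if (value > root.Value)
            return new Node(root.Value, root.Left, Insert(root.Right, value));
        return root; // already present: share the existing subtree
    }
}
```

    Building a tree from 2, 1, 3 and then inserting 4 allocates three nodes for that last insert: the new leaf plus copies of 3 and 2. The untouched left subtree is shared between the old and new trees, and the old copies of 2 and 3 become garbage once nothing refers to the old root. A treap adds rotations on top of this, and each rotation in an immutable tree allocates further nodes.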
