Why is the first run always much slower?

Posted by 只谈情不闲聊 on 2019-12-07 12:03:52

Question


I wrote a macro that reports the time required to run a given operation. It runs it a number of times and prints out the time for each run in nanoseconds. The first run always takes significantly more time than subsequent ones. Why is that so?

Here are the results of 10 x 10 runs, timing Thread.yield():

> (dotimes [x 10] (prn (times 10 (Thread/yield))))

[55395 1659 622 561 591 702 795 719 742 624]
[3255 772 884 677 787 634 605 664 629 657]
[3431 789 965 671 774 767 627 627 521 717]
[2653 780 619 632 616 614 606 602 629 667]
[2373 759 700 676 557 639 659 654 659 676]
[2884 929 627 604 689 614 614 666 588 596]
[2796 749 672 769 667 852 629 589 627 802]
[1308 514 395 321 352 345 411 339 436 315]
[1390 363 328 337 330 321 324 347 333 342]
[1461 416 410 320 414 381 380 388 388 396]

The first run of the first batch is extremely slow. I guess that's due to the JIT seeing the code for the first time, which is fair enough. But the first run in every subsequent batch is also significantly slower than the runs that follow it. Why?

The code for the times macro:

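(defmacro time
  "Evaluates expr once and returns the elapsed time in nanoseconds."
  [expr]
  `(let [t1# (System/nanoTime)]
     ~expr
     (- (System/nanoTime) t1#)))

(defmacro times
  "Evaluates expr reps times and returns a vector of the elapsed times."
  [reps expr]
  `(loop [reps# ~reps times# []]
     (if (zero? reps#)
       times#
       (recur (dec reps#) (conj times# (time ~expr))))))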

Decompiling yields the following, so System.nanoTime() seems to be called directly before and after Thread.yield(), as intended:

> (decompile (dotimes [x 10] (prn (times 10 (Thread/yield)))))

...

public Object invoke() {
    long reps__1952__auto__2355 = 10L;
    Object times__1953__auto__2356 = PersistentVector.EMPTY;
    while (reps__1952__auto__2355 != 0L) {
        final long dec = Numbers.dec(reps__1952__auto__2355);
        final IFn fn = (IFn)const__3.getRawRoot();
        final Object o = times__1953__auto__2356;
        times__1953__auto__2356 = null;
        final long t1__1946__auto__2354 = System.nanoTime();
        Thread.yield();
        times__1953__auto__2356 = fn.invoke(o, Numbers.num(Numbers.minus(System.nanoTime(), t1__1946__auto__2354)));
        reps__1952__auto__2355 = dec;
    }
    final Object o2 = times__1953__auto__2356;
    times__1953__auto__2356 = null;
    return o2;
}

Answer 1:


The first run always takes significantly more time than subsequent ones. Why is that so?

There's another tricky dependency factoring into your benchmark results: I/O. Try a few test runs that return the timing vectors rather than print them, and you should see results more in line with this:

(for [_ (range 10)]
  (times 10 (Thread/yield)))
=>
([32674 1539 1068 1063 1027 1026 1025 1031 1034 1035]
 [1335 1048 1030 1036 1043 1037 1036 1031 1034 1047]
 [1088 1043 1029 1035 1045 1035 1036 1035 1045 1047]
 [1051 1037 1032 1031 1048 1045 1039 1045 1042 1037]
 [1054 1048 1032 1036 1046 1029 1038 1038 1039 1051]
 [1050 1051 1039 1037 1038 1035 1030 1030 1045 1031]
 [1054 1045 1034 1034 1045 1037 1037 1035 1046 1044]
 [1051 1041 1032 1050 1061 1039 1045 1041 1057 1034]
 [1052 1042 1034 1032 1035 1045 1043 1038 1052 1052]
 [1053 1053 1041 1043 1053 1044 1039 1042 1051 1038])
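
To keep printing out of the measured region entirely, one option is to collect every batch before producing any output. The following is only a minimal sketch of that idea, not the asker's code: elapsed-ns and sample-ns are hypothetical names that mirror the time and times macros from the question (renamed so they do not shadow clojure.core/time), and batches is an illustrative var.

```clojure
;; Hypothetical helpers mirroring the question's macros, renamed so they
;; do not shadow clojure.core/time.
(defmacro elapsed-ns
  "Evaluates expr once and returns the elapsed time in nanoseconds."
  [expr]
  `(let [start# (System/nanoTime)]
     ~expr
     (- (System/nanoTime) start#)))

(defmacro sample-ns
  "Evaluates expr reps times and returns a vector of elapsed times."
  [reps expr]
  `(loop [n# ~reps, acc# []]
     (if (zero? n#)
       acc#
       (recur (dec n#) (conj acc# (elapsed-ns ~expr))))))

;; Measure all batches first. `for` is lazy, so `vec` forces every
;; batch to run here, before anything is printed.
(def batches
  (vec (for [_ (range 10)]
         (sample-ns 10 (Thread/yield)))))

;; Print only after all measurements are done.
(run! prn batches)
```

Because the output happens after the last measurement, any cost of prn cannot leak into the first run of the next batch.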

If you use System.out.println in your benchmark instead of prn, you should see the same slow-down behavior but much less exaggerated:

(dotimes [x 10]
  (.println System/out (times 10 (Thread/yield))))
=> nil
[33521 1733 1232 1161 1150 1135 1151 1138 1143 1144]
[1724 1205 1149 1152 1141 1149 1149 1150 1139 1145]
[1368 1156 1141 1139 1147 1149 1141 1147 1141 1149]
[1306 1159 1150 1141 1150 1148 1147 1142 1144 1149]
[1329 1161 1155 1144 1140 1155 1151 1149 1149 1140]
[1319 1154 1140 1143 1147 1154 1156 1149 1148 1145]
[1291 1166 1164 1149 1140 1150 1140 1152 1141 1139]
[4482 1194 1148 1150 1137 1165 1163 1154 1149 1152]
[1333 1184 1162 1163 1138 1149 1150 1151 1137 1145]
[1318 1150 1144 1150 1151 1147 1138 1147 1143 1149]



Answer 2:


You can see this effect even with a much cheaper, and less I/O-bound, operation than (Thread/yield), such as the constant expression 5:

user=> (doall (for [_ (range 10)] (times 10 5)))
[[390 132 134 132 109 86 94 109 115 112]
 [115 117 114 112 112 89 112 112 115 89]
 [117 106 109 109 109 86 109 109 111 109]
 [121 106 103 103 109 86 106 106 129 109]
 [117 109 106 109 112 95 111 112 109 89]
 [112 112 111 111 114 92 109 112 109 114]
 [118 111 112 111 115 88 112 109 115 92]
 [112 108 108 111 109 92 109 109 118 89]
 [115 106 112 115 112 89 112 109 114 89]
 [117 109 112 112 114 89 114 112 111 91]]

Quite interesting, isn't it? The first expression is always the slowest, or at least very close to the slowest, and bizarrely the sixth and tenth tend to be the fastest. Why should this be?

My best guess is just the mysterious power of HotSpot. There are a number of dynamically dispatched methods being called even in this very short snippet. You call conj as an IFn, and perhaps HotSpot builds up some confidence that most of your IFn calls will be to conj, so it tries to make that case faster. But at the end of each batch of 10, some other functions are called to append to the larger result list, so HotSpot backs off its optimizations, anticipating that you will start doing something else.

Or maybe it's not HotSpot at all, but rather some interaction with the CPU cache, or the operating system's virtual memory manager, or...

Of course this specific scenario is all speculation, but the point is that even when you write very simple code, you rely on a large number of very complicated systems to run it for you, and the end result is basically unknowable without devoting a great deal of study to each of the systems involved.
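
Since individual timings depend on so many moving parts, a common way to get steadier numbers is to discard some warm-up runs and summarize the rest with a median. The sketch below illustrates that general practice under stated assumptions; it is not something proposed in the thread. The names median and bench-ns are hypothetical, and the warm-up and run counts are arbitrary.

```clojure
(defn median
  "Returns the median of a non-empty collection of numbers."
  [xs]
  (let [sorted (vec (sort xs))
        n      (count sorted)
        mid    (quot n 2)]
    (if (odd? n)
      (sorted mid)
      (/ (+ (sorted (dec mid)) (sorted mid)) 2))))

(defn bench-ns
  "Calls the zero-argument function f `warmup` times without measuring,
  then `runs` more times, and returns the median elapsed nanoseconds."
  [f warmup runs]
  ;; Unmeasured warm-up calls, giving the JIT a chance to compile f.
  (dotimes [_ warmup] (f))
  (median
    (vec (repeatedly runs
                     #(let [start (System/nanoTime)]
                        (f)
                        (- (System/nanoTime) start))))))

;; Example: 1000 warm-up calls, then the median of 101 measured calls.
(bench-ns #(Thread/yield) 1000 101)
```

Note that going through a function call adds its own invocation overhead to every sample, so the absolute numbers are not directly comparable with the macro-based timings above. The median only makes the result less sensitive to a slow first run; it does not explain where that slowness comes from.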



Source: https://stackoverflow.com/questions/48741921/why-is-the-first-run-always-much-slower
