I like some features of D, but would be interested if they come with a runtime penalty?
To compare, I implemented a simple program that computes scalar products of m
One big thing that slows D down is a subpar garbage collection implementation. Benchmarks that don't heavily stress the GC will show very similar performance to C and C++ code compiled with the same compiler backend. Benchmarks that do heavily stress the GC will show that D performs abysmally. Rest assured, though, this is a single (albeit severe) quality-of-implementation issue, not a baked-in guarantee of slowness. Also, D gives you the ability to opt out of GC and tune memory management in performance-critical bits, while still using it in the less performance-critical 95% of your code.
I've put some effort into improving GC performance lately and the results have been rather dramatic, at least on synthetic benchmarks. Hopefully these changes will be integrated into one of the next few releases and will mitigate the issue.
Definitely seems like a quality-of-implementation issue.
I ran some tests with the OP's code and made some changes. I actually got D going faster for LDC/clang++, operating on the assumption that arrays must be allocated dynamically (xs
and associated scalars). See below for some numbers.
Is it intentional that the same seed be used for each iteration of C++, while not so for D?
I have tweaked the original D source (dubbed scalar.d) to make it portable between platforms. This only involved changing the type of the numbers used to access and modify the size of arrays.
After this, I made the following changes:
Used uninitializedArray
to avoid default inits for scalars in xs (probably made the biggest difference). This is important because D normally default-inits everything silently, which C++ does not.
Factored out printing code and replaced writefln
with writeln
^^
) instead of manual multiplication for final step of calculating averagesize_type
and replaced appropriately with the new index_type
alias...thus resulting in scalar2.cpp
(pastebin):
import std.stdio : writeln;
import std.datetime : Clock, Duration;
import std.array : uninitializedArray;
import std.random : uniform;
alias result_type = long;
alias value_type = int;
alias vector_t = value_type[];
alias index_type = typeof(vector_t.init.length);// Make index integrals portable - Linux is ulong, Win8.1 is uint
immutable long N = 20000;
immutable int size = 10;
// Replaced for loops with appropriate foreach versions
value_type scalar_product(in ref vector_t x, in ref vector_t y) { // "in" is the same as "const" here
value_type res = 0;
for(index_type i = 0; i < size; ++i)
res += x[i] * y[i];
return res;
}
int main() {
auto tm_before = Clock.currTime;
auto countElapsed(in string taskName) { // Factor out printing code
writeln(taskName, ": ", Clock.currTime - tm_before);
tm_before = Clock.currTime;
}
// 1. allocate and fill randomly many short vectors
vector_t[] xs = uninitializedArray!(vector_t[])(N);// Avoid default inits of inner arrays
for(index_type i = 0; i < N; ++i)
xs[i] = uninitializedArray!(vector_t)(size);// Avoid more default inits of values
countElapsed("allocation");
for(index_type i = 0; i < N; ++i)
for(index_type j = 0; j < size; ++j)
xs[i][j] = uniform(-1000, 1000);
countElapsed("random");
// 2. compute all pairwise scalar products:
result_type avg = 0;
for(index_type i = 0; i < N; ++i)
for(index_type j = 0; j < N; ++j)
avg += scalar_product(xs[i], xs[j]);
avg /= N ^^ 2;// Replace manual multiplication with pow operator
writeln("result: ", avg);
countElapsed("scalar products");
return 0;
}
After testing scalar2.d
(which prioritized optimization for speed), out of curiousity I replaced the loops in main
with foreach
equivalents, and called it scalar3.d
(pastebin):
import std.stdio : writeln;
import std.datetime : Clock, Duration;
import std.array : uninitializedArray;
import std.random : uniform;
alias result_type = long;
alias value_type = int;
alias vector_t = value_type[];
alias index_type = typeof(vector_t.init.length);// Make index integrals portable - Linux is ulong, Win8.1 is uint
immutable long N = 20000;
immutable int size = 10;
// Replaced for loops with appropriate foreach versions
value_type scalar_product(in ref vector_t x, in ref vector_t y) { // "in" is the same as "const" here
value_type res = 0;
for(index_type i = 0; i < size; ++i)
res += x[i] * y[i];
return res;
}
int main() {
auto tm_before = Clock.currTime;
auto countElapsed(in string taskName) { // Factor out printing code
writeln(taskName, ": ", Clock.currTime - tm_before);
tm_before = Clock.currTime;
}
// 1. allocate and fill randomly many short vectors
vector_t[] xs = uninitializedArray!(vector_t[])(N);// Avoid default inits of inner arrays
foreach(ref x; xs)
x = uninitializedArray!(vector_t)(size);// Avoid more default inits of values
countElapsed("allocation");
foreach(ref x; xs)
foreach(ref val; x)
val = uniform(-1000, 1000);
countElapsed("random");
// 2. compute all pairwise scalar products:
result_type avg = 0;
foreach(const ref x; xs)
foreach(const ref y; xs)
avg += scalar_product(x, y);
avg /= N ^^ 2;// Replace manual multiplication with pow operator
writeln("result: ", avg);
countElapsed("scalar products");
return 0;
}
I compiled each of these tests using an LLVM-based compiler, since LDC seems to be the best option for D compilation in terms of performance. On my x86_64 Arch Linux installation I used the following packages:
clang 3.6.0-3
ldc 1:0.15.1-4
dtools 2.067.0-2
I used the following commands to compile each:
clang++ scalar.cpp -o"scalar.cpp.exe" -std=c++11 -O3
rdmd --compiler=ldc2 -O3 -boundscheck=off <sourcefile>
The results (screenshot of raw console output) of each version of the source as follows:
scalar.cpp
(original C++):
allocation: 2 ms
random generation: 12 ms
result: 29248300000
time: 2582 ms
C++ sets the standard at 2582 ms.
scalar.d
(modified OP source):
allocation: 5 ms, 293 μs, and 5 hnsecs
random: 10 ms, 866 μs, and 4 hnsecs
result: 53237080000
scalar products: 2 secs, 956 ms, 513 μs, and 7 hnsecs
This ran for ~2957 ms. Slower than the C++ implementation, but not too much.
scalar2.d
(index/length type change and uninitializedArray optimization):
allocation: 2 ms, 464 μs, and 2 hnsecs
random: 5 ms, 792 μs, and 6 hnsecs
result: 59
scalar products: 1 sec, 859 ms, 942 μs, and 9 hnsecs
In other words, ~1860 ms. So far this is in the lead.
scalar3.d
(foreaches):
allocation: 2 ms, 911 μs, and 3 hnsecs
random: 7 ms, 567 μs, and 8 hnsecs
result: 189
scalar products: 2 secs, 182 ms, and 366 μs
~2182 ms is slower than scalar2.d
, but faster than the C++ version.
With the correct optimizations, the D implementation actually went faster than its equivalent C++ implementation using the LLVM-based compilers available. The current gap between D and C++ for most applications seems only to be based on limitations of current implementations.