How fast is D compared to C++?

Asked 2020-12-22 15:16 by 陌清茗 · 8 answers · 1891 views

I like some features of D, but would be interested if they come with a runtime penalty?

To compare, I implemented a simple program that computes scalar products of many short vectors.

8 Answers
  • 2020-12-22 16:15

    One big thing that slows D down is a subpar garbage collection implementation. Benchmarks that don't heavily stress the GC will show very similar performance to C and C++ code compiled with the same compiler backend. Benchmarks that do heavily stress the GC will show that D performs abysmally. Rest assured, though, this is a single (albeit severe) quality-of-implementation issue, not a baked-in guarantee of slowness. Also, D gives you the ability to opt out of GC and tune memory management in performance-critical bits, while still using it in the less performance-critical 95% of your code.
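To illustrate that opt-out, here is a minimal sketch of the two standard mechanisms, `@nogc` and `core.memory.GC.disable` (the dot-product function and the values are hypothetical, not taken from the OP's benchmark):

```d
// Sketch: two common ways to keep the GC out of a hot path.
import core.memory : GC;
import std.stdio : writeln;

// @nogc guarantees at compile time that this function never
// allocates from the GC heap.
@nogc int dot(const int[] x, const int[] y)
{
    int res = 0;
    foreach (i; 0 .. x.length)
        res += x[i] * y[i];
    return res;
}

void main()
{
    int[4] a = [1, 2, 3, 4];
    int[4] b = [5, 6, 7, 8];

    GC.disable();           // no collections during the hot section
    auto r = dot(a[], b[]);
    GC.enable();            // resume normal collection afterwards
    GC.collect();           // optionally collect at a convenient point

    writeln(r);             // 70
}
```

The rest of the program can keep using GC-managed memory as usual; only the annotated or bracketed sections are affected.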

    I've put some effort into improving GC performance lately and the results have been rather dramatic, at least on synthetic benchmarks. Hopefully these changes will be integrated into one of the next few releases and will mitigate the issue.

  • 2020-12-22 16:16

    Definitely seems like a quality-of-implementation issue.

    I ran some tests with the OP's code and made some changes. With LDC versus clang++, I actually got D running faster, under the assumption that the arrays (xs and the associated scalars) must be allocated dynamically. See below for some numbers.

    Questions for the OP

    Is it intentional that the same seed is used for each iteration in C++, but not in D?
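    For reference, if the intent is for D to produce the same sequence on every run, the generator can be seeded explicitly and passed to uniform; a minimal sketch (Mt19937 and the seed 42 are illustrative choices, not from the OP's code):

```d
// Sketch: fixing the seed so D runs are reproducible and comparable.
import std.random : Mt19937, uniform;
import std.stdio : writeln;

void main()
{
    auto rng = Mt19937(42);  // same seed every run -> same sequence
    int[5] vals;
    foreach (ref v; vals)
        v = uniform(-1000, 1000, rng);  // pass the generator explicitly
    writeln(vals);
}
```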

    Setup

    I have tweaked the original D source (dubbed scalar.d) to make it portable between platforms. This only involved changing the integer types used to index arrays and represent their sizes.

    After this, I made the following changes:

    • Used uninitializedArray to avoid default inits for scalars in xs (probably made the biggest difference). This is important because D normally default-inits everything silently, which C++ does not.

    • Factored out printing code and replaced writefln with writeln

    • Changed imports to be selective
    • Used pow operator (^^) instead of manual multiplication for final step of calculating average
    • Removed the size_type and replaced appropriately with the new index_type alias

    ...thus resulting in scalar2.d (pastebin):

        import std.stdio : writeln;
        import std.datetime : Clock, Duration;
        import std.array : uninitializedArray;
        import std.random : uniform;
    
        alias result_type = long;
        alias value_type = int;
        alias vector_t = value_type[];
        alias index_type = typeof(vector_t.init.length);// Make index integrals portable - Linux is ulong, Win8.1 is uint
    
        immutable long N = 20000;
        immutable int size = 10;
    
        value_type scalar_product(in ref vector_t x, in ref vector_t y) { // "in" is the same as "const" here
          value_type res = 0;
          for(index_type i = 0; i < size; ++i)
            res += x[i] * y[i];
          return res;
        }
    
        int main() {
          auto tm_before = Clock.currTime;
          auto countElapsed(in string taskName) { // Factor out printing code
            writeln(taskName, ": ", Clock.currTime - tm_before);
            tm_before = Clock.currTime;
          }
    
          // 1. allocate and fill randomly many short vectors
          vector_t[] xs = uninitializedArray!(vector_t[])(N);// Avoid default inits of inner arrays
          for(index_type i = 0; i < N; ++i)
            xs[i] = uninitializedArray!(vector_t)(size);// Avoid more default inits of values
          countElapsed("allocation");
    
          for(index_type i = 0; i < N; ++i)
            for(index_type j = 0; j < size; ++j)
              xs[i][j] = uniform(-1000, 1000);
          countElapsed("random");
    
          // 2. compute all pairwise scalar products:
          result_type avg = 0;
          for(index_type i = 0; i < N; ++i)
            for(index_type j = 0; j < N; ++j)
              avg += scalar_product(xs[i], xs[j]);
          avg /= N ^^ 2;// Replace manual multiplication with pow operator
          writeln("result: ", avg);
          countElapsed("scalar products");
    
          return 0;
        }
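    To see in isolation what the uninitializedArray calls above buy, compare them against `new`, which default-initializes every element; a small sketch:

```d
// Sketch: default-initialized vs. uninitialized allocation in D.
import std.array : uninitializedArray;

void main()
{
    // `new` writes a default value into every element
    // (0 for int, NaN for floating-point types).
    auto a = new int[](10);
    assert(a[0] == 0);

    // uninitializedArray skips that write, so allocation is cheaper,
    // but the contents are garbage until assigned.
    auto b = uninitializedArray!(int[])(10);
    foreach (ref v; b)
        v = 1;            // fill before any read
    assert(b[0] == 1);
}
```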
    

    After testing scalar2.d (which was optimized for speed), out of curiosity I replaced the loops in main with foreach equivalents, and called it scalar3.d (pastebin):

        import std.stdio : writeln;
        import std.datetime : Clock, Duration;
        import std.array : uninitializedArray;
        import std.random : uniform;
    
        alias result_type = long;
        alias value_type = int;
        alias vector_t = value_type[];
        alias index_type = typeof(vector_t.init.length);// Make index integrals portable - Linux is ulong, Win8.1 is uint
    
        immutable long N = 20000;
        immutable int size = 10;
    
        // scalar_product unchanged; the for loops in main below are replaced with foreach
        value_type scalar_product(in ref vector_t x, in ref vector_t y) { // "in" is the same as "const" here
          value_type res = 0;
          for(index_type i = 0; i < size; ++i)
            res += x[i] * y[i];
          return res;
        }
    
        int main() {
          auto tm_before = Clock.currTime;
          auto countElapsed(in string taskName) { // Factor out printing code
            writeln(taskName, ": ", Clock.currTime - tm_before);
            tm_before = Clock.currTime;
          }
    
          // 1. allocate and fill randomly many short vectors
          vector_t[] xs = uninitializedArray!(vector_t[])(N);// Avoid default inits of inner arrays
          foreach(ref x; xs)
            x = uninitializedArray!(vector_t)(size);// Avoid more default inits of values
          countElapsed("allocation");
    
          foreach(ref x; xs)
            foreach(ref val; x)
              val = uniform(-1000, 1000);
          countElapsed("random");
    
          // 2. compute all pairwise scalar products:
          result_type avg = 0;
          foreach(const ref x; xs)
            foreach(const ref y; xs)
              avg += scalar_product(x, y);
          avg /= N ^^ 2;// Replace manual multiplication with pow operator
          writeln("result: ", avg);
          countElapsed("scalar products");
    
          return 0;
        }
    

    I compiled each of these tests using an LLVM-based compiler, since LDC seems to be the best option for D compilation in terms of performance. On my x86_64 Arch Linux installation I used the following packages:

    • clang 3.6.0-3
    • ldc 1:0.15.1-4
    • dtools 2.067.0-2

    I used the following commands to compile each:

    • C++: clang++ scalar.cpp -o"scalar.cpp.exe" -std=c++11 -O3
    • D: rdmd --compiler=ldc2 -O3 -boundscheck=off <sourcefile>

    Results

    The results (screenshot of raw console output) for each version of the source are as follows:

    1. scalar.cpp (original C++):

      allocation: 2 ms
      
      random generation: 12 ms
      
      result: 29248300000
      
      time: 2582 ms
      

      C++ sets the standard at 2582 ms.

    2. scalar.d (modified OP source):

      allocation: 5 ms, 293 μs, and 5 hnsecs 
      
      random: 10 ms, 866 μs, and 4 hnsecs 
      
      result: 53237080000
      
      scalar products: 2 secs, 956 ms, 513 μs, and 7 hnsecs 
      

      This ran in ~2957 ms. Slower than the C++ implementation, but not by much.

    3. scalar2.d (index/length type change and uninitializedArray optimization):

      allocation: 2 ms, 464 μs, and 2 hnsecs
      
      random: 5 ms, 792 μs, and 6 hnsecs
      
      result: 59
      
      scalar products: 1 sec, 859 ms, 942 μs, and 9 hnsecs
      

      In other words, ~1860 ms. So far this is in the lead.

    4. scalar3.d (foreaches):

      allocation: 2 ms, 911 μs, and 3 hnsecs
      
      random: 7 ms, 567 μs, and 8 hnsecs
      
      result: 189
      
      scalar products: 2 secs, 182 ms, and 366 μs
      

      ~2182 ms is slower than scalar2.d, but faster than the C++ version.

    Conclusion

    With the right optimizations, the D implementation actually ran faster than its equivalent C++ implementation using the LLVM-based compilers available. For most applications, the current gap between D and C++ seems to be based only on limitations of the current implementations.
