Fix arithmetic error in distributed version

悲&欢浪女 2021-01-22 04:32

I am inverting a matrix via a Cholesky factorization, in a distributed environment, as it was discussed here. My code works fine, but in order to test that my distributed projec

2 answers
  • 2021-01-22 04:57

    As the other answer mentions, getting exactly the same results from serial and distributed runs is not guaranteed. A common technique with HPC/distributed workloads is to validate the solution. Techniques range from calculating a percent error to more complex validation schemes, like the one used by HPL. Here is a simple C++ function that calculates the relative error (multiply by 100 to get a percent error). As @HighPerformanceMark notes in his post, the analysis of this sort of numerical error is incredibly complex; this is a very simple method, and there is a lot of information available online about the topic.

    #include <cmath>
    #include <iostream>

    // Relative error of x with respect to the reference value a (assumes a != 0).
    double calc_error(double a, double x)
    {
      return std::abs(x - a) / std::abs(a);
    }

    int main()
    {
      // Serial results (reference) and the corresponding distributed results.
      double sans[] = {-250207683.634793, -1353198687.861288, 2816966067.598196, -144344843844.616425, 323890119928.788757};
      double pans[] = {-250207683.634692, -1353198687.861386, 2816966067.598891, -144344843844.617096, 323890119928.788757};
      double err[5];
      std::cout << "Serial Answer,Distributed Answer, Error" << std::endl;
      for (int it = 0; it < 5; it++) {
        err[it] = calc_error(sans[it], pans[it]);
        std::cout << sans[it] << "," << pans[it] << "," << err[it] << "\n";
      }
      return 0;
    }
    

    Which produces this output:

    Serial Answer,Distributed Answer, Error
    -2.50208e+08,-2.50208e+08,4.03665e-13
    -1.3532e+09,-1.3532e+09,7.24136e-14
    2.81697e+09,2.81697e+09,2.46631e-13
    -1.44345e+11,-1.44345e+11,4.65127e-15
    3.2389e+11,3.2389e+11,0
    

    As you can see, the error in every case is on the order of 10^-13 or less, and in one case it is zero. Depending on the problem you are trying to solve, error of this magnitude could be considered acceptable. Hopefully this illustrates one way of validating a distributed solution against a serial one, or at least one way to show how far apart the parallel and serial results are.
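
    One caveat with `calc_error` above: it divides by `|a|`, so it is undefined when the serial value is zero. Below is a minimal sketch of a more defensive check, assuming a combined relative/absolute tolerance is acceptable for your problem. The name `nearly_equal` and the default tolerances are illustrative choices, not part of the original code.

```cpp
#include <algorithm>
#include <cmath>

// Hypothetical helper: true when a and b agree within a relative tolerance,
// with an optional absolute tolerance for values at or near zero.
// The default tolerances are assumptions; pick values that suit your problem.
bool nearly_equal(double a, double b, double rel_tol = 1e-12, double abs_tol = 0.0)
{
  const double diff = std::abs(a - b);
  const double scale = std::max(std::abs(a), std::abs(b));
  return diff <= std::max(rel_tol * scale, abs_tol);
}
```

    With these defaults, the first pair in the table (`-250207683.634793` and `-250207683.634692`) compares as equal, while `1.0` and `1.0001` do not.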

    When validating answers for big problems and parallel algorithms it can also be valuable to perform several runs of the parallel algorithm, saving the results of each run. You can then look to see if the result and/or error varies with the parallel algorithm run or if it settles over time.

    Showing that a parallel algorithm produces error within acceptable thresholds over 1000 runs (just an example; the more data the better for this sort of thing) for various problem sizes is one way to assess the validity of a result.
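
    As a rough sketch of that idea, assuming the results of each run have been saved as a vector of doubles alongside a serial reference, the per-run worst-case relative error can be checked against a threshold. The function names and the threshold are hypothetical, and the reference entries are assumed to be non-zero.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Largest relative error of one run against the serial reference.
// Assumes both vectors have the same length and no reference entry is zero.
double max_relative_error(const std::vector<double>& reference,
                          const std::vector<double>& run)
{
  double worst = 0.0;
  for (std::size_t i = 0; i < reference.size(); ++i) {
    worst = std::max(worst, std::abs(run[i] - reference[i]) / std::abs(reference[i]));
  }
  return worst;
}

// True when every saved run stays within the chosen threshold.
bool all_runs_within(const std::vector<double>& reference,
                     const std::vector<std::vector<double>>& runs,
                     double threshold)
{
  return std::all_of(runs.begin(), runs.end(),
                     [&](const std::vector<double>& run) {
                       return max_relative_error(reference, run) <= threshold;
                     });
}
```

    Logging `max_relative_error` for each run, rather than only the pass/fail result, also makes it possible to see whether the error drifts or settles across runs.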

    In the past, when I have performed benchmark testing, I have noticed wildly varying behavior for the first several runs, before the servers have "warmed up". At the time I never bothered to check whether the error in the result stabilized over time the same way performance did, but it would be interesting to see.

  • 2021-01-22 05:03

    Your differences seem to appear at about the 12th significant figure (s.f.). Since floating-point arithmetic is not truly associative (that is, f-p arithmetic does not guarantee that a+(b+c) == (a+b)+c), and since parallel execution does not, generally, apply operations in a deterministic order, these small differences are typical of parallelised numerical codes when compared to their serial equivalents. Indeed, you may observe the same order of difference when running on a different number of processors, 4 vs 8, say.
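
    The non-associativity is easy to reproduce. The sketch below assumes IEEE 754 double arithmetic and a compiler that is not reordering floating-point operations (no fast-math style flags); the function names are only for illustration.

```cpp
// Same three operands, two different groupings.
double left_grouped(double a, double b, double c)
{
  return (a + b) + c;
}

double right_grouped(double a, double b, double c)
{
  return a + (b + c);
}
```

    For `a = 0.1`, `b = 0.2`, `c = 0.3`, the two functions return values that differ in the last bit, so an `==` comparison between them fails. A parallel reduction effectively picks a grouping at run time, which is where the differences come from.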

    Unfortunately, the easy way to get deterministic results is to stick to serial execution. Getting deterministic results from parallel execution requires a major effort to be very specific about the order of execution of operations, right down to the last + or *, which almost certainly rules out the use of most numeric libraries and leads you to painstaking manual coding of large numeric routines.
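
    To give a flavour of what "being specific about the order" means, here is a small sketch for the final step of a sum reduction only. It assumes each worker delivers one partial sum tagged with its id and that partial sums arrive in arbitrary order; the function name and the `(worker_id, partial_sum)` layout are hypothetical and not taken from any particular library.

```cpp
#include <algorithm>
#include <utility>
#include <vector>

// Combine partial sums in worker-id order rather than arrival order,
// so the final additions are always performed in the same sequence.
// Each element is a hypothetical (worker_id, partial_sum) pair.
double reduce_in_rank_order(std::vector<std::pair<int, double>> partials)
{
  std::sort(partials.begin(), partials.end(),
            [](const std::pair<int, double>& x, const std::pair<int, double>& y) {
              return x.first < y.first;
            });
  double total = 0.0;
  for (const auto& p : partials) {
    total += p.second;
  }
  return total;
}
```

    This only pins down the order of the last few additions. The result can still change when the number of workers or the data partitioning changes, and every operation inside each worker would need the same treatment, which is the painstaking part.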

    In most cases that I've encountered, the accuracy of the input data, often derived from sensors, does not warrant worrying about the 12th or later s.f. I don't know what your numbers represent, but for many scientists and engineers equality to the 4th or 5th s.f. is enough equality for all practical purposes. It's a different matter for mathematicians ...
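
    If it helps to express the agreement in significant figures directly, a rough estimate is the negative base-10 logarithm of the relative difference. The sketch below assumes a non-zero reference value; `matching_digits` is a hypothetical name, and the value returned for identical inputs is simply the decimal precision of a double.

```cpp
#include <cmath>
#include <limits>

// Approximate number of leading significant decimal digits on which
// value agrees with reference. Assumes reference != 0.
double matching_digits(double reference, double value)
{
  if (reference == value) {
    // Identical inputs: report the decimal precision of double (15).
    return std::numeric_limits<double>::digits10;
  }
  const double rel = std::abs(value - reference) / std::abs(reference);
  return -std::log10(rel);
}
```

    For the pair `-250207683.634793` and `-250207683.634692` from the other answer this gives a value between 12 and 13, consistent with differences appearing at about the 12th s.f.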
