I am trying to compare the performance of boost::multi_array to native dynamically allocated arrays, with the following test program:
#include <windows.h>
#include <cstdio>   // for printf
#define _SCL_SECURE_NO_WARNINGS
#define BOOST_DISABLE_ASSERTS
#include <boost/multi_array.hpp>

int main(int argc, char* argv[])
{
    const int X_SIZE = 200;
    const int Y_SIZE = 200;
    const int ITERATIONS = 500;
    unsigned int startTime = 0;
    unsigned int endTime = 0;

    // Create the boost array
    typedef boost::multi_array<double, 2> ImageArrayType;
    ImageArrayType boostMatrix(boost::extents[X_SIZE][Y_SIZE]);

    // Create the native array
    double *nativeMatrix = new double [X_SIZE * Y_SIZE];

    //------------------Measure boost----------------------------------------------
    startTime = ::GetTickCount();
    for (int i = 0; i < ITERATIONS; ++i)
    {
        for (int y = 0; y < Y_SIZE; ++y)
        {
            for (int x = 0; x < X_SIZE; ++x)
            {
                boostMatrix[x][y] = 2.345;
            }
        }
    }
    endTime = ::GetTickCount();
    printf("[Boost] Elapsed time: %6.3f seconds\n", (endTime - startTime) / 1000.0);

    //------------------Measure native-----------------------------------------------
    startTime = ::GetTickCount();
    for (int i = 0; i < ITERATIONS; ++i)
    {
        for (int y = 0; y < Y_SIZE; ++y)
        {
            for (int x = 0; x < X_SIZE; ++x)
            {
                nativeMatrix[x + (y * X_SIZE)] = 2.345;
            }
        }
    }
    endTime = ::GetTickCount();
    printf("[Native]Elapsed time: %6.3f seconds\n", (endTime - startTime) / 1000.0);
    return 0;
}
I get the following results:
[Boost] Elapsed time: 12.500 seconds
[Native]Elapsed time: 0.062 seconds
I can't believe multi_arrays are that much slower. Can anyone spot what I am doing wrong?
I assume caching is not an issue since I am doing writes to memory.
EDIT: This was a debug build. Per Laserallan's suggestion I did a release build:
[Boost] Elapsed time: 0.266 seconds
[Native]Elapsed time: 0.016 seconds
Much closer. But 16 to 1 still seems too high to me.
Well, no definitive answer, but I'm going to move on and leave my real code with native arrays for now.
Accepting Laserallan's answer because it was the biggest flaw in my test.
Thanks to all.
Are you building release or debug?
If running in debug mode, the Boost array might be really slow because its template magic isn't inlined properly, giving lots of overhead in function calls. I'm not sure how multi_array is implemented, though, so this might be totally off :)
Perhaps there is some difference in storage order as well, so you might have your image stored column by column while writing it row by row. That would give poor cache behavior and may slow things down.
Try switching the order of the X and Y loop and see if you gain anything. There is some info on the storage ordering here: http://www.boost.org/doc/libs/1_37_0/libs/multi_array/doc/user.html
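For instance (a sketch based on the code in the question, not something from the original answer), the Boost loop with x and y swapped so that the last index varies fastest, which matches multi_array's default C-style (row-major) storage order:
for (int i = 0; i < ITERATIONS; ++i)
{
    for (int x = 0; x < X_SIZE; ++x)          // first index in the outer loop
    {
        for (int y = 0; y < Y_SIZE; ++y)      // last index varies fastest
        {
            boostMatrix[x][y] = 2.345;        // now walks memory contiguously
        }
    }
}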
EDIT: Since you seem to be using the two-dimensional array for image processing, you might be interested in checking out Boost's image processing library, GIL.
It might have arrays with less overhead that work perfectly for your situation.
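As a rough illustration only (my own sketch, not from the original answer; the exact header and typedef names depend on your Boost version), a minimal GIL loop over a grayscale image looks something like this:
#include <boost/gil/gil_all.hpp>   // newer Boost versions use <boost/gil.hpp>

int main()
{
    boost::gil::gray8_image_t img(200, 200);            // assumed 8-bit grayscale image
    boost::gil::gray8_view_t  v = boost::gil::view(img);

    for (int y = 0; y < v.height(); ++y)
        for (int x = 0; x < v.width(); ++x)
            v(x, y) = boost::gil::gray8_pixel_t(128);   // pixel access through the view

    return 0;
}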
On my machine using
g++ -O3 -march=native -mtune=native --fast-math -DNDEBUG test.cpp -o test && ./test
I get
[Boost] Elapsed time: 0.020 seconds
[Native]Elapsed time: 0.020 seconds
However, changing const int ITERATIONS to 5000, I get
[Boost] Elapsed time: 0.240 seconds
[Native]Elapsed time: 0.180 seconds
Then, with ITERATIONS back to 500 but X_SIZE and Y_SIZE set to 400, I get a much more significant difference:
[Boost] Elapsed time: 0.460 seconds
[Native]Elapsed time: 0.070 seconds
Finally, inverting the inner loop for the [Boost] case so it looks like
for (int x = 0; x < X_SIZE; ++x)
{
    for (int y = 0; y < Y_SIZE; ++y)
    {
and keeping ITERATIONS, X_SIZE and Y_SIZE at 500, 400 and 400, I get
[Boost] Elapsed time: 0.060 seconds
[Native]Elapsed time: 0.080 seconds
If I also invert the inner loop for the [Native] case (so it is in the wrong order for that case), I get, unsurprisingly,
[Boost] Elapsed time: 0.070 seconds
[Native]Elapsed time: 0.450 seconds
I am using gcc (Ubuntu/Linaro 4.4.4-14ubuntu5) 4.4.5 on Ubuntu 10.10.
So in conclusion:
- With proper optimization boost::multi_array does its job as expected
- The order in which you access your data does matter
Your test is flawed.
- In a DEBUG build, boost::multi_array misses the optimization passes it sorely needs (much more so than a native array does).
- In a RELEASE build, your compiler will look for code that can be removed outright, and most of your code is in that category.
What you're likely seeing is the result of your optimizing compiler seeing that most or all of your "native array" loops can be removed. The same is theoretically true of your boost::MultiArray loops, but MultiArray is probably complex enough to defeat your optimizer.
Make this small change to your testbed and you'll see more true-to-life results: replace both occurrences of = 2.345 with *= 2.345 and compile again with optimizations. This will prevent your compiler from discovering that the outer loop of each test is redundant.
I did it and got a speed comparison closer to 2:1.
I am wondering two things:
1) Bounds checking: define the BOOST_DISABLE_ASSERTS preprocessor macro prior to including multi_array.hpp in your application. This turns off bounds checking. I am not sure whether it is also disabled when NDEBUG is defined.
2) Base index: multi_array can index arrays from bases other than 0. That means multi_array stores a base value (for each dimension) and uses a more complicated formula to obtain the exact location in memory; I am wondering whether that is what it is all about.
Otherwise I don't understand why multi_array should be slower than C arrays.
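To illustrate point 2, here is a rough sketch (my own simplification, using a hypothetical type, not Boost's actual implementation) of the extra arithmetic that base- and stride-aware indexing implies compared with the plain x * Y_SIZE + y of a C array:
// Simplified 2D indexing with per-dimension bases and strides (hypothetical type).
struct Indexed2D
{
    double* data;
    int     baseX, baseY;      // index bases (always 0 for a plain C array)
    int     strideX, strideY;  // element strides

    double& at(int x, int y)
    {
        // Extra subtractions and stride loads on every access.
        return data[(x - baseX) * strideX + (y - baseY) * strideY];
    }
};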
Consider using Blitz++ instead. I tried out Blitz, and its performance is on par with C-style arrays!
Check out your code with Blitz added below:
#include <windows.h>
#include <cstdio>   // for printf
#define _SCL_SECURE_NO_WARNINGS
#define BOOST_DISABLE_ASSERTS
#include <boost/multi_array.hpp>
#include <blitz/array.h>

int main(int argc, char* argv[])
{
    const int X_SIZE = 200;
    const int Y_SIZE = 200;
    const int ITERATIONS = 500;
    unsigned int startTime = 0;
    unsigned int endTime = 0;

    // Create the boost array
    typedef boost::multi_array<double, 2> ImageArrayType;
    ImageArrayType boostMatrix(boost::extents[X_SIZE][Y_SIZE]);

    //------------------Measure boost----------------------------------------------
    startTime = ::GetTickCount();
    for (int i = 0; i < ITERATIONS; ++i)
    {
        for (int y = 0; y < Y_SIZE; ++y)
        {
            for (int x = 0; x < X_SIZE; ++x)
            {
                boostMatrix[x][y] = 2.345;
            }
        }
    }
    endTime = ::GetTickCount();
    printf("[Boost] Elapsed time: %6.3f seconds\n", (endTime - startTime) / 1000.0);

    //------------------Measure blitz-----------------------------------------------
    blitz::Array<double, 2> blitzArray( X_SIZE, Y_SIZE );
    startTime = ::GetTickCount();
    for (int i = 0; i < ITERATIONS; ++i)
    {
        for (int y = 0; y < Y_SIZE; ++y)
        {
            for (int x = 0; x < X_SIZE; ++x)
            {
                blitzArray(x,y) = 2.345;
            }
        }
    }
    endTime = ::GetTickCount();
    printf("[Blitz] Elapsed time: %6.3f seconds\n", (endTime - startTime) / 1000.0);

    //------------------Measure native-----------------------------------------------
    // Create the native array
    double *nativeMatrix = new double [X_SIZE * Y_SIZE];
    startTime = ::GetTickCount();
    for (int i = 0; i < ITERATIONS; ++i)
    {
        for (int y = 0; y < Y_SIZE; ++y)
        {
            for (int x = 0; x < X_SIZE; ++x)
            {
                nativeMatrix[x + (y * X_SIZE)] = 2.345;
            }
        }
    }
    endTime = ::GetTickCount();
    printf("[Native]Elapsed time: %6.3f seconds\n", (endTime - startTime) / 1000.0);
    return 0;
}
Here are the results in debug and release.
DEBUG:
Boost 2.093 secs
Blitz 0.375 secs
Native 0.078 secs
RELEASE:
Boost 0.266 secs
Blitz 0.016 secs
Native 0.015 secs
I used MSVC 2008 SP1 compiler for this.
Can we now say good-bye to C-style arrays? =p
I was looking at this question because I had the same question. I had some thoughts to give a more rigorous test.
- As rodrigob pointed out, the loop order is flawed, so any results from the code you originally attached will be misleading.
- Also, the arrays are rather small and their sizes are compile-time constants, so the compiler may optimize the loops away entirely, whereas in real code it would not know the array sizes. The array sizes and the number of iterations should therefore be runtime inputs, just in case.
On a Mac, the following code is configured to give more meaningful answers. There are 4 tests here.
#define BOOST_DISABLE_ASSERTS
#include "boost/multi_array.hpp"
#include <sys/time.h>
#include <stdint.h>
#include <cstdio>    // printf
#include <cstdlib>   // rand, srand
#include <ctime>     // time
#include <string>    // std::stoi

uint64_t GetTimeMs64()
{
    struct timeval tv;
    gettimeofday( &tv, NULL );
    uint64_t ret = tv.tv_usec;
    /* Convert from micro seconds (10^-6) to milliseconds (10^-3) */
    ret /= 1000;
    /* Adds the seconds (10^0) after converting them to milliseconds (10^-3) */
    ret += ( tv.tv_sec * 1000 );
    return ret;
}

void function1( const int X_SIZE, const int Y_SIZE, const int ITERATIONS )
{
    // Variable-length array: relies on a GCC/Clang extension in C++.
    double nativeMatrix1add[X_SIZE*Y_SIZE];
    for( int x = 0 ; x < X_SIZE ; ++x )
    {
        for( int y = 0 ; y < Y_SIZE ; ++y )
        {
            nativeMatrix1add[y + ( x * Y_SIZE )] = rand();
        }
    }
    // Create the native array
    double* __restrict const nativeMatrix1p = new double[X_SIZE * Y_SIZE];
    uint64_t startTime = GetTimeMs64();
    for( int i = 0 ; i < ITERATIONS ; ++i )
    {
        for( int xy = 0 ; xy < X_SIZE*Y_SIZE ; ++xy )
        {
            nativeMatrix1p[xy] += nativeMatrix1add[xy];
        }
    }
    uint64_t endTime = GetTimeMs64();
    printf( "[Native Pointer] Elapsed time: %6.3f seconds\n", ( endTime - startTime ) / 1000.0 );
}

void function2( const int X_SIZE, const int Y_SIZE, const int ITERATIONS )
{
    double nativeMatrix1add[X_SIZE*Y_SIZE];
    for( int x = 0 ; x < X_SIZE ; ++x )
    {
        for( int y = 0 ; y < Y_SIZE ; ++y )
        {
            nativeMatrix1add[y + ( x * Y_SIZE )] = rand();
        }
    }
    // Create the native array
    double* __restrict const nativeMatrix1 = new double[X_SIZE * Y_SIZE];
    uint64_t startTime = GetTimeMs64();
    for( int i = 0 ; i < ITERATIONS ; ++i )
    {
        for( int x = 0 ; x < X_SIZE ; ++x )
        {
            for( int y = 0 ; y < Y_SIZE ; ++y )
            {
                nativeMatrix1[y + ( x * Y_SIZE )] += nativeMatrix1add[y + ( x * Y_SIZE )];
            }
        }
    }
    uint64_t endTime = GetTimeMs64();
    printf( "[Native 1D Array] Elapsed time: %6.3f seconds\n", ( endTime - startTime ) / 1000.0 );
}

void function3( const int X_SIZE, const int Y_SIZE, const int ITERATIONS )
{
    double nativeMatrix2add[X_SIZE][Y_SIZE];
    for( int x = 0 ; x < X_SIZE ; ++x )
    {
        for( int y = 0 ; y < Y_SIZE ; ++y )
        {
            nativeMatrix2add[x][y] = rand();
        }
    }
    // Create the native array
    double nativeMatrix2[X_SIZE][Y_SIZE];
    uint64_t startTime = GetTimeMs64();
    for( int i = 0 ; i < ITERATIONS ; ++i )
    {
        for( int x = 0 ; x < X_SIZE ; ++x )
        {
            for( int y = 0 ; y < Y_SIZE ; ++y )
            {
                nativeMatrix2[x][y] += nativeMatrix2add[x][y];
            }
        }
    }
    uint64_t endTime = GetTimeMs64();
    printf( "[Native 2D Array] Elapsed time: %6.3f seconds\n", ( endTime - startTime ) / 1000.0 );
}

void function4( const int X_SIZE, const int Y_SIZE, const int ITERATIONS )
{
    boost::multi_array<double, 2> boostMatrix2add( boost::extents[X_SIZE][Y_SIZE] );
    for( int x = 0 ; x < X_SIZE ; ++x )
    {
        for( int y = 0 ; y < Y_SIZE ; ++y )
        {
            boostMatrix2add[x][y] = rand();
        }
    }
    // Create the boost array
    boost::multi_array<double, 2> boostMatrix( boost::extents[X_SIZE][Y_SIZE] );
    uint64_t startTime = GetTimeMs64();
    for( int i = 0 ; i < ITERATIONS ; ++i )
    {
        for( int x = 0 ; x < X_SIZE ; ++x )
        {
            for( int y = 0 ; y < Y_SIZE ; ++y )
            {
                boostMatrix[x][y] += boostMatrix2add[x][y];
            }
        }
    }
    uint64_t endTime = GetTimeMs64();
    printf( "[Boost Array] Elapsed time: %6.3f seconds\n", ( endTime - startTime ) / 1000.0 );
}

int main( int argc, char* argv[] )
{
    srand( time( NULL ) );
    const int X_SIZE = std::stoi( argv[1] );
    const int Y_SIZE = std::stoi( argv[2] );
    const int ITERATIONS = std::stoi( argv[3] );
    function1( X_SIZE, Y_SIZE, ITERATIONS );
    function2( X_SIZE, Y_SIZE, ITERATIONS );
    function3( X_SIZE, Y_SIZE, ITERATIONS );
    function4( X_SIZE, Y_SIZE, ITERATIONS );
    return 0;
}
- One with just a single-dimensional array, using [] with integer index math and a double loop
- One with the same single-dimensional array, using pointer incrementing
- A multidimensional C array
- A boost multi_array
Then, from a command line, run
./test_array xsize ysize iterations
and you can get a good idea of how these approaches will perform. Here is what I got with the following compiler flags:
g++4.9.2 -O3 -march=native -funroll-loops -mno-avx --fast-math -DNDEBUG -c -std=c++11
./test_array 51200 1 20000
[Native 1-Loop ] Elapsed time: 0.537 seconds
[Native 1D Array] Elapsed time: 2.045 seconds
[Native 2D Array] Elapsed time: 2.749 seconds
[Boost Array] Elapsed time: 1.167 seconds
./test_array 25600 2 20000
[Native 1-Loop ] Elapsed time: 0.531 seconds
[Native 1D Array] Elapsed time: 1.241 seconds
[Native 2D Array] Elapsed time: 1.631 seconds
[Boost Array] Elapsed time: 0.954 seconds
./test_array 12800 4 20000
[Native 1-Loop ] Elapsed time: 0.536 seconds
[Native 1D Array] Elapsed time: 1.214 seconds
[Native 2D Array] Elapsed time: 1.223 seconds
[Boost Array] Elapsed time: 0.798 seconds
./test_array 6400 8 20000
[Native 1-Loop ] Elapsed time: 0.540 seconds
[Native 1D Array] Elapsed time: 0.845 seconds
[Native 2D Array] Elapsed time: 0.878 seconds
[Boost Array] Elapsed time: 0.803 seconds
./test_array 3200 16 20000
[Native 1-Loop ] Elapsed time: 0.537 seconds
[Native 1D Array] Elapsed time: 0.661 seconds
[Native 2D Array] Elapsed time: 0.673 seconds
[Boost Array] Elapsed time: 0.708 seconds
./test_array 1600 32 20000
[Native 1-Loop ] Elapsed time: 0.532 seconds
[Native 1D Array] Elapsed time: 0.592 seconds
[Native 2D Array] Elapsed time: 0.596 seconds
[Boost Array] Elapsed time: 0.764 seconds
./test_array 800 64 20000
[Native 1-Loop ] Elapsed time: 0.546 seconds
[Native 1D Array] Elapsed time: 0.594 seconds
[Native 2D Array] Elapsed time: 0.606 seconds
[Boost Array] Elapsed time: 0.764 seconds
./test_array 400 128 20000
[Native 1-Loop ] Elapsed time: 0.536 seconds
[Native 1D Array] Elapsed time: 0.560 seconds
[Native 2D Array] Elapsed time: 0.564 seconds
[Boost Array] Elapsed time: 0.746 seconds
So, I think it is safe to say that boost::multi_array performs pretty well. Nothing beats a single-loop evaluation, but depending on the dimensions of the array, boost::multi_array may beat a standard C array with a double loop.
Another thing to try is to use iterators instead of a straight index for the boost array.
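For example (a sketch, not from the original answer), two ways to avoid recomputing the [x][y] index on every access:
// (a) Nested iterators: the outer iterator walks the rows, the inner one the elements.
for (auto row = boostMatrix.begin(); row != boostMatrix.end(); ++row)
{
    for (auto it = (*row).begin(); it != (*row).end(); ++it)
    {
        *it = 2.345;
    }
}

// (b) Flat traversal of the contiguous storage that multi_array exposes.
double* p = boostMatrix.data();
for (std::size_t k = 0, n = boostMatrix.num_elements(); k < n; ++k)
{
    p[k] = 2.345;
}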
I would have expected multi_array to be just as efficient, but I'm getting similar results on a PPC Mac using gcc. I also tried multi_array_ref, so that both versions were using the same storage, with no difference. This is good to know, since I use multi_array in some of my code and had just assumed it was similar to hand-coding.
I think I know what the problem is...maybe.
In order for the Boost implementation to support syntax like matrix[x][y], matrix[x] has to return a proxy object that acts like a one-dimensional array, at which point proxy[y] gives you your element.
The problem here is that you are iterating in row-major order (which is typical in C/C++, since native arrays are row-major IIRC). The compiler has to re-evaluate matrix[x] for every y in this case. If you iterated in column-major order when using the Boost matrix, you might see better performance.
Just a theory.
EDIT: On my Linux system (with some minor changes) I tested my theory, and switching x and y did show some performance improvement, but it was still slower than a native array. This might be a simple case of the compiler not being able to optimize away the temporary reference type.
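One way to probe that theory (again just a sketch against the code in the question) is to hoist the intermediate proxy out of the inner loop, so matrix[x] is evaluated once per row rather than once per element:
for (int i = 0; i < ITERATIONS; ++i)
{
    for (int x = 0; x < X_SIZE; ++x)
    {
        ImageArrayType::reference row = boostMatrix[x];  // 1D sub-array proxy, built once per row
        for (int y = 0; y < Y_SIZE; ++y)
        {
            row[y] = 2.345;
        }
    }
}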
Build in release mode, use objdump, and look at the assembly. They may be doing completely different things, and you'll be able to see which optimizations the compiler is using.
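For example (a sketch; adjust the file names to your setup):
g++ -O3 -DNDEBUG test.cpp -o test
objdump -d -C test > test.asm    # -d disassembles, -C demangles the C++ symbol names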
A similar question was asked and answered here:
http://www.codeguru.com/forum/archive/index.php/t-300014.html
The short answer is that it is easiest for the compiler to optimize the simple arrays, and not so easy to optimize the Boost version. Hence, a particular compiler may not give the Boost version all the same optimization benefits.
Compilers can also vary in how well they will optimize vs. how conservative they will be (e.g. with templated code or other complications).
I tested on a Snow Leopard Mac OS using gcc 4.2.1
Debug:
[Boost] Elapsed time: 2.268 seconds
[Native]Elapsed time: 0.076 seconds
Release:
[Boost] Elapsed time: 0.065 seconds
[Native]Elapsed time: 0.020 seconds
Here is the code (modified so that it can be compiled on Unix):
#define BOOST_DISABLE_ASSERTS
#include <boost/multi_array.hpp>
#include <cstdio>   // printf
#include <ctime>    // clock, CLOCKS_PER_SEC

int main(int argc, char* argv[])
{
    const int X_SIZE = 200;
    const int Y_SIZE = 200;
    const int ITERATIONS = 500;
    unsigned int startTime = 0;
    unsigned int endTime = 0;

    // Create the boost array
    typedef boost::multi_array<double, 2> ImageArrayType;
    ImageArrayType boostMatrix(boost::extents[X_SIZE][Y_SIZE]);

    // Create the native array
    double *nativeMatrix = new double [X_SIZE * Y_SIZE];

    //------------------Measure boost----------------------------------------------
    startTime = clock();
    for (int i = 0; i < ITERATIONS; ++i)
    {
        for (int y = 0; y < Y_SIZE; ++y)
        {
            for (int x = 0; x < X_SIZE; ++x)
            {
                boostMatrix[x][y] = 2.345;
            }
        }
    }
    endTime = clock();
    printf("[Boost] Elapsed time: %6.3f seconds\n", (endTime - startTime) / (double)CLOCKS_PER_SEC);

    //------------------Measure native-----------------------------------------------
    startTime = clock();
    for (int i = 0; i < ITERATIONS; ++i)
    {
        for (int y = 0; y < Y_SIZE; ++y)
        {
            for (int x = 0; x < X_SIZE; ++x)
            {
                nativeMatrix[x + (y * X_SIZE)] = 2.345;
            }
        }
    }
    endTime = clock();
    printf("[Native]Elapsed time: %6.3f seconds\n", (endTime - startTime) / (double)CLOCKS_PER_SEC);
    return 0;
}
Looking at the assembly generated by g++ 4.8.2 with -O3 -DBOOST_DISABLE_ASSERTS, and using both the operator() and the [][] ways of accessing elements, it's evident that the only extra operation compared to a native array with manual index calculation is the addition of the base. I didn't measure the cost of this, though.
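For reference, a self-contained sketch of the two access styles being compared (operator() takes a collection of indices, such as a boost::array):
#include <boost/array.hpp>
#include <boost/multi_array.hpp>

int main()
{
    typedef boost::multi_array<double, 2> ImageArrayType;
    ImageArrayType m(boost::extents[4][4]);

    // Chained operator[]: goes through an intermediate sub-array proxy.
    m[1][2] = 2.345;

    // operator(): one call with a collection of indices, no intermediate proxy.
    boost::array<ImageArrayType::index, 2> idx = {{ 1, 2 }};
    m(idx) = 2.345;

    return 0;
}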
I modified the above code in Visual Studio 2008 (v9.0.21022) and added the container routines from Numerical Recipes for C and C++ (http://www.nrbook.com/nr3/), using their licensed routines dmatrix and MatDoub respectively.
dmatrix uses the outdated malloc-style allocation and is not recommended; MatDoub uses new.
The times in seconds for the Release build are:
Boost: 0.437
Native: 0.032
Numerical Recipes C: 0.031
Numerical recipes C++: 0.031
So, from the above, Blitz looks like the best free alternative.
I've compiled the code (with slight modifications) under VC++ 2010 with optimisation turned on ("Maximize Speed" together with inlining "Any Suitable" functions and favoring fast code) and got times of 0.015/0.391. I've generated an assembly listing and, though I'm a terrible assembly noob, there's one line inside the Boost-measuring loop which doesn't look good to me:
call ??A?$multi_array_ref@N$01@boost@@QAE?AV?$sub_array@N$00@multi_array@detail@1@H@Z ; boost::multi_array_ref<double,2>::operator[]
One of the [] operators didn't get inlined! The called procedure makes another call, this time to multi_array::value_accessor_n<...>::access<...>():
call ??$access@V?$sub_array@N$00@multi_array@detail@boost@@PAN@?$value_accessor_n@N$01@multi_array@detail@boost@@IBE?AV?$sub_array@N$00@123@U?$type@V?$sub_array@N$00@multi_array@detail@boost@@@3@HPANPBIPBH3@Z ; boost::detail::multi_array::value_accessor_n<double,2>::access<boost::detail::multi_array::sub_array<double,1>,double *>
Altogether, the two procedures are quite a lot of code for simply accessing a single element in the array. My general impression is that the library is so complex and high-level that Visual Studio is unable to optimise it as much as we would like (posters using gcc apparently have got better results).
IMHO, a good compiler really should have inlined and optimised the two procedures: both are pretty short and straightforward, and don't contain any loops, etc. A lot of time may be wasted simply on passing their arguments and results.
As answered by rodrigob, activating the proper optimizations (GCC's default is -O0) is the key to getting good performance. In addition, I also tested with Blaze's DynamicMatrix, which yielded an additional factor-of-2 performance improvement with the exact same optimization flags. https://bitbucket.org/account/user/blaze-lib/projects/BLAZE
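For reference, a minimal sketch of what such a Blaze comparison can look like (my own reconstruction, not the poster's code; it assumes Blaze is installed and reuses the sizes from the original test):
#include <blaze/Math.h>
#include <cstdio>

int main()
{
    const int X_SIZE = 200;
    const int Y_SIZE = 200;
    const int ITERATIONS = 500;

    blaze::DynamicMatrix<double> m(X_SIZE, Y_SIZE);   // row-major dynamic matrix

    for (int i = 0; i < ITERATIONS; ++i)
        for (int x = 0; x < X_SIZE; ++x)
            for (int y = 0; y < Y_SIZE; ++y)
                m(x, y) = 2.345;                      // element access via operator()

    std::printf("%f\n", m(0, 0));                     // keep the result observable
    return 0;
}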
Source: https://stackoverflow.com/questions/446866/boostmulti-array-performance-question