Is std::vector so much slower than plain arrays?

南方客 2020-11-22 12:00

I've always thought it's the general wisdom that std::vector is "implemented as an array," blah blah blah. Today I went down and tested it, and it seems to

22 answers
  • 2020-11-22 12:17

    To be fair, you cannot compare a C++ implementation to a C implementation, as I would call your malloc version. malloc does not create objects - it only allocates raw memory. That you then treat that memory as objects without calling the constructor is poor C++ (possibly invalid - I'll leave that to the language lawyers).
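    To make the distinction between raw memory and objects concrete, here is a minimal sketch. The Pixel type below is a stand-in (an assumption: its constructor sets every channel so the effect is observable), and make_pixels_malloc / destroy_pixels_malloc are hypothetical helper names, not part of the original test harness.

```cpp
#include <cstddef>
#include <cstdlib>
#include <new>

// Stand-in element type (assumption): the constructor sets every channel.
struct Pixel {
    unsigned char r, g, b;
    Pixel() : r(1), g(2), b(3) {}
};

// malloc only reserves bytes; placement new is what actually creates
// a Pixel object in each slot of that raw storage.
Pixel* make_pixels_malloc(std::size_t n) {
    void* raw = std::malloc(n * sizeof(Pixel));
    if (n != 0 && raw == nullptr) throw std::bad_alloc();
    Pixel* p = static_cast<Pixel*>(raw);
    for (std::size_t i = 0; i < n; ++i) {
        ::new (static_cast<void*>(p + i)) Pixel();
    }
    return p;
}

// Destroy the objects explicitly, then hand the raw storage back.
void destroy_pixels_malloc(Pixel* p, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) p[i].~Pixel();
    std::free(p);
}
```

    By contrast, new Pixel[n] allocates and runs the constructors in one step, and delete [] runs the destructors before releasing the memory.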

    That said, simply changing the malloc to new Pixel[dimensions*dimensions] and free to delete [] pixels does not make much difference with the simple implementation of Pixel that you have. Here are the results on my box (E6600, 64-bit):

    UseArray completed in 0.269 seconds
    UseVector completed in 1.665 seconds
    UseVectorPushBack completed in 7.309 seconds
    The whole thing completed in 9.244 seconds
    

    But with a slight change, the tables turn:

    Pixel.h

    struct Pixel
    {
        Pixel();
        Pixel(unsigned char r, unsigned char g, unsigned char b);
    
        unsigned char r, g, b;
    };
    

    Pixel.cc

    #include "Pixel.h"
    
    Pixel::Pixel() {}
    Pixel::Pixel(unsigned char r, unsigned char g, unsigned char b) 
      : r(r), g(g), b(b) {}
    

    main.cc

    #include "Pixel.h"
    [rest of test harness without class Pixel]
    [UseArray now uses new/delete not malloc/free]
    

    Compiled this way:

    $ g++ -O3 -c -o Pixel.o Pixel.cc
    $ g++ -O3 -c -o main.o main.cc
    $ g++ -o main main.o Pixel.o
    

    we get very different results:

    UseArray completed in 2.78 seconds
    UseVector completed in 1.651 seconds
    UseVectorPushBack completed in 7.826 seconds
    The whole thing completed in 12.258 seconds
    

    With a non-inlined constructor for Pixel, std::vector now beats a raw array.

    It would appear that the complexity of allocation through std::vector and std::allocator is too much to be optimised as effectively as a simple new Pixel[n]. However, we can see that the problem lies with the allocation, not with vector access, by tweaking a couple of the test functions to create the vector/array once, moving it outside the loop:

    void UseVector()
    {
        TestTimer t("UseVector");
    
        int dimension = 999;
        std::vector<Pixel> pixels;
        pixels.resize(dimension * dimension);
    
        for(int i = 0; i < 1000; ++i)
        {
            for(int i = 0; i < dimension * dimension; ++i)
            {
                pixels[i].r = 255;
                pixels[i].g = 0;
                pixels[i].b = 0;
            }
        }
    }
    

    and

    void UseArray()
    {
        TestTimer t("UseArray");
    
        int dimension = 999;
        Pixel * pixels = new Pixel[dimension * dimension];
    
        for(int i = 0; i < 1000; ++i)
        {
            for(int i = 0 ; i < dimension * dimension; ++i)
            {
                pixels[i].r = 255;
                pixels[i].g = 0;
                pixels[i].b = 0;
            }
        }
        delete [] pixels;
    }
    

    We get these results now:

    UseArray completed in 0.254 seconds
    UseVector completed in 0.249 seconds
    UseVectorPushBack completed in 7.298 seconds
    The whole thing completed in 7.802 seconds
    

    What we can learn from this is that std::vector is comparable to a raw array for access, but if you need to create and delete the vector/array many times, creating a complex object will be more time-consuming than creating a simple array when the element's constructor is not inlined. I don't think that this is very surprising.
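    The test functions in this answer rely on a TestTimer class from the question's harness, which is not reproduced here. As a rough stand-in, a scope-based timer could look like the sketch below. ScopedTimer and elapsed_seconds are hypothetical names, and the use of std::chrono::steady_clock is an assumption rather than what the original harness did.

```cpp
#include <chrono>
#include <iostream>
#include <string>
#include <utility>

// Hypothetical RAII timer: records the time at construction and prints the
// elapsed wall-clock time when it goes out of scope.
class ScopedTimer {
public:
    explicit ScopedTimer(std::string name)
        : name_(std::move(name)), start_(std::chrono::steady_clock::now()) {}

    ~ScopedTimer() {
        std::cout << name_ << " completed in " << elapsed_seconds()
                  << " seconds\n";
    }

    // Seconds elapsed since construction, as a floating-point value.
    double elapsed_seconds() const {
        std::chrono::duration<double> d =
            std::chrono::steady_clock::now() - start_;
        return d.count();
    }

private:
    std::string name_;
    std::chrono::steady_clock::time_point start_;
};
```

    Declaring ScopedTimer t("UseVector"); as the first statement of a function then times everything up to the closing brace, so whether the allocation sits inside or outside the measured loop is visible directly in the code.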

  • 2020-11-22 12:17

    It seems to depend on the compiler flags. Here is a benchmark:

    #include <chrono>
    #include <cmath>
    #include <cstdlib>
    #include <ctime>
    #include <iostream>
    #include <vector>
    
    
    int main(){
    
        int size = 1000000; // reduce this number in case your program crashes
        int L = 10;
    
        std::cout << "size=" << size << " L=" << L << std::endl;
        {
            srand( time(0) );
            double * data = new double[size];
            double result = 0.;
            std::chrono::steady_clock::time_point start = std::chrono::steady_clock::now();
            for( int l = 0; l < L; l++ ) {
                for( int i = 0; i < size; i++ ) data[i] = rand() % 100;
                for( int i = 0; i < size; i++ ) result += data[i] * data[i];
            }
            std::chrono::steady_clock::time_point end   = std::chrono::steady_clock::now();
            auto duration = std::chrono::duration_cast<std::chrono::microseconds>(end - start).count();
            std::cout << "Calculation result is " << sqrt(result) << "\n";
            std::cout << "Duration of C style heap array:    " << duration << "ms\n";
            delete[] data;
        }
    
        {
            srand( 1 + time(0) );
            double data[size]; // technically, non-compliant with C++ standard.
            double result = 0.;
            std::chrono::steady_clock::time_point start = std::chrono::steady_clock::now();
            for( int l = 0; l < L; l++ ) {
                for( int i = 0; i < size; i++ ) data[i] = rand() % 100;
                for( int i = 0; i < size; i++ ) result += data[i] * data[i];
            }
            std::chrono::steady_clock::time_point end   = std::chrono::steady_clock::now();
            auto duration = std::chrono::duration_cast<std::chrono::microseconds>(end - start).count();
            std::cout << "Calculation result is " << sqrt(result) << "\n";
            std::cout << "Duration of C99 style stack array: " << duration << "ms\n";
        }
    
        {
            srand( 2 + time(0) );
            std::vector<double> data( size );
            double result = 0.;
            std::chrono::steady_clock::time_point start = std::chrono::steady_clock::now();
            for( int l = 0; l < L; l++ ) {
                for( int i = 0; i < size; i++ ) data[i] = rand() % 100;
                for( int i = 0; i < size; i++ ) result += data[i] * data[i];
            }
            std::chrono::steady_clock::time_point end   = std::chrono::steady_clock::now();
            auto duration = std::chrono::duration_cast<std::chrono::microseconds>(end - start).count();
            std::cout << "Calculation result is " << sqrt(result) << "\n";
            std::cout << "Duration of std::vector array:     " << duration << "ms\n";
        }
    
        return 0;
    }
    

    Different optimization flags give different answers (note that the durations are measured in microseconds, even though the program labels them "ms"):

    $ g++ -O0 benchmark.cpp 
    $ ./a.out 
    size=1000000 L=10
    Calculation result is 181182
    Duration of C style heap array:    118441ms
    Calculation result is 181240
    Duration of C99 style stack array: 104920ms
    Calculation result is 181210
    Duration of std::vector array:     124477ms
    $ g++ -O3 benchmark.cpp
    $ ./a.out 
    size=1000000 L=10
    Calculation result is 181213
    Duration of C style heap array:    107803ms
    Calculation result is 181198
    Duration of C99 style stack array: 87247ms
    Calculation result is 181204
    Duration of std::vector array:     89083ms
    $ g++ -Ofast benchmark.cpp 
    $ ./a.out 
    size=1000000 L=10
    Calculation result is 181164
    Duration of C style heap array:    93530ms
    Calculation result is 181179
    Duration of C99 style stack array: 80620ms
    Calculation result is 181191
    Duration of std::vector array:     78830ms
    

    Your exact results will vary but this is quite typical on my machine.

  • 2020-11-22 12:17

    In my experience, sometimes, just sometimes, vector<int> can be many times slower than int[]. One thing to keep in mind is that vectors of vectors are very unlike int[][], because the elements are probably not contiguous in memory. This means you can resize the individual vectors inside the main one, but the CPU might not be able to cache elements as well as it can with int[][].
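    When per-row resizing is not needed, one common way to get the contiguity of int[][] back is to store the grid in a single vector and compute the index by hand. The sketch below is illustrative only; FlatGrid, at and data are hypothetical names.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical 2D grid stored in one contiguous std::vector<int>,
// indexed in row-major order like int[rows][cols].
class FlatGrid {
public:
    FlatGrid(std::size_t rows, std::size_t cols)
        : cols_(cols), cells_(rows * cols, 0) {}

    // Element at row r, column c.
    int& at(std::size_t r, std::size_t c) { return cells_[r * cols_ + c]; }

    // Pointer to the first cell of the single underlying buffer.
    const int* data() const { return cells_.data(); }

private:
    std::size_t cols_;
    std::vector<int> cells_;
};
```

    Here the last cell of one row is immediately followed in memory by the first cell of the next, whereas each inner vector of a std::vector<std::vector<int>> owns a separate allocation.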

  • 2020-11-22 12:19

    This is an old but popular question.

    At this point, many programmers will be working in C++11. And in C++11 the OP's code as written runs equally fast for UseArray or UseVector.

    UseVector completed in 3.74482 seconds
    UseArray completed in 3.70414 seconds
    

    The fundamental problem was that while your Pixel structure was uninitialized, std::vector<T>::resize( size_t, T const&=T() ) takes a default constructed Pixel and copies it. The compiler did not notice it was being asked to copy uninitialized data, so it actually performed the copy.

    In C++11, std::vector<T>::resize has two overloads. The first is std::vector<T>::resize(size_t), the other is std::vector<T>::resize(size_t, T const&). This means when you invoke resize without a second argument, it simply default constructs, and the compiler is smart enough to realize that default construction does nothing, so it skips the pass over the buffer.

    (The two overloads were added to handle types that are movable and default-constructible but not copyable; the performance improvement when working on uninitialized data is a bonus.)
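    The difference between the two overloads can be observed with a small probe type that counts how its instances are created. This is a sketch under C++11 or later; Counted, reset_counts, grow_by_default and grow_by_copy are hypothetical names introduced for illustration.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical probe type that counts how its instances come into being.
struct Counted {
    static int defaults;
    static int copies;
    Counted() { ++defaults; }
    Counted(const Counted&) { ++copies; }
    Counted& operator=(const Counted&) { return *this; }
};
int Counted::defaults = 0;
int Counted::copies = 0;

void reset_counts() {
    Counted::defaults = 0;
    Counted::copies = 0;
}

// resize(n): the new elements are default-inserted; no prototype is copied.
void grow_by_default(std::vector<Counted>& v, std::size_t n) {
    v.resize(n);
}

// resize(n, value): every new element is copy-constructed from value.
void grow_by_copy(std::vector<Counted>& v, std::size_t n) {
    v.resize(n, Counted());
}
```

    Starting from an empty vector, grow_by_default runs the default constructor once per element and makes no copies, while grow_by_copy builds one temporary and then copies it into every slot.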

    The push_back solution also does fencepost checking, which slows it down, so it remains slower than the malloc version.
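    Part of the push_back cost can be reduced by reserving capacity up front, although the capacity check on every call remains. The following sketch is an illustration with hypothetical helper names (fill_with_reserve, count_capacity_growths); it is not taken from the original test.

```cpp
#include <cstddef>
#include <vector>

// Fill a vector with n copies of value via push_back, reserving first so
// that the capacity check inside push_back never triggers a reallocation.
std::vector<int> fill_with_reserve(std::size_t n, int value) {
    std::vector<int> v;
    v.reserve(n);
    for (std::size_t i = 0; i < n; ++i) v.push_back(value);
    return v;
}

// Count how often the capacity grows when nothing is reserved up front.
std::size_t count_capacity_growths(std::size_t n) {
    std::vector<int> v;
    std::size_t growths = 0;
    for (std::size_t i = 0; i < n; ++i) {
        const std::size_t before = v.capacity();
        v.push_back(0);
        if (v.capacity() != before) ++growths;
    }
    return growths;
}
```

    Each capacity growth implies a new allocation plus moving the existing elements, which is the work that reserve avoids.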

    live example (I also replaced the timer with chrono::high_resolution_clock).

    Note that if you have a structure that usually requires initialization, but you want to handle it after growing your buffer, you can do this with a custom std::vector allocator. If you want to then move it into a more normal std::vector, I believe careful use of allocator_traits and overriding of == might pull that off, but am unsure.
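    One known shape for such an allocator is sketched below: it wraps another allocator and turns argument-less construction into default-initialization, so trivial element types are left untouched when the buffer grows. The name default_init_allocator is illustrative and not a standard library type; treat this as a sketch under C++11 or later rather than a definitive implementation.

```cpp
#include <cstddef>
#include <memory>
#include <new>
#include <type_traits>
#include <utility>
#include <vector>

// Allocator adaptor (illustrative): forwards everything to A, except that
// construction without arguments default-initializes instead of
// value-initializing.
template <typename T, typename A = std::allocator<T>>
class default_init_allocator : public A {
    using traits = std::allocator_traits<A>;

public:
    template <typename U>
    struct rebind {
        using other =
            default_init_allocator<U, typename traits::template rebind_alloc<U>>;
    };

    using A::A;  // inherit the underlying allocator's constructors

    // No arguments: default-initialize, which does nothing for trivial types.
    template <typename U>
    void construct(U* ptr) noexcept(
        std::is_nothrow_default_constructible<U>::value) {
        ::new (static_cast<void*>(ptr)) U;
    }

    // Any other construction is forwarded to the underlying allocator.
    template <typename U, typename... Args>
    void construct(U* ptr, Args&&... args) {
        traits::construct(static_cast<A&>(*this), ptr,
                          std::forward<Args>(args)...);
    }
};
```

    With this, std::vector<int, default_init_allocator<int>> v(n); has n elements whose values are indeterminate until written, so every element must be assigned before it is read.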

  • By the way, the slowdown you're seeing in classes using vector also occurs with standard types like int. Here's some multithreaded code:

    #include <iostream>
    #include <cstdio>
    #include <map>
    #include <string>
    #include <typeinfo>
    #include <vector>
    #include <pthread.h>
    #include <sstream>
    #include <fstream>
    using namespace std;
    
    //pthread_mutex_t map_mutex=PTHREAD_MUTEX_INITIALIZER;
    
    long long num=500000000;
    int procs=1;
    
    struct iterate
    {
        int id;
        int num;
        void * member;
        iterate(int a, int b, void *c) : id(a), num(b), member(c) {}
    };
    
    //fill out viterate and piterate
    void * viterate(void * input)
    {
        printf("am in viterate\n");
        iterate * info=static_cast<iterate *> (input);
        // reproduce member type
        vector<int>& test= *static_cast<vector<int>*> (info->member); // bind by reference to avoid copying the whole vector
        for (int i=info->id; i<test.size(); i+=info->num)
        {
            //printf("am in viterate loop\n");
            test[i];
        }
        pthread_exit(NULL);
    }
    
    void * piterate(void * input)
    {
        printf("am in piterate\n");
        iterate * info=static_cast<iterate *> (input);
        int * test=static_cast<int *> (info->member);
        for (int i=info->id; i<num; i+=info->num) {
            //printf("am in piterate loop\n");
            test[i];
        }
        pthread_exit(NULL);
    }
    
    int main()
    {
        cout<<"producing vector of size "<<num<<endl;
        vector<int> vtest(num);
        cout<<"produced  a vector of size "<<vtest.size()<<endl;
        pthread_t thread[procs];
    
        iterate** it=new iterate*[procs];
        int ans;
        void *status;
    
        cout<<"begining to thread through the vector\n";
        for (int i=0; i<procs; i++) {
            it[i]=new iterate(i, procs, (void *) &vtest);
            ans=pthread_create(&thread[i],NULL,viterate, (void *) it[i]);
        }
        for (int i=0; i<procs; i++) {
            pthread_join(thread[i], &status);
        }
        cout<<"end of threading through the vector\n";
        //reuse the iterate structures
    
        cout<<"producing a pointer with size "<<num<<endl;
        int * pint=new int[num];
        cout<<"produced a pointer with size "<<num<<endl;
    
        cout<<"begining to thread through the pointer\n";
        for (int i=0; i<procs; i++) {
            it[i]->member=pint; // pass the array itself, not the address of the pointer variable
            ans=pthread_create(&thread[i], NULL, piterate, (void*) it[i]);
        }
        for (int i=0; i<procs; i++) {
            pthread_join(thread[i], &status);
        }
        cout<<"end of threading through the pointer\n";
    
        //delete structure array for iterate
        for (int i=0; i<procs; i++) {
            delete it[i];
        }
        delete [] it;
    
        //delete pointer
        delete [] pint;
    
        cout<<"end of the program"<<endl;
        return 0;
    }
    

    The behavior of the code shows that instantiating the vector is the longest part. Once you get through that bottleneck, the rest of the code runs extremely fast. This is true no matter how many threads you are running.

    By the way, ignore the absolutely insane number of includes. I have been using this code to test things for a project, so the number of includes keeps growing.
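    One likely contributor to the cost of instantiating the vector is that std::vector<int>(n) value-initializes its elements, which means writing n zeros, while new int[n] leaves the memory untouched. The sketch below contrasts the three forms; the helper names are hypothetical and no timings are claimed.

```cpp
#include <cstddef>
#include <memory>
#include <vector>

// std::vector<int>(n) value-initializes its elements, i.e. writes n zeros.
std::vector<int> make_zeroed_vector(std::size_t n) {
    return std::vector<int>(n);
}

// new int[n]() (note the parentheses) also zero-fills the array ...
std::unique_ptr<int[]> make_zeroed_array(std::size_t n) {
    return std::unique_ptr<int[]>(new int[n]());
}

// ... whereas new int[n] leaves the contents indeterminate, so the
// elements must be written before they are read.
std::unique_ptr<int[]> make_uninitialized_array(std::size_t n) {
    return std::unique_ptr<int[]>(new int[n]);
}
```

    For a fair comparison between the vector and the raw pointer, both sides should either pay for the zero-fill or skip it.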

  • 2020-11-22 12:23

    I have to say I'm not an expert in C++, but I can add some experimental results:

    compile: gcc-6.2.0/bin/g++ -O3 -std=c++14 vector.cpp

    machine:

    Intel(R) Xeon(R) CPU E5-2690 v2 @ 3.00GHz 
    

    OS:

    2.6.32-642.13.1.el6.x86_64
    

    Output:

    UseArray completed in 0.167821 seconds
    UseVector completed in 0.134402 seconds
    UseConstructor completed in 0.134806 seconds
    UseFillConstructor completed in 1.00279 seconds
    UseVectorPushBack completed in 6.6887 seconds
    The whole thing completed in 8.12888 seconds
    

    The only thing I find strange here is the performance of "UseFillConstructor" compared with "UseConstructor".

    The code:

    void UseConstructor()
    {
        TestTimer t("UseConstructor");
    
        for(int i = 0; i < 1000; ++i)
        {
            int dimension = 999;
    
            std::vector<Pixel> pixels(dimension*dimension);
            for(int i = 0; i < dimension * dimension; ++i)
            {
                pixels[i].r = 255;
                pixels[i].g = 0;
                pixels[i].b = 0;
            }
        }
    }
    
    
    void UseFillConstructor()
    {
        TestTimer t("UseFillConstructor");
    
        for(int i = 0; i < 1000; ++i)
        {
            int dimension = 999;
    
            std::vector<Pixel> pixels(dimension*dimension, Pixel(255,0,0));
        }
    }
    

    So the additional "value" argument slows things down quite a lot, which I think is due to the many calls to the copy constructor. But...

    Compile:

    gcc-6.2.0/bin/g++ -std=c++14 -O vector.cpp
    

    Output:

    UseArray completed in 1.02464 seconds
    UseVector completed in 1.31056 seconds
    UseConstructor completed in 1.47413 seconds
    UseFillConstructor completed in 1.01555 seconds
    UseVectorPushBack completed in 6.9597 seconds
    The whole thing completed in 11.7851 seconds
    

    So in this case, gcc optimization is very important, but it can't help you much when a value is provided as the default. This actually goes against my intuition. Hopefully it helps new programmers when choosing a vector initialization form.
