Why is splitting a string slower in C++ than Python?

感情败类 2020-12-07 08:14

I'm trying to convert some code from Python to C++ in an effort to gain a little bit of speed and sharpen my rusty C++ skills. Yesterday I was shocked when a naive impleme

8 Answers
  • 2020-12-07 08:59

    If you take the split1 implementation and change its signature to match that of split2 more closely, by changing this:

    void split1(vector<string> &tokens, const string &str, const string &delimiters = " ")
    

    to this:

    void split1(vector<string> &tokens, const string &str, const char delimiters = ' ')
    

    You get a more dramatic difference between split1 and split2, and a fairer comparison:

    split1  C++   : Saw 10000000 lines in 41 seconds.  Crunch speed: 243902
    split2  C++   : Saw 10000000 lines in 144 seconds.  Crunch speed: 69444
    split1' C++   : Saw 10000000 lines in 33 seconds.  Crunch speed: 303030
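
    The body of split1 is not shown in this answer, so the following is only a sketch of the signature change under an assumption: that split1 follows the common find_first_not_of / find_first_of pattern. The names split_by_set and split_by_char are hypothetical. The first takes the delimiters as a set of characters in a string, the second takes a single char and so uses the single-character overloads of the same member functions.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for the original split1: delimiters are a SET of
// characters, any of which ends a token.
void split_by_set(std::vector<std::string> &tokens, const std::string &str,
                  const std::string &delimiters = " ")
{
    std::string::size_type last = str.find_first_not_of(delimiters, 0);
    std::string::size_type pos = str.find_first_of(delimiters, last);
    while (pos != std::string::npos || last != std::string::npos) {
        // When pos is npos the count is clamped to the rest of the string.
        tokens.push_back(str.substr(last, pos - last));
        last = str.find_first_not_of(delimiters, pos);
        pos = str.find_first_of(delimiters, last);
    }
}

// Same loop, but the delimiter is a single char, which selects the
// single-character overloads of find_first_not_of / find_first_of.
void split_by_char(std::vector<std::string> &tokens, const std::string &str,
                   const char delimiter = ' ')
{
    std::string::size_type last = str.find_first_not_of(delimiter, 0);
    std::string::size_type pos = str.find_first_of(delimiter, last);
    while (pos != std::string::npos || last != std::string::npos) {
        tokens.push_back(str.substr(last, pos - last));
        last = str.find_first_not_of(delimiter, pos);
        pos = str.find_first_of(delimiter, last);
    }
}
```

    Both functions yield the same tokens for a single-space delimiter; only the cost of each search differs, which is what the timings above measure.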
    
  • 2020-12-07 09:06

    I think the following code is better; it uses std::string_view from C++17 and std::move from C++11:

    // This code was untested when I wrote this post. I will test it when
    // I have time, and I sincerely welcome others to test and modify it.
    
    // C++17
    #include <istream>     // For std::istream.
    #include <string_view> // New in C++17; sizeof(std::string_view) == 16 with libc++ on my x86-64 Debian 9.4 machine.
    #include <string>      // For std::basic_string and std::getline.
    #include <utility>     // For std::move (available since C++11).
    
    template <template <class...> class Container, class Allocator>
    void split1(Container<std::string_view, Allocator> &tokens, 
                std::string_view str,
                std::string_view delimiter = " ") 
    {
        /* 
         * The model of the input string:
         *
         * (optional) delimiter | content | delimiter | content | delimiter |
         * ... | delimiter | content 
         *
         * Using std::string::find_first_not_of or 
         * std::string_view::find_first_not_of is a bad idea, because it 
         * actually does the following:
         * 
         *     Finds the first character not equal to any of the characters 
         *     in the given character sequence.
         * 
         * This means it does not treat your delimiter as a whole, but as
         * a set of characters.
         * 
         * This has 2 effects:
         *
         *  1. When your delimiter is longer than a single character, the
         *  function won't behave as you expect.
         *
         *  2. When your delimiter is a single character, the function may
         *  carry extra overhead: it has to check every character against
         *  a set of characters, and even though the set has only one
         *  member, it may still run an inner loop to stay correct.
         *
         * So, as a solution, I wrote the following code.
         *
         * The code below skips a delimiter at the very start of the string.
         * However, if there is nothing between 2 delimiters, it still
         * emits an (empty) token for that position.
         *
         * Note: 
         * Here I use the standard library's substring search, but you can
         * switch to Boyer-Moore, KMP (takes additional memory), Rabin-Karp,
         * or another algorithm to speed up your code.
         */
    
        // Establish the loop invariant 1.
        typename std::string_view::size_type 
            next, 
            delimiter_size = delimiter.size(),  
            pos = str.find(delimiter) ? 0 : delimiter_size;
    
        // The loop invariants:
        //  1. pos is the start of the content that should be saved next.
        //  2. next holds the position of the next delimiter at or after pos,
        //  or std::string_view::npos if there is none.
    
        do {
            // Find the next delimiter, maintain loop invariant 2.
            next = str.find(delimiter, pos);
    
            // Found a token, add it to the container. substr takes
            // (position, count), so the count is next - pos; when next is
            // npos the count is clamped to the rest of the string.
            tokens.push_back(str.substr(pos, next - pos));
    
            // Skip the delimiter, maintaining loop invariant 1.
            //
            // When next == std::string_view::npos the addition wraps around
            // (unsigned arithmetic, so it is well-defined), but the loop
            // terminates right after, so pos is never used again.
            pos = next + delimiter_size;
        } while(next != std::string_view::npos);
    }   
    
    template <template <class...> class Container, class traits, class Allocator2, class Allocator>
    void split2(Container<std::basic_string<char, traits, Allocator2>, Allocator> &tokens, 
                std::istream &stream,
                char delimiter = ' ')
    {
        std::basic_string<char, traits, Allocator2> item;
    
        // Unfortunately, std::getline only accepts a single-character
        // delimiter.
        while(std::getline(stream, item, delimiter))
            // Move item into tokens. A moved-from string is left in a valid
            // but unspecified state, and std::getline clears it before
            // reading, so item can safely be reused.
            tokens.push_back(std::move(item));
    }
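
    To make the point about find_first_not_of concrete, here is a minimal sketch that contrasts the two behaviours on a two-character delimiter. The function names split_whole and split_charset are hypothetical and not part of the code above. split_whole follows the same idea as split1 (search for the delimiter as a whole substring), while split_charset uses the character-set searches that the comment warns about.

```cpp
#include <cassert>
#include <string_view>
#include <vector>

// Sketch: treat the delimiter as one whole substring (the split1 idea).
std::vector<std::string_view> split_whole(std::string_view str,
                                          std::string_view delimiter)
{
    std::vector<std::string_view> tokens;
    if (delimiter.empty()) {          // an empty delimiter would never advance
        tokens.push_back(str);
        return tokens;
    }
    // Skip one delimiter if the string starts with it.
    std::string_view::size_type pos =
        str.substr(0, delimiter.size()) == delimiter ? delimiter.size() : 0;
    for (;;) {
        const auto next = str.find(delimiter, pos);
        if (next == std::string_view::npos) {
            tokens.push_back(str.substr(pos));   // last token, may be empty
            break;
        }
        tokens.push_back(str.substr(pos, next - pos));
        pos = next + delimiter.size();
    }
    return tokens;
}

// Sketch: treat the delimiter as a SET of characters, which is what
// find_first_of / find_first_not_of do.
std::vector<std::string_view> split_charset(std::string_view str,
                                            std::string_view delimiters)
{
    std::vector<std::string_view> tokens;
    auto first = str.find_first_not_of(delimiters);
    while (first != std::string_view::npos) {
        const auto last = str.find_first_of(delimiters, first);
        // If last is npos, the count is clamped to the rest of the string.
        tokens.push_back(str.substr(first, last - first));
        // With last == npos this search returns npos and ends the loop.
        first = str.find_first_not_of(delimiters, last);
    }
    return tokens;
}
```

    For the input "a, b,c" with the delimiter ", ", split_whole produces "a" and "b,c", because the lone comma is not the full delimiter. split_charset produces "a", "b" and "c", because it splits on either character. Note that the returned views point into the input, so the input must outlive the tokens.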
    

    The choice of container:

    1. std::vector.

      Assuming the internal array starts with capacity 1 and doubles until it reaches the final size N, you will allocate and deallocate about log2(N) times, and perform about (2 ^ (log2(N) + 1) - 1) = (2N - 1) element copies in total. As pointed out in "Is the poor performance of std::vector due to not calling realloc a logarithmic number of times?", this can perform poorly when the size of the vector is unpredictable and could be very large. But if you can estimate the size in advance and call reserve up front, this is much less of a problem.

    2. std::list.

      Each push_back takes constant time, but an individual push_back will probably take longer than with std::vector, since every element needs its own node allocation. Using a per-thread memory pool and a custom allocator can ease this problem.

    3. std::forward_list.

      Same as std::list, but uses less memory per element. It needs a wrapper class because it has no push_back.

    4. std::array.

      If you know an upper bound on the growth, you can use std::array. Of course, you can't use it directly, since it has no push_back. But you can define a wrapper, and I think this is the fastest option here; it can also save some memory if your estimate is fairly accurate.

    5. std::deque.

      This option lets you trade memory for performance. Growing the deque does not copy the existing elements (unlike the roughly 2N - 1 copies for std::vector); it only allocates new blocks, and no block is deallocated along the way. You also get constant-time random access and the ability to add new elements at both ends.

    According to the cppreference page for std::deque:

    On the other hand, deques typically have large minimal memory cost; a deque holding just one element has to allocate its full internal array (e.g. 8 times the object size on 64-bit libstdc++; 16 times the object size or 4096 bytes, whichever is larger, on 64-bit libc++)
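
    The remark about estimating the size for std::vector can be checked with a small sketch. The helper name count_reallocations is hypothetical. It counts how often the capacity changes while n elements are pushed, with and without a reserve call up front. The growth factor is implementation-specific, so the count without reserve varies between standard libraries.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Count how many times the vector's capacity changes while n elements
// are pushed. Each change means a new buffer was allocated and the
// existing elements were moved or copied into it.
std::size_t count_reallocations(std::size_t n, bool reserve_first)
{
    std::vector<int> v;
    if (reserve_first)
        v.reserve(n);                 // one allocation, sized for all n elements

    std::size_t reallocations = 0;
    std::size_t capacity = v.capacity();
    for (std::size_t i = 0; i < n; ++i) {
        v.push_back(static_cast<int>(i));
        if (v.capacity() != capacity) {
            ++reallocations;
            capacity = v.capacity();
        }
    }
    return reallocations;
}
```

    With reserve_first set to true the function returns 0 for any n, since reserve guarantees enough capacity. Without it, the result grows roughly with the logarithm of n.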

    Or you can use a combination of these:

    1. std::vector< std::array<T, 2 ^ M> >

      This is similar to std::deque, except that the container does not support adding elements at the front. It can still be faster, because growing it does not copy the underlying elements (2 ^ (log2(N) + 1) - 1) times; it only copies the array of pointers about (2 ^ (log2(N) - M + 1) - 1) times, allocates a new array only when the current one is full, and never needs to deallocate a chunk. Note that this holds only if the outer vector stores pointers to the arrays (for example std::vector<std::unique_ptr<std::array<T, 2 ^ M>>>); a plain std::vector<std::array<T, 2 ^ M>> stores the arrays inline and copies or moves them whenever it reallocates. Random access stays constant-time.

    2. std::list< std::array<T, ...> >

      Greatly eases the pressure of memory fragmentation. It allocates a new array only when the current one is full, and never needs to copy anything. Compared to combo 1, you still pay for an additional pointer per chunk.

    3. std::forward_list< std::array<T, ...> >

      Same as 2, but costs the same memory as combo 1.
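
    As an illustration of combo 2, here is a minimal sketch of a wrapper that adds push_back on top of std::list<std::array<T, ChunkSize>>. The class name ChunkedList and its members are hypothetical, and the sketch assumes T is default-constructible and copy-assignable, since each new chunk value-initializes all of its slots.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <list>

// Sketch of combo 2: a list of fixed-size chunks with a push_back wrapper.
// A new chunk is allocated only when the last one is full, and existing
// elements are never copied or moved when the container grows.
template <class T, std::size_t ChunkSize>
class ChunkedList {
    static_assert(ChunkSize > 0, "ChunkSize must be positive");

public:
    void push_back(const T &value)
    {
        if (chunks_.empty() || used_in_last_ == ChunkSize) {
            chunks_.emplace_back();   // value-initializes a whole new chunk
            used_in_last_ = 0;
        }
        chunks_.back()[used_in_last_++] = value;
        ++size_;
    }

    std::size_t size() const { return size_; }
    std::size_t chunk_count() const { return chunks_.size(); }

    // Visit the stored elements in insertion order.
    template <class F>
    void for_each(F f) const
    {
        std::size_t remaining = size_;
        for (const auto &chunk : chunks_) {
            const std::size_t n = remaining < ChunkSize ? remaining : ChunkSize;
            for (std::size_t i = 0; i < n; ++i)
                f(chunk[i]);
            remaining -= n;
        }
    }

private:
    std::list<std::array<T, ChunkSize>> chunks_;
    std::size_t used_in_last_ = 0;    // slots used in the last chunk
    std::size_t size_ = 0;            // total number of stored elements
};
```

    Pushing 10 ints into a ChunkedList<int, 4> allocates 3 chunks. Swapping std::list for std::forward_list (combo 3) would additionally require tracking the last node, since std::forward_list has no back().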
