Why is splitting a string slower in C++ than Python?

后端 未结 8 1675
感情败类
感情败类 2020-12-07 08:14

I\'m trying to convert some code from Python to C++ in an effort to gain a little bit of speed and sharpen my rusty C++ skills. Yesterday I was shocked when a naive impleme

相关标签:
8条回答
  • 2020-12-07 08:39

    I suspect that this is related to buffering on sys.stdin in Python, but no buffering in the C++ implementation.

    See this post for details on how to change the buffer size, then try the comparison again: Setting smaller buffer size for sys.stdin?

    0 讨论(0)
  • 2020-12-07 08:50

    As a guess, Python strings are reference counted immutable strings, so that no strings are copied around in the Python code, while C++ std::string is a mutable value type, and is copied at the smallest opportunity.

    If the goal is fast splitting, then one would use constant time substring operations, which means only referring to parts of the original string, as in Python (and Java, and C#…).

    The C++ std::string class has one redeeming feature, though: it is standard, so that it can be used to pass strings safely and portably around where efficiency is not a main consideration. But enough chat. Code -- and on my machine this is of course faster than Python, since Python's string handling is implemented in C which is a subset of C++ (he he):

    #include <iostream>                                                              
    #include <string>
    #include <sstream>
    #include <time.h>
    #include <vector>
    
    using namespace std;
    
    class StringRef
    {
    private:
        char const*     begin_;
        int             size_;
    
    public:
        int size() const { return size_; }
        char const* begin() const { return begin_; }
        char const* end() const { return begin_ + size_; }
    
        StringRef( char const* const begin, int const size )
            : begin_( begin )
            , size_( size )
        {}
    };
    
    vector<StringRef> split3( string const& str, char delimiter = ' ' )
    {
        vector<StringRef>   result;
    
        enum State { inSpace, inToken };
    
        State state = inSpace;
        char const*     pTokenBegin = 0;    // Init to satisfy compiler.
        for( auto it = str.begin(); it != str.end(); ++it )
        {
            State const newState = (*it == delimiter? inSpace : inToken);
            if( newState != state )
            {
                switch( newState )
                {
                case inSpace:
                    result.push_back( StringRef( pTokenBegin, &*it - pTokenBegin ) );
                    break;
                case inToken:
                    pTokenBegin = &*it;
                }
            }
            state = newState;
        }
        if( state == inToken )
        {
            result.push_back( StringRef( pTokenBegin, &*str.end() - pTokenBegin ) );
        }
        return result;
    }
    
    int main() {
        string input_line;
        vector<string> spline;
        long count = 0;
        int sec, lps;
        time_t start = time(NULL);
    
        cin.sync_with_stdio(false); //disable synchronous IO
    
        while(cin) {
            getline(cin, input_line);
            //spline.clear(); //empty the vector for the next line to parse
    
            //I'm trying one of the two implementations, per compilation, obviously:
    //        split1(spline, input_line);  
            //split2(spline, input_line);
    
            vector<StringRef> const v = split3( input_line );
            count++;
        };
    
        count--; //subtract for final over-read
        sec = (int) time(NULL) - start;
        cerr << "C++   : Saw " << count << " lines in " << sec << " seconds." ;
        if (sec > 0) {
            lps = count / sec;
            cerr << "  Crunch speed: " << lps << endl;
        } else
            cerr << endl;
        return 0;
    }
    
    //compiled with: g++ -Wall -O3 -o split1 split_1.cpp -std=c++0x
    

    Disclaimer: I hope there aren't any bugs. I haven't tested the functionality, but only checked the speed. But I think, even if there is a bug or two, correcting that won't significantly affect the speed.

    0 讨论(0)
  • 2020-12-07 08:50
    void split5(vector<string> &tokens, const string &str, char delim=' ') {
    
        enum { do_token, do_delim } state = do_delim;
        int idx = 0, tok_start = 0;
        for (string::const_iterator it = str.begin() ; ; ++it, ++idx) {
            switch (state) {
                case do_token:
                    if (it == str.end()) {
                        tokens.push_back (str.substr(tok_start, idx-tok_start));
                        return;
                    }
                    else if (*it == delim) {
                        state = do_delim;
                        tokens.push_back (str.substr(tok_start, idx-tok_start));
                    }
                    break;
    
                case do_delim:
                    if (it == str.end()) {
                        return;
                    }
                    if (*it != delim) {
                        state = do_token;
                        tok_start = idx;
                    }
                    break;
            }
        }
    }
    
    0 讨论(0)
  • 2020-12-07 08:52

    I suspect that this is because of the way std::vector gets resized during the process of a push_back() function call. If you try using std::list or std::vector::reserve() to reserve enough space for the sentences, you should get a much better performance. Or you could use a combination of both like below for split1():

    void split1(vector<string> &tokens, const string &str,
            const string &delimiters = " ") {
        // Skip delimiters at beginning
        string::size_type lastPos = str.find_first_not_of(delimiters, 0);
    
        // Find first non-delimiter
        string::size_type pos = str.find_first_of(delimiters, lastPos);
        list<string> token_list;
    
        while (string::npos != pos || string::npos != lastPos) {
            // Found a token, add it to the list
            token_list.push_back(str.substr(lastPos, pos - lastPos));
            // Skip delimiters
            lastPos = str.find_first_not_of(delimiters, pos);
            // Find next non-delimiter
            pos = str.find_first_of(delimiters, lastPos);
        }
        tokens.assign(token_list.begin(), token_list.end());
    }
    

    EDIT: The other obvious thing I see is that Python variable dummy gets assigned each time but not modified. So it's not a fair comparison against C++. You should try modifying your Python code to be dummy = [] to initialize it and then do dummy += line.split(). Can you report the runtime after this?

    EDIT2: To make it even more fair can you modify the while loop in C++ code to be:

        while(cin) {
            getline(cin, input_line);
            std::vector<string> spline; // create a new vector
    
            //I'm trying one of the two implementations, per compilation, obviously:
    //        split1(spline, input_line);  
            split2(spline, input_line);
    
            count++;
        };
    
    0 讨论(0)
  • 2020-12-07 08:56

    I'm not providing any better solutions (at least performance-wise), but some additional data that could be interesting.

    Using strtok_r (reentrant variant of strtok):

    void splitc1(vector<string> &tokens, const string &str,
            const string &delimiters = " ") {
        char *saveptr;
        char *cpy, *token;
    
        cpy = (char*)malloc(str.size() + 1);
        strcpy(cpy, str.c_str());
    
        for(token = strtok_r(cpy, delimiters.c_str(), &saveptr);
            token != NULL;
            token = strtok_r(NULL, delimiters.c_str(), &saveptr)) {
            tokens.push_back(string(token));
        }
    
        free(cpy);
    }
    

    Additionally using character strings for parameters, and fgets for input:

    void splitc2(vector<string> &tokens, const char *str,
            const char *delimiters) {
        char *saveptr;
        char *cpy, *token;
    
        cpy = (char*)malloc(strlen(str) + 1);
        strcpy(cpy, str);
    
        for(token = strtok_r(cpy, delimiters, &saveptr);
            token != NULL;
            token = strtok_r(NULL, delimiters, &saveptr)) {
            tokens.push_back(string(token));
        }
    
        free(cpy);
    }
    

    And, in some cases, where destroying the input string is acceptable:

    void splitc3(vector<string> &tokens, char *str,
            const char *delimiters) {
        char *saveptr;
        char *token;
    
        for(token = strtok_r(str, delimiters, &saveptr);
            token != NULL;
            token = strtok_r(NULL, delimiters, &saveptr)) {
            tokens.push_back(string(token));
        }
    }
    

    The timings for these are as follows (including my results for the other variants from the question and the accepted answer):

    split1.cpp:  C++   : Saw 20000000 lines in 31 seconds.  Crunch speed: 645161
    split2.cpp:  C++   : Saw 20000000 lines in 45 seconds.  Crunch speed: 444444
    split.py:    Python: Saw 20000000 lines in 33 seconds.  Crunch Speed: 606060
    split5.py:   Python: Saw 20000000 lines in 35 seconds.  Crunch Speed: 571428
    split6.cpp:  C++   : Saw 20000000 lines in 18 seconds.  Crunch speed: 1111111
    
    splitc1.cpp: C++   : Saw 20000000 lines in 27 seconds.  Crunch speed: 740740
    splitc2.cpp: C++   : Saw 20000000 lines in 22 seconds.  Crunch speed: 909090
    splitc3.cpp: C++   : Saw 20000000 lines in 20 seconds.  Crunch speed: 1000000
    

    As we can see, the solution from the accepted answer is still fastest.

    For anyone who would want to do further tests, I also put up a Github repo with all the programs from the question, the accepted answer, this answer, and additionally a Makefile and a script to generate test data: https://github.com/tobbez/string-splitting.

    0 讨论(0)
  • 2020-12-07 08:58

    You're making the mistaken assumption that your chosen C++ implementation is necessarily faster than Python's. String handling in Python is highly optimized. See this question for more: Why do std::string operations perform poorly?

    0 讨论(0)
提交回复
热议问题