Why is reading lines from stdin much slower in C++ than Python?

野趣味 2020-11-22 03:06

I wanted to compare reading lines of string input from stdin using Python and C++ and was shocked to see my C++ code run an order of magnitude slower than the equivalent Pyt

10 answers
  • 2020-11-22 03:42

    By the way, the C++ version reports one more line than the Python version because the eof flag is only set when an attempt is made to read beyond the end of the input, so the loop counts one final, failed read. A correct loop tests the result of getline itself:

    while (getline(cin, input_line)) {
        line_count++;
    }
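
    To see the off-by-one concretely, here is a minimal sketch that runs both loop styles over an in-memory stream. The function names `count_with_eof_check` and `count_with_getline` are made up for illustration, and `std::istringstream` stands in for `cin`.

```cpp
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>

// Checks the stream state BEFORE reading, so the final failed read
// (the one that actually hits end-of-input) is still counted.
std::size_t count_with_eof_check(std::istream& in) {
    std::string line;
    std::size_t count = 0;
    while (!in.eof()) {
        std::getline(in, line);
        ++count;
    }
    return count;
}

// Uses the result of getline itself as the loop condition,
// so only successful reads are counted.
std::size_t count_with_getline(std::istream& in) {
    std::string line;
    std::size_t count = 0;
    while (std::getline(in, line)) {
        ++count;
    }
    return count;
}
```

    For input that ends with a newline, the first function reports one line more than the second.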
    
  • getline, stream operators, and scanf can be convenient if you don't care about file loading time or if you are loading small text files. But if performance is something you care about, you should really just buffer the entire file into memory (assuming it will fit).

    Here's an example:

    // Open the file in binary mode.
    std::ifstream file(filename, std::ios::in | std::ios::binary);
    if (!file) return nullptr;
    
    // Determine the file size by seeking to the end.
    file.seekg(0, std::ios::end);
    size_t length = static_cast<size_t>(file.tellg());
    file.seekg(0, std::ios::beg);
    
    // Read into a memory buffer, then close the file.
    // The buffer must be released with delete[] when no longer needed.
    char *filebuf = new char[length + 1];
    file.read(filebuf, length);
    filebuf[length] = '\0';  // make it null-terminated
    file.close();
    

    If you want, you can wrap a stream around that buffer for more convenient access, like this (note that std::istrstream, from <strstream>, is deprecated in standard C++):

    std::istrstream header(&filebuf[0], length);
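
    As a hedged alternative to the manual buffer and the deprecated strstream classes, the same idea can be sketched with standard containers: read everything into a `std::string`, then wrap it in a `std::istringstream`. The helper names `slurp` and `count_lines_in` are made up for illustration, and the sketch assumes the input fits in memory.

```cpp
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>

// Read the whole stream into memory in one go (assumes it fits).
std::string slurp(std::istream& in) {
    std::ostringstream contents;
    contents << in.rdbuf();
    return contents.str();
}

// Parse the buffered text line by line through a string stream.
std::size_t count_lines_in(const std::string& text) {
    std::istringstream lines(text);
    std::string line;
    std::size_t count = 0;
    while (std::getline(lines, line)) {
        ++count;
    }
    return count;
}
```

    With a file, `slurp` would be handed a `std::ifstream` opened in binary mode; with standard input, `std::cin`.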
    

    Also, if you are in control of the file, consider using a flat binary data format instead of text. It's more reliable to read and write because you don't have to deal with all the ambiguities of whitespace. It's also smaller and much faster to parse.

  • 2020-11-22 03:45

    A first element of an answer: <iostream> is slow. Damn slow. I get a huge performance boost with scanf, as in the code below, but it is still two times slower than Python.

    #include <cstdio>
    #include <ctime>
    #include <iostream>
    
    using namespace std;
    
    int main() {
        char buffer[10000];
        long line_count = 0;
        time_t start = time(NULL);
        int sec;
        int lps;
    
        // Note: %s reads whitespace-delimited tokens, so this only counts
        // lines correctly when no line is empty or contains spaces or tabs.
        // The width limit keeps scanf from overflowing the buffer.
        while (scanf("%9999s", buffer) == 1) {
            line_count++;
        }
        sec = (int) (time(NULL) - start);
        cerr << "Saw " << line_count << " lines in " << sec << " seconds.";
        if (sec > 0) {
            lps = line_count / sec;
            cerr << "  Crunch speed: " << lps << endl;
        }
        else
            cerr << endl;
        return 0;
    }
    
  • 2020-11-22 03:47

    tl;dr: Because of different default settings in C++ requiring more system calls.

    By default, cin is synchronized with stdio, which causes it to avoid any input buffering. If you add this to the top of your main, you should see much better performance:

    std::ios_base::sync_with_stdio(false);
    

    Normally, when an input stream is buffered, instead of reading one character at a time, the stream will be read in larger chunks. This reduces the number of system calls, which are typically relatively expensive. However, since the FILE* based stdio and iostreams often have separate implementations and therefore separate buffers, this could lead to a problem if both were used together. For example:

    int myvalue1;
    cin >> myvalue1;
    int myvalue2;
    scanf("%d",&myvalue2);
    

    If more input was read by cin than it actually needed, then the second integer value wouldn't be available for the scanf function, which has its own independent buffer. This would lead to unexpected results.

    To avoid this, by default, streams are synchronized with stdio. One common way to achieve this is to have cin read each character one at a time as needed using stdio functions. Unfortunately, this introduces a lot of overhead. For small amounts of input, this isn't a big problem, but when you are reading millions of lines, the performance penalty is significant.

    Fortunately, the library designers decided that you should also be able to disable this feature to get improved performance if you know what you are doing, so they provided the sync_with_stdio function.

  • 2020-11-22 03:59

    Well, I see that in your second solution you switched from cin to scanf, which was the first suggestion I was going to make (cin is sloooooooooooow). Now, if you switch from scanf to fgets, you should see another boost in performance: fgets is the fastest C++ function for string input.

    BTW, didn't know about that sync thing, nice. But you should still try fgets.
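
    For reference, a minimal sketch of counting lines with `fgets`. The function name `count_lines_fgets` and the 4096-byte buffer size are arbitrary choices for illustration. Lines longer than the buffer arrive in several pieces, so the sketch counts a line only when a piece ends in a newline, plus a final unterminated line if there is one.

```cpp
#include <cstddef>
#include <cstdio>
#include <cstring>

std::size_t count_lines_fgets(std::FILE* in) {
    char buffer[4096];
    std::size_t count = 0;
    bool mid_line = false;  // true while inside a line that has not ended yet
    while (std::fgets(buffer, sizeof buffer, in) != nullptr) {
        std::size_t len = std::strlen(buffer);
        if (len > 0 && buffer[len - 1] == '\n') {
            ++count;          // this piece completes a line
            mid_line = false;
        } else {
            mid_line = true;  // long line or last line without newline
        }
    }
    if (mid_line) {
        ++count;  // the last line had no trailing newline
    }
    return count;
}
```

    For standard input this would be called as `count_lines_fgets(stdin)`.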

  • 2020-11-22 04:01

    The following code was faster for me than the other code posted here so far (Visual Studio 2013, 64-bit, 500 MB file with line lengths uniformly distributed in [0, 1000)):

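    // Requires <algorithm>, <cstdio>, and <vector>; line_count is declared elsewhere.
    const size_t buffer_size = 500 * 1024;  // Too large/small buffer is not good.
    std::vector<char> buffer(buffer_size);
    size_t size;  // fread returns size_t, not int
    while ((size = fread(buffer.data(), sizeof(char), buffer_size, stdin)) > 0) {
        // Count the newline characters in the chunk that was just read.
        line_count += std::count(buffer.begin(), buffer.begin() + size, '\n');
    }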
    

    It beats all my Python attempts by more than a factor of 2.
