Why is reading lines from stdin much slower in C++ than Python?

野趣味 2020-11-22 03:06

I wanted to compare reading lines of string input from stdin using Python and C++ and was shocked to see my C++ code run an order of magnitude slower than the equivalent Pyt

10 Answers
  • 2020-11-22 04:03

    Just out of curiosity I've taken a look at what happens under the hood, and I've used dtruss/strace on each test.

    C++

    ./a.out < in
    Saw 6512403 lines in 8 seconds.  Crunch speed: 814050
    

    Syscalls (sudo dtruss -c ./a.out < in):

    CALL                                        COUNT
    __mac_syscall                                   1
    <snip>
    open                                            6
    pread                                           8
    mprotect                                       17
    mmap                                           22
    stat64                                         30
    read_nocancel                               25958
    

    Python

    ./a.py < in
    Read 6512402 lines in 1 seconds. LPS: 6512402
    

    Syscalls (sudo dtruss -c ./a.py < in):

    CALL                                        COUNT
    __mac_syscall                                   1
    <snip>
    open                                            5
    pread                                           8
    mprotect                                       17
    mmap                                           21
    stat64                                         29
    
  • 2020-11-22 04:04

    In your second example (with scanf()), the reason it is still slower might be that scanf("%s") parses the string and looks for any whitespace character (space, tab, newline).

    Also, yes, CPython does some caching to avoid hard disk reads.
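
The whitespace behaviour of "%s" is easy to check in isolation. The following is a minimal sketch, not code from the original post: first_scanf_token is a hypothetical helper name, and it uses sscanf on an in-memory string purely so the result can be inspected without real input.

```cpp
#include <cassert>
#include <cstdio>
#include <string>

// Hypothetical helper: returns the first token that the "%s" conversion
// extracts from `text`, or an empty string if there is no token at all.
std::string first_scanf_token(const char* text) {
    assert(text != nullptr);
    char token[256];
    // "%255s" skips leading whitespace, then copies characters until the
    // next whitespace character (space, tab, newline) or 255 characters.
    if (std::sscanf(text, "%255s", token) != 1) {
        return std::string();
    }
    return std::string(token);
}
```

Because every character has to be inspected for whitespace, and a line with spaces is split into several tokens, a scanf("%s") loop is not doing the same work as a line-oriented read.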

  • 2020-11-22 04:06

    I reproduced the original result on my computer using g++ on a Mac.

    Adding the following statements to the C++ version just before the while loop brings it in line with the Python version:

    std::ios_base::sync_with_stdio(false);
    char buffer[1048576];
    std::cin.rdbuf()->pubsetbuf(buffer, sizeof(buffer));
    

    sync_with_stdio improved speed to 2 seconds, and setting a larger buffer brought it down to 1 second.
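
For context, here is a rough sketch of how those statements might sit around a line-counting loop. The function names (count_lines, count_lines_in, count_stdin_lines_fast) are made up for illustration, and the original program's loop may differ. Note that the effect of pubsetbuf on std::cin is implementation-defined, so the larger buffer should be treated as a hint that happens to help with some standard libraries.

```cpp
#include <cassert>
#include <cstddef>
#include <iostream>
#include <sstream>
#include <string>

// Count newline-delimited records on any input stream.
std::size_t count_lines(std::istream& in) {
    std::string line;
    std::size_t n = 0;
    while (std::getline(in, line)) {
        ++n;
    }
    return n;
}

// Convenience wrapper for trying the counter on in-memory text.
std::size_t count_lines_in(const std::string& text) {
    std::istringstream in(text);
    return count_lines(in);
}

// Apply the two tweaks from the answer, then count lines on std::cin.
// Call this before any other I/O on std::cin, and keep `buffer` alive
// for as long as std::cin is in use.
std::size_t count_stdin_lines_fast(char* buffer, std::size_t size) {
    assert(buffer != nullptr && size > 0);
    std::ios_base::sync_with_stdio(false);  // stop syncing with C stdio
    std::cin.rdbuf()->pubsetbuf(buffer, static_cast<std::streamsize>(size));
    return count_lines(std::cin);
}
```

A main function would typically declare a static char buffer[1048576], call count_stdin_lines_fast(buffer, sizeof(buffer)) once, and print the result.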

  • 2020-11-22 04:08

    I'm a few years behind here, but:

    In 'Edit 4/5/6' of the original post, you are using the construction:

    $ /usr/bin/time cat big_file | program_to_benchmark
    

    This is wrong in a couple of different ways:

    1. You're actually timing the execution of cat, not your benchmark. The 'user' and 'sys' CPU usage displayed by time are those of cat, not your benchmarked program. Even worse, the 'real' time is also not necessarily accurate. Depending on the implementation of cat and of pipelines in your local OS, it is possible that cat writes a final giant buffer and exits long before the reader process finishes its work.

    2. Use of cat is unnecessary and in fact counterproductive; you're adding moving parts. If you were on a sufficiently old system (i.e. with a single CPU and -- in certain generations of computers -- I/O faster than CPU) -- the mere fact that cat was running could substantially color the results. You are also subject to whatever input and output buffering and other processing cat may do. (This would likely earn you a 'Useless Use Of Cat' award if I were Randal Schwartz.)

    A better construction would be:

    $ /usr/bin/time program_to_benchmark < big_file
    

    In this statement it is the shell which opens big_file, passing it to your program (well, actually to time which then executes your program as a subprocess) as an already-open file descriptor. 100% of the file reading is strictly the responsibility of the program you're trying to benchmark. This gets you a real reading of its performance without spurious complications.

    I will mention two possible, but actually wrong, 'fixes' which could also be considered (but I 'number' them differently as these are not things which were wrong in the original post):

    A. You could 'fix' this by timing only your program:

    $ cat big_file | /usr/bin/time program_to_benchmark
    

    B. or by timing the entire pipeline:

    $ /usr/bin/time sh -c 'cat big_file | program_to_benchmark'
    

    These are wrong for the same reasons as #2: they're still using cat unnecessarily. I mention them for a few reasons:

    • they're more 'natural' for people who aren't entirely comfortable with the I/O redirection facilities of the POSIX shell

    • there may be cases where cat is needed (e.g.: the file to be read requires some sort of privilege to access, and you do not want to grant that privilege to the program to be benchmarked: sudo cat /dev/sda | /usr/bin/time my_compression_test --no-output)

    • in practice, on modern machines, the added cat in the pipeline is probably of no real consequence.

    But I say that last thing with some hesitation. If we examine the last result in 'Edit 5' --

    $ /usr/bin/time cat temp_big_file | wc -l
    0.01user 1.34system 0:01.83elapsed 74%CPU ...
    

    -- this claims that cat consumed 74% of the CPU during the test; and indeed 1.34/1.83 is approximately 74%. Perhaps a run of:

    $ /usr/bin/time wc -l < temp_big_file
    

    would have taken only the remaining .49 seconds! Probably not: cat here had to pay for the read() system calls (or equivalent) which transferred the file from 'disk' (actually buffer cache), as well as the pipe writes to deliver them to wc. The correct test would still have had to do those read() calls; only the write-to-pipe and read-from-pipe calls would have been saved, and those should be pretty cheap.

    Still, I predict you would be able to measure the difference between cat file | wc -l and wc -l < file and find a noticeable (2-digit percentage) difference. Each of the slower tests will have paid a similar penalty in absolute time; which would however amount to a smaller fraction of its larger total time.

    In fact I did some quick tests with a 1.5 gigabyte file of garbage, on a Linux 3.13 (Ubuntu 14.04) system, obtaining these results (these are actually 'best of 3' results; after priming the cache, of course):

    $ time wc -l < /tmp/junk
    real 0.280s user 0.156s sys 0.124s (total cpu 0.280s)
    $ time cat /tmp/junk | wc -l
    real 0.407s user 0.157s sys 0.618s (total cpu 0.775s)
    $ time sh -c 'cat /tmp/junk | wc -l'
    real 0.411s user 0.118s sys 0.660s (total cpu 0.778s)
    

    Notice that the two pipeline results claim to have taken more CPU time (user+sys) than real wall-clock time. This is because I'm using the shell (bash)'s built-in 'time' command, which is cognizant of the pipeline; and I'm on a multi-core machine where separate processes in a pipeline can use separate cores, accumulating CPU time faster than realtime. Using /usr/bin/time I see smaller CPU time than realtime -- showing that it can only time the single pipeline element passed to it on its command line. Also, the shell's output gives milliseconds while /usr/bin/time only gives hundredths of a second.

    So at the efficiency level of wc -l, the cat makes a huge difference: 407 / 280 = 1.454, or 45.4% more realtime, and 775 / 280 = 2.768, or a whopping 177% more CPU used! That was on my random it-was-there-at-the-time test box.
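
The overhead percentages can be recomputed from the timing table above (0.407 s vs 0.280 s real, 0.775 s vs 0.280 s total CPU). This is only a small sketch of the arithmetic; percent_more is a hypothetical helper name.

```cpp
#include <cassert>

// Percentage by which `with_cat` exceeds `without_cat`.
// Both arguments must use the same unit (e.g. milliseconds).
double percent_more(double with_cat, double without_cat) {
    assert(without_cat > 0.0);
    return (with_cat / without_cat - 1.0) * 100.0;
}
```

With the figures from the table, percent_more(407, 280) comes out at roughly 45% and percent_more(775, 280) at roughly 177%.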

    I should add that there is at least one other significant difference between these styles of testing, and I can't say whether it is a benefit or fault; you have to decide this yourself:

    When you run cat big_file | /usr/bin/time my_program, your program is receiving input from a pipe, at precisely the pace sent by cat, and in chunks no larger than written by cat.

    When you run /usr/bin/time my_program < big_file, your program receives an open file descriptor to the actual file. Your program -- or in many cases the I/O libraries of the language in which it was written -- may take different actions when presented with a file descriptor referencing a regular file. It may use mmap(2) to map the input file into its address space, instead of using explicit read(2) system calls. These differences could have a far larger effect on your benchmark results than the small cost of running the cat binary.

    Of course it is an interesting benchmark result if the same program performs significantly differently between the two cases. It shows that, indeed, the program or its I/O libraries are doing something interesting, like using mmap(). So in practice it might be good to run the benchmarks both ways; perhaps discounting the cat result by some small factor to "forgive" the cost of running cat itself.
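
To tell which of the two cases a given run is in, a program can ask what kind of object its standard input refers to. The sketch below assumes a POSIX system and uses fstat(2); describe_fd, describe_stdin and require_regular_stdin are hypothetical names, not part of any benchmark in the original post.

```cpp
#include <cassert>
#include <string>
#include <sys/stat.h>
#include <unistd.h>

// Classify a file descriptor by the type that fstat(2) reports for it.
std::string describe_fd(int fd) {
    struct stat st;
    if (fstat(fd, &st) != 0) {
        return "unknown";           // e.g. the descriptor is not open
    }
    if (S_ISREG(st.st_mode)) {
        return "regular file";      // program < big_file
    }
    if (S_ISFIFO(st.st_mode)) {
        return "pipe";              // cat big_file | program
    }
    if (S_ISCHR(st.st_mode)) {
        return "character device";  // e.g. an interactive terminal
    }
    return "other";
}

std::string describe_stdin() {
    return describe_fd(STDIN_FILENO);
}

// Optional guard for a benchmark that is only meaningful with redirection.
// It has no effect when the program is compiled with NDEBUG.
void require_regular_stdin() {
    assert(describe_stdin() == "regular file");
}
```

Printing describe_stdin() to stderr at startup is a cheap way to record, next to each timing, whether the input came from a file or a pipe.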
