Fastest way to print a single line in a file

礼貌的吻别 2021-02-02 12:44

I have to fetch one specific line out of a big file (1,500,000 lines), multiple times in a loop over multiple files. I was asking myself what would be the best option.

5 Answers
  • 2021-02-02 13:23

    Drop the useless use of cat and do:

    $ sed -n '1{p;q}' file
    

    This will quit the sed script after the line has been printed.


    Benchmarking script:

    #!/bin/bash
    
    TIMEFORMAT='%3R'
    n=25
    heading=('head -1 file' 'sed -n 1p file' "sed -n '1{p;q}' file" 'read line < file && echo $line')
    
    # files of up to a hundred million lines (decrease this if you're on a slow machine!)
    for (( j=1; j<=100000000; j=j*10 ))
    do
        echo "Lines in file: $j"
        # create file containing j lines
        seq 1 $j > file
        # initial read of file
        cat file > /dev/null
    
        for comm in {0..3}
        do
            avg=0
            echo
            echo ${heading[$comm]}    
            for (( i=1; i<=$n; i++ ))
            do
                case $comm in
                    0)
                        t=$( { time head -1 file > /dev/null; } 2>&1);;
                    1)
                        t=$( { time sed -n 1p file > /dev/null; } 2>&1);;
                    2)
                        t=$( { time sed -n '1{p;q}' file > /dev/null; } 2>&1);;
                    3)
                        t=$( { time read line < file && echo $line > /dev/null; } 2>&1);;
                esac
                avg=$avg+$t
            done
            echo "scale=3;($avg)/$n" | bc
        done
    done
    

    Just save as benchmark.sh and run bash benchmark.sh.

    Results:

    head -1 file
    .001
    
    sed -n 1p file
    .048
    
    sed -n '1{p;q}' file
    .002
    
    read line < file && echo $line
    0
    

    *Results from a file with 1,000,000 lines.*

    So the times for sed -n 1p will grow linearly with the length of the file but the timing for the other variations will be constant (and negligible) as they all quit after reading the first line:

    [Plot: average runtime of each command vs. number of lines in the file]

    Note: timings are different from original post due to being on a faster Linux box.
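
    The same quit trick generalizes to the original use case of fetching one arbitrary line rather than the first one. A minimal sketch (the line number 743 and the file name are illustrative only):

    n=743
    # Print only line $n of the file, then quit immediately so sed never
    # scans the remaining lines.
    sed -n "${n}{p;q}" file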

  • 2021-02-02 13:24

    How about avoiding pipes? Both sed and head support the filename as an argument, so you avoid piping through cat. I didn't measure it, but head should be faster on larger files since it stops reading after N lines (whereas sed goes through all of them, even if it doesn't print them, unless you specify the quit option as suggested above).

    Examples:

    sed -n '1{p;q}' /path/to/file
    head -n 1 /path/to/file
    

    Again, I didn't test the efficiency.
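
    awk (mentioned in another answer below) can take the same early-exit approach. A minimal sketch of the idea, also untested here, with line 20 as an arbitrary example:

    # Print the 20th line and stop reading the rest of the file.
    awk 'NR == 20 { print; exit }' /path/to/file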

  • 2021-02-02 13:33

    I have done extensive testing, and found that, if you want every line of a file:

    while IFS=$'\n' read LINE; do
      echo "$LINE"
    done < your_input.txt
    

    Is much, much faster than any other (Bash-based) method out there. All other methods (like sed) re-read the file each time, at least up to the matching line. If the file is 4 lines long you would get 1 -> 1,2 -> 1,2,3 -> 1,2,3,4 = 10 reads, whereas the while loop just maintains a position cursor (based on IFS), so it only does 4 reads in total.

    On a file with ~15k lines the difference is phenomenal: ~25-28 seconds (sed-based, extracting a specific line each time) versus ~0-1 seconds (while...read-based, reading through the file once).

    The above example also shows how to set IFS to a newline in a better way (with thanks to Peter from the comments below), which will hopefully fix some of the other issues sometimes seen when using while ... read ... in Bash.
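
    If you only need one particular line rather than every line, the same builtin approach can stop as soon as it reaches it. A rough sketch, assuming the target line number is held in a variable named target (the name and input file are illustrative):

    target=20
    i=0
    # Read line by line with the shell builtin and stop once the requested
    # line has been printed -- no external process is started.
    while IFS=$'\n' read -r LINE; do
      i=$((i + 1))
      if [ "$i" -eq "$target" ]; then
        echo "$LINE"
        break
      fi
    done < your_input.txt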

  • 2021-02-02 13:34

    If you are really just getting the very first line and are reading hundreds of files, then consider shell builtins instead of external commands: use read, which is a builtin in bash and ksh. This eliminates the overhead of process creation with awk, sed, head, etc.

    The other issue is doing timed performance analysis on I/O. The first time you open and read a file, its data is probably not cached in memory. However, if you run a second command on the same file, the data as well as the inode have been cached, so the timed results may be faster, pretty much regardless of the command you use. Plus, inodes can stay cached practically forever (they do on Solaris, for example, or at any rate for several days).

    For example, Linux caches everything and the kitchen sink, which is a good performance attribute, but it makes benchmarking problematic if you are not aware of the issue.

    All of this caching effect "interference" is both OS and hardware dependent.

    So: pick one file and read it with a command; now it is cached. Then run the same test command several dozen times. This samples the cost of the command and of child process creation, not your I/O hardware.
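
    A minimal sketch of that procedure (the file name and iteration count are arbitrary):

    # Warm the page cache with one throwaway read, then time repeated runs
    # so the result reflects command/process cost rather than disk I/O.
    cat file > /dev/null
    time for i in {1..50}; do
        head -n 1 file > /dev/null
    done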

    This is sed vs. read for 10 iterations of getting the first line of the same file, after reading the file once:

    sed: sed '1{p;q}' uopgenl20121216.lis

    real    0m0.917s
    user    0m0.258s
    sys     0m0.492s
    

    read: read foo < uopgenl20121216.lis ; export foo; echo "$foo"

    real    0m0.017s
    user    0m0.000s
    sys     0m0.015s
    

    This is clearly contrived, but it does show the difference between builtin performance and using an external command.
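
    Applied to the original question (one line from each of many files), a hedged sketch using only the builtin; the *.lis glob is an assumption, not from the original answer:

    # Grab the first line of every matching file with the read builtin;
    # no head/sed/awk process is forked per file.
    for f in *.lis; do
        IFS= read -r first < "$f"
        echo "$f: $first"
    done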

  • 2021-02-02 13:34

    If you want to print only 1 line (say the 20th one) from a large file you could also do:

    head -20 filename | tail -1
    

    I did a "basic" test with bash and it seems to perform better than the sed -n '1{p;q}' solution above.

    The test takes a large file and prints a line from somewhere in the middle (line 10,000,000), repeated 100 times, selecting the next line each time. So it selects lines 10000000, 10000001, 10000002, ... and so on up to 10000099.

    $wc -l english
    36374448 english
    
    $time for i in {0..99}; do j=$((i+10000000));  sed -n $j'{p;q}' english >/dev/null; done;
    
    real    1m27.207s
    user    1m20.712s
    sys     0m6.284s
    

    vs.

    $time for i in {0..99}; do j=$((i+10000000));  head -$j english | tail -1 >/dev/null; done;
    
    real    1m3.796s
    user    0m59.356s
    sys     0m32.376s
    

    For printing a line out of each of multiple files:

    $wc -l english*
      36374448 english
      17797377 english.1024MB
       3461885 english.200MB
      57633710 total
    
    $time for i in english*; do sed -n '10000000{p;q}' $i >/dev/null; done; 
    
    real    0m2.059s
    user    0m1.904s
    sys     0m0.144s
    
    
    
    $time for i in english*; do head -10000000 $i | tail -1 >/dev/null; done;
    
    real    0m1.535s
    user    0m1.420s
    sys     0m0.788s
    