Bash tool to get nth line from a file

刺人心 2020-11-22 08:07

Is there a "canonical" way of doing that? I've been using head -n | tail -1 which does the trick, but I've been wondering if there's a Bash tool that speci

19 answers
  • 2020-11-22 08:50

    The other answers address the question directly. Here is a less direct solution, but a potentially more important idea, to provoke thought.

    Since line lengths are arbitrary, all the bytes of the file before the nth line need to be read. If you have a huge file or need to repeat this task many times, and this process is time-consuming, then you should seriously think about whether you should be storing your data in a different way in the first place.

    The real solution is to have an index, e.g. at the start of the file, indicating the positions where the lines begin. You could use a database format, or just add a table at the start of the file. Alternatively create a separate index file to accompany your large text file.

    e.g. you might create a list of the character positions at which each line starts:

    awk 'BEGIN{c=0;print(c)}{c+=length()+1;print(c+1)}' file.txt > file.idx
    

    then read with tail, which actually seeks directly to the appropriate point in the file!

    e.g. to get line 1000:

    tail -c +$(awk 'NR==1000' file.idx) file.txt | head -1
    
    • This may not work with 2-byte / multibyte characters, since awk is "character-aware" but tail is not.
    • I haven't tested this against a large file.
    • Also see this answer.
    • Alternatively - split your file into smaller files!
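
The index idea above can be sketched as a pair of shell functions. This is a minimal sketch, not a tested tool: build_index and indexed_line are hypothetical names, the index is assumed to live next to the data file as FILE.idx, lines are assumed to end in a single newline, and awk is run under LC_ALL=C on the assumption that length() then counts bytes, so the offsets match what tail -c expects. Offsets here start at 1, the origin that tail -c +N counts from.

```shell
#!/usr/bin/env bash
# Hypothetical helpers: build_index and indexed_line are illustrative names.

# Write the 1-based byte offset of the start of every line to FILE.idx,
# one offset per row. LC_ALL=C is assumed to make length() count bytes.
build_index() {
    LC_ALL=C awk 'BEGIN { off = 1 } { print off; off += length($0) + 1 }' "$1" > "$1.idx"
}

# Print line N of FILE by looking up its byte offset in FILE.idx.
indexed_line() {
    local file=$1 n=$2 off
    off=$(awk -v n="$n" 'NR == n { print; exit }' "$file.idx")
    [ -n "$off" ] || return 1          # N is beyond the last indexed line
    tail -c "+$off" "$file" | head -n 1
}
```

Usage might look like: build_index file.txt once, then indexed_line file.txt 1000 for each lookup. The index has to be rebuilt whenever the data file changes.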
  • 2020-11-22 08:51
    # print line number 52
    sed '52!d' file
    

    Useful one-line scripts for sed

  • 2020-11-22 08:54

    The fastest solution for big files is always tail | head, provided that two distances are known:

    • the distance from the start of the file to the starting line. Let's call it S
    • the distance from the last line to the end of the file. Call it E

    Then we could use this:

    mycount="$E"; (( E > S )) && mycount="+$S"
    howmany="$(( endline - startline + 1 ))"
    tail -n "$mycount"| head -n "$howmany"
    

    howmany is just the count of lines required.

    Some more detail in https://unix.stackexchange.com/a/216614/79743
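
A sketch of how the snippet above might be wrapped into a function. The name print_range and the extra TOTAL argument (the file's line count, assumed to be known already) are assumptions. E is taken here to be the number of lines from the starting line through the end of the file, which is what tail -n "$E" needs in order to begin at the starting line.

```shell
#!/usr/bin/env bash
# print_range FILE STARTLINE ENDLINE TOTAL  (hypothetical helper)
print_range() {
    local file=$1 startline=$2 endline=$3 total=$4
    local S=$startline                      # lines from the top down to the start line
    local E=$(( total - startline + 1 ))    # lines from the start line to the end
    local mycount=$E                        # default: count back from the end
    if (( E > S )); then
        mycount="+$S"                       # start line is nearer the top: count forward
    fi
    local howmany=$(( endline - startline + 1 ))
    tail -n "$mycount" "$file" | head -n "$howmany"
}
```

For a 100-line file, print_range file 10 12 100 would go through tail -n +10, while print_range file 95 97 100 would go through tail -n 6.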

  • 2020-11-22 08:56

    According to my tests, in terms of performance and readability my recommendation is:

    tail -n+N | head -1

    N is the line number that you want. For example, tail -n+7 input.txt | head -1 will print the 7th line of the file.

    tail -n+N will print everything starting from line N, and head -1 will make it stop after one line.


    The alternative head -N | tail -1 is perhaps slightly more readable. For example, this will print the 7th line:

    head -7 input.txt | tail -1
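
Both pipelines can be wrapped in small functions to compare them side by side. The names line_tail_head and line_head_tail are hypothetical. One behavioural difference is worth knowing: when N is larger than the number of lines in the file, tail -n +N | head -1 prints nothing, while head -N | tail -1 prints the last line of the file.

```shell
#!/usr/bin/env bash
# Both helpers take FILE and N (hypothetical names, for illustration only).

# Skip to line N, then stop after one line.
line_tail_head() { tail -n "+$2" "$1" | head -n 1; }

# Keep the first N lines, then take the last of them.
line_head_tail() { head -n "$2" "$1" | tail -n 1; }
```

If a missing line should yield empty output rather than the file's last line, the tail | head form is the safer of the two.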

    For small files there is little difference in performance, but head | tail is outperformed by tail | head (above) once the files become huge.

    The top-voted sed 'NUMq;d' is interesting to know, but I would argue that it will be understood by fewer people out of the box than the head/tail solution and it is also slower than tail/head.

    In my tests, both tail/head versions outperformed sed 'NUMq;d' consistently. That is in line with the other benchmarks that were posted. It is hard to find a case where tail/head was really bad. It is also not surprising, as these are operations that you would expect to be heavily optimized in a modern Unix system.

    To get an idea about the performance differences, these are the numbers that I get for a huge file (9.3G):

    • tail -n+N | head -1: 3.7 sec
    • head -N | tail -1: 4.6 sec
    • sed Nq;d: 18.8 sec

    Results may differ, but the performance of head | tail and tail | head is, in general, comparable for smaller inputs, and sed is always slower by a significant factor (around 5x or so).

    To reproduce my benchmark, you can try the following, but be warned that it will create a 9.3G file in the current working directory:

    #!/bin/bash
    readonly file=tmp-input.txt
    readonly size=1000000000
    readonly pos=500000000
    readonly retries=3
    
    seq 1 $size > $file
    echo "*** head -N | tail -1 ***"
    for i in $(seq 1 $retries) ; do
        time head "-$pos" $file | tail -1
    done
    echo "-------------------------"
    echo
    echo "*** tail -n+N | head -1 ***"
    echo
    
    seq 1 $size > $file
    ls -alhg $file
    for i in $(seq 1 $retries) ; do
        time tail -n+$pos $file | head -1
    done
    echo "-------------------------"
    echo
    echo "*** sed Nq;d ***"
    echo
    
    seq 1 $size > $file
    ls -alhg $file
    for i in $(seq 1 $retries) ; do
        time sed $pos'q;d' $file
    done
    /bin/rm $file
    

    Here is the output of a run on my machine (ThinkPad X1 Carbon with an SSD and 16G of memory). I assume in the final run everything will come from the cache, not from disk:

    *** head -N | tail -1 ***
    500000000
    
    real    0m9,800s
    user    0m7,328s
    sys     0m4,081s
    500000000
    
    real    0m4,231s
    user    0m5,415s
    sys     0m2,789s
    500000000
    
    real    0m4,636s
    user    0m5,935s
    sys     0m2,684s
    -------------------------
    
    *** tail -n+N | head -1 ***
    
    -rw-r--r-- 1 phil 9,3G Jan 19 19:49 tmp-input.txt
    500000000
    
    real    0m6,452s
    user    0m3,367s
    sys     0m1,498s
    500000000
    
    real    0m3,890s
    user    0m2,921s
    sys     0m0,952s
    500000000
    
    real    0m3,763s
    user    0m3,004s
    sys     0m0,760s
    -------------------------
    
    *** sed Nq;d ***
    
    -rw-r--r-- 1 phil 9,3G Jan 19 19:50 tmp-input.txt
    500000000
    
    real    0m23,675s
    user    0m21,557s
    sys     0m1,523s
    500000000
    
    real    0m20,328s
    user    0m18,971s
    sys     0m1,308s
    500000000
    
    real    0m19,835s
    user    0m18,830s
    sys     0m1,004s
    
  • 2020-11-22 08:59
    sed -n '2p' < file.txt
    

    will print 2nd line

    sed -n '2011p' < file.txt
    

    2011th line

    sed -n '10,33p' < file.txt
    

    line 10 up to line 33

    sed -n '1p;3p' < file.txt
    

    1st and 3rd lines

    and so on...
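
The range forms above keep reading to the end of the file after the last wanted line has been printed. A minimal sketch of one way to stop early is to add a q at the end line. The function name print_lines is hypothetical, and both line numbers are assumed to be positive integers with FROM not greater than TO.

```shell
#!/usr/bin/env bash
# print_lines FILE FROM TO  (hypothetical helper)
print_lines() {
    local file=$1 from=$2 to=$3
    # -n suppresses automatic printing; "FROM,TOp" prints the range,
    # and "TOq" quits as soon as line TO has been processed.
    sed -n "${from},${to}p;${to}q" "$file"
}
```

For example, print_lines file.txt 10 33 prints the same lines as sed -n '10,33p' < file.txt but does not read past line 33.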

    For adding lines with sed, you can check this:

    sed: insert a line in a certain position

  • 2020-11-22 08:59

    To print the nth line using sed with a variable as the line number:

    a=4
    sed -e "${a}q;d" file
    

    Here the -e flag adds the script that follows to the commands to be executed.
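
When the line number comes from a variable, it may be worth checking that it really is a positive integer before splicing it into the sed script, since anything else would be interpreted as sed commands. A minimal sketch under that assumption follows; the function name nth_line and its exit status 2 for bad input are hypothetical choices.

```shell
#!/usr/bin/env bash
# nth_line FILE N  (hypothetical helper)
nth_line() {
    local file=$1 n=$2
    # Accept only a positive decimal integer such as 4 or 2011.
    if ! [[ $n =~ ^[1-9][0-9]*$ ]]; then
        echo "nth_line: line number must be a positive integer: $n" >&2
        return 2
    fi
    # "Nq" prints line N and quits; "d" deletes every earlier line.
    sed "${n}q;d" "$file"
}
```

With a=4, nth_line file "$a" behaves like the sed command above, but rejects values such as an empty string or 1d.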
