Split file by vector of line numbers

礼貌的吻别 2021-01-27 03:41

I have a large file, about 10GB. I have a vector of line numbers which I would like to use to split the file. Ideally I would like to accomplish this using command-line utilities…

5 answers
  • 2021-01-27 04:01

    This might work for you:

    csplit -z file 2 5
    

    or, if you want to split at lines matching a regexp instead of at line numbers:

    csplit -z file /2/ /5/
    

    With the default values, the output files will be named xxnn where nn starts at 00 and is incremented by 1.

    N.B. The -z option (--elide-empty-files) suppresses empty output files.
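
When the vector lives in a shell variable, it can be handed to csplit as separate arguments. The following is only a sketch, assuming GNU csplit; the `part.` prefix, the scratch directory and the six-line sample file are made up for illustration, and the line numbers must be in ascending order.

```shell
# Work in a scratch directory so nothing existing is overwritten.
tmp=$(mktemp -d) && cd "$tmp" || exit 1

seq 1 6 > file          # sample input: six lines, "1" through "6"
vec="2 5"               # line numbers at which a new piece starts (ascending)

# -z drops empty pieces, -s silences the byte counts, -f sets the prefix.
# $vec is deliberately unquoted so that each number becomes its own argument.
csplit -z -s -f part. file $vec

wc -l part.*
```

With this input the pieces are part.00 (line 1), part.01 (lines 2 to 4) and part.02 (lines 5 and 6). A very long vector may run into the shell's argument-length limit, in which case one of the awk answers may be a better fit.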

  • 2021-01-27 04:04

    Here is a little awk that does the trick for you:

    awk -v v="2 5" 'BEGIN{v=" 1 "v" "}
                    index(v," "FNR" ") { close(f); f=FILENAME "." (++i) }
                    { print > f }' file
    

    This will create files of the form: file.1, file.2, file.3, ...
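
The spaces padded around both the vector and FNR are what keep, say, line 2 from matching inside "12". A small demonstration of that, as a sketch on a made-up 30-line file named `data` (the file name and the vector "12 25" are assumptions chosen for illustration):

```shell
tmp=$(mktemp -d) && cd "$tmp" || exit 1
seq 1 30 > data         # hypothetical 30-line sample file

awk -v v="12 25" 'BEGIN { v = " 1 " v " " }       # padded vector: " 1 12 25 "
     index(v, " " FNR " ") { close(f); f = FILENAME "." (++i) }
     { print > f }' data

wc -l data.1 data.2 data.3
```

Only lines 1, 12 and 25 start a new file, so data.1, data.2 and data.3 end up with 11, 13 and 6 lines respectively; lines 2 and 5 do not trigger a split even though their digits occur in the vector.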

  • 2021-01-27 04:09

    Using awk:

    $ awk -v v="2 5" '       # space-separated vector of indexes
    BEGIN {
        n=split(v,t)         # reshape vector to a hash
        for(i=1;i<=n;i++)
            a[t[i]]
        i=1                  # filename index
    }
    {
        if(NR in a) {        # file record counter in the vector
            close("file" i)  # close previous file
            i++              # increase filename index
        }
        print > ("file" i)   # output to file
    }' file
    

    Sample output:

    $ cat file2
    4 5 6
    7 8 9 
    10 11 12 
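
For a file as large as the one in the question, it may be worth checking that the pieces add up to the original again. A possible sanity check, sketched with the same splitting logic on a hypothetical five-line sample (the sample contents are an assumption, chosen to be consistent with the output shown above):

```shell
tmp=$(mktemp -d) && cd "$tmp" || exit 1
printf '%s\n' '1 2 3' '4 5 6' '7 8 9' '10 11 12' '13 14 15' > file

# Split at lines 2 and 5 into file1, file2, file3.
awk -v v="2 5" '
BEGIN { n = split(v, t); for (i = 1; i <= n; i++) a[t[i]]; i = 1 }
NR in a { close("file" i); i++ }
{ print > ("file" i) }' file

# Concatenate the pieces in numeric order and compare with the original.
i=1
while [ -e "file$i" ]; do
    cat "file$i"
    i=$((i + 1))
done | cmp -s - file && echo "pieces match the original"
```

The loop counts upward instead of relying on a glob such as file*, because a glob would sort file10 before file2 and would also pick up the original file.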
    
  • 2021-01-27 04:10

    Very slightly different from James's and kvantour's solutions: passing the vector to awk as a "file"

    vec="2 5"
    
    awk '
        NR == FNR {nr[$1]; next}
        FNR == 1 {filenum = 1; f = FILENAME "." filenum}
        FNR in nr {
            close(f)
            f = FILENAME "." ++filenum
        }
        {print > f}
    ' <(printf "%s\n" $vec) file
    
    $ ls -l file file.*
    -rw-r--r-- 1 glenn glenn 48 Jul 17 10:02 file
    -rw-r--r-- 1 glenn glenn  7 Jul 17 10:09 file.1
    -rw-r--r-- 1 glenn glenn 23 Jul 17 10:09 file.2
    -rw-r--r-- 1 glenn glenn 18 Jul 17 10:09 file.3
    
  • 2021-01-27 04:20

    Ok, I've gone totally mental this morning, and I came up with a Sed program (with functions, loops, and all) that generates a Sed script to do what you want.

    Usage:

    • put the script in a file (e.g. make.sed) and chmod +x it;
    • then use it as the script for this Sed command sed "$(./make.sed <<< '1 4')" inputfile¹

    Note that ./make.sed <<< '1 4' generates the following sed script:

    1,1{w file.1
    be};1,4{w file.2
    be};1,${w file.3
    be};:e
    

    ¹ Unfortunately I misread the question, so my script expects the line number of the last line of each block to be written to a file; your 2 5 therefore has to be changed to 1 4 before being fed to my script.

    #!/usr/bin/env -S sed -Ef
    
    ###########################################################
    # Main
    # make a template sed script, in which we only have to increase
    # the number of each numbered output file, each of which is marked
    # with a trailing \x0
    b makeSkeletonAndMarkNumbers
    :skeletonMade
    
    # try putting a stencil on the rightmost digit of the first marked number on
    # the line and loop, otherwise exit
    b stencilLeastDigitOfNextMarkedNumber
    :didStencilLeastDigitOfNextMarkedNumber?
    t nextNumberStenciled
    b exit
    
    # continue processing next number by adding 1
    :nextNumberStenciled
    b numberAdd1
    :numberAdded1
    
    # try putting a stencil on the rightmost digit of the next marked number on
    # the line and loop, otherwise we're done with the first marked number, we can
    # clean its marker, and we can loop
    b stencilNextNumber
    :didStencilNextNumber?
    t nextNumberStenciled
    b removeStencilAndFirstMarker
    :removeStencilAndFirstMarkerDone
    b stencilLeastDigitOfNextMarkedNumber
    
    ###########################################################
    # puts a \n on each side of the first digit marked on the right by \x0
    :stencilLeastDigitOfNextMarkedNumber
    tr
    :r
    s/([0-9])\x0;/\n\1\n\x0;/1
    b didStencilLeastDigitOfNextMarkedNumber?
    
    ###########################################################
    # makes desired sed script skeleton from space-separated numbers
    :makeSkeletonAndMarkNumbers
    s/$/ $/
    s/([0-9]+|\$) +?/1,\1{w file.0\x0;be};/g
    s/$/:e/
    b skeletonMade
    
    ###########################################################
    # moves the stencil to the next number followed by \x0
    :stencilNextNumber
    trr
    :rr
    s/\n(.)\n([^\x0]*\x0[^\x0]+)([0-9])\x0/\1\2\n\3\n\x0/
    b didStencilNextNumber?
    
    ###########################################################
    # +1 with carry to last digit on the line enclosed in between two \n characters
    :numberAdd1
    #i\
    #\nbefore the sum:
    #l
    :digitPlus1
    h
    s/.*\n([0-9])\n.*/\1/
    y/0123456789/1234567890/
    G
    s/(.)\n(.*)\n.\n/\2\n\1\n/
    trrr
    :rrr
    /[0-9]\n0\n/s/(.)\n0\n/\n\1\n0/
    t digitPlus1
    # the following line can be problematic for lines starting with number
    /[^0-9]\n0\n/s/(.)\n0\n/\n\1\n10/
    b numberAdded1
    
    ###########################################################
    # remove stencil and first marker on line
    :removeStencilAndFirstMarker
    s/\n(.)\n/\1/
    s/\x0//
    b removeStencilAndFirstMarkerDone
    
    ###########################################################
    :exit
    # a bit of post-processing: the `w` command has to be followed by the
    # filename and then by a newline, so we change the appropriate `;`s to `\n`.
    s/(\{[^;]+);/\1\n/g
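
For comparison, a much simpler generator can be written as a plain shell loop. This is not the Sed program above but a sketch of an alternative: it takes the start-line vector from the question directly (2 5, no conversion to 1 4) and emits one disjoint `from,to w file.N` range per piece, so no branching is needed. It assumes the vector is ascending and that its first entry is greater than 1; the variable names are made up for illustration.

```shell
tmp=$(mktemp -d) && cd "$tmp" || exit 1
seq 1 6 > inputfile     # sample input: six lines, "1" through "6"

starts="2 5"            # first line of each new piece, as in the question
script=""
i=1
from=1
for s in $starts; do
    # Each piece ends one line before the next one starts.
    script="${script}${from},$((s - 1))w file.$i
"
    from=$s
    i=$((i + 1))
done
# The last piece runs to the end of the input.
script="${script}${from},\$w file.$i"

printf '%s\n' "$script"     # show the generated sed script
sed -n "$script" inputfile
```

For 2 5 the generated script consists of the three lines `1,1w file.1`, `2,4w file.2` and `5,$w file.3`. Note that sed creates every `w` output file before it reads any input, so a very long vector may hit the limit on open files.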
    