How to split a large text file into smaller files with equal number of lines?

一整个雨季 2020-11-22 16:42

I've got a large (by number of lines) plain text file that I'd like to split into smaller files, also by number of lines. So if my file has around 2M lines, I'd like to

10 Answers
  • 2020-11-22 17:28

    Yes, there is a split command. It will split a file by lines or bytes.

    $ split --help
    Usage: split [OPTION]... [INPUT [PREFIX]]
    Output fixed-size pieces of INPUT to PREFIXaa, PREFIXab, ...; default
    size is 1000 lines, and default PREFIX is `x'.  With no INPUT, or when INPUT
    is -, read standard input.
    
    Mandatory arguments to long options are mandatory for short options too.
      -a, --suffix-length=N   use suffixes of length N (default 2)
      -b, --bytes=SIZE        put SIZE bytes per output file
      -C, --line-bytes=SIZE   put at most SIZE bytes of lines per output file
      -d, --numeric-suffixes  use numeric suffixes instead of alphabetic
      -l, --lines=NUMBER      put NUMBER lines per output file
          --verbose           print a diagnostic just before each
                                output file is opened
          --help     display this help and exit
          --version  output version information and exit
    
    SIZE may have a multiplier suffix:
    b 512, kB 1000, K 1024, MB 1000*1000, M 1024*1024,
    GB 1000*1000*1000, G 1024*1024*1024, and so on for T, P, E, Z, Y.
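
    As a small illustration of the options listed above, the sketch below combines -l, -d and -a on a throwaway file. It assumes GNU split; the names input.txt and part_ are made up for the example.

    ```shell
    # Work in a scratch directory so the pieces do not clutter anything else.
    cd "$(mktemp -d)"

    # Build a 10-line sample file (input.txt is a made-up name).
    seq 1 10 > input.txt

    # 3 lines per piece, numeric suffixes, suffix length 3, prefix "part_".
    split -l 3 -d -a 3 input.txt part_

    # Produces part_000 .. part_003; the last piece holds the 1 leftover line.
    wc -l part_*
    ```

    Without -d the pieces would be named part_aaa, part_aab, and so on.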
    
  • 2020-11-22 17:28

    Use HDFS getmerge to merge the small files, then split the result into files of a suitable size.

    This method splits by byte count, so it can break a line in the middle:

    split -b 125m compact.file -d -a 3 compact_prefix
    

    Instead, I run getmerge and then split the result into files of about 128 MB each, keeping lines intact.

    # Split into pieces of about 128 MB. Assumes hdfs reports the size unit as M or G; please test before use.
    
    begainsize=$(hdfs dfs -du -s -h /externaldata/$table_name/$date/ | awk '{ print $1 }')
    sizeunit=$(hdfs dfs -du -s -h /externaldata/$table_name/$date/ | awk '{ print $2 }')
    if [ "$sizeunit" = "G" ]; then
        # 1 GB holds 8 pieces of 128 MB
        res=$(printf "%.f" "$(echo "scale=5;$begainsize*8" | bc)")
    else
        # Size is in MB; printf "%.f" rounds to the nearest integer. ref http://blog.csdn.net/naiveloafer/article/details/8783518
        res=$(printf "%.f" "$(echo "scale=5;$begainsize/128" | bc)")
    fi
    echo "$res"
    # Split into $res files with numeric suffixes, without breaking lines. ref http://blog.csdn.net/microzone/article/details/52839598
    compact_file_name=$compact_file"_"
    echo "compact_file_name :"$compact_file_name
    split -n l/$res $basedir/$compact_file -d -a 3 $basedir/${compact_file_name}
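
    The piece count can also be computed with plain integer arithmetic once the merged file is on local disk. The sketch below is a local stand-in, not the HDFS script above: it assumes GNU split, reads the size with wc -c instead of hdfs dfs -du, and uses a tiny chunk_bytes so it can be tried on a small file.

    ```shell
    cd "$(mktemp -d)"

    # Stand-in for the merged file (the real one would come from getmerge).
    seq 1 2000 > compact.file

    # Target size per piece; use 134217728 (128 MiB) for real data.
    chunk_bytes=2048

    # File size in bytes, then ceiling division so a non-empty file gives at least 1 piece.
    size=$(wc -c < compact.file)
    res=$(( (size + chunk_bytes - 1) / chunk_bytes ))
    echo "size=$size res=$res"

    # Split into $res pieces on line boundaries, with 3-digit numeric suffixes.
    split -n l/"$res" -d -a 3 compact.file compact_prefix
    ls compact_prefix*
    ```

    Because l/N never cuts a line, individual pieces can end up somewhat larger or smaller than chunk_bytes.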
    
  • 2020-11-22 17:38

    Split the file "file.txt" into files of 10000 lines each:

    split -l 10000 file.txt
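
    If the goal is a given number of files rather than a given number of lines, the line count per file can be derived first. This is a minimal sketch; the variable pieces, the prefix chunk_ and the 25-line sample are assumptions made for the example.

    ```shell
    cd "$(mktemp -d)"

    # Stand-in for the large file.
    seq 1 25 > file.txt

    # Hypothetical target number of output files.
    pieces=4

    # Count lines, then use ceiling division so no lines are left over.
    total=$(wc -l < file.txt)
    per_file=$(( (total + pieces - 1) / pieces ))

    split -l "$per_file" file.txt chunk_
    wc -l chunk_*
    ```

    The last file holds whatever remains, and for some inputs the rounding yields fewer than the requested number of files (9 lines with pieces=4 gives 3 files of 3 lines).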
    
  • 2020-11-22 17:38

    split (from GNU coreutils, since version 8.8 from 2010-12-22) includes the following parameter:

    -n, --number=CHUNKS     generate CHUNKS output files; see explanation below
    
    CHUNKS may be:
      N       split into N files based on size of input
      K/N     output Kth of N to stdout
      l/N     split into N files without splitting lines/records
      l/K/N   output Kth of N to stdout without splitting lines/records
      r/N     like 'l' but use round robin distribution
      r/K/N   likewise but only output Kth of N to stdout
    

    Thus, split -n 4 input output. will generate four files (output.a{a,b,c,d}) with roughly the same number of bytes, but lines might be broken in the middle.

    If we want to preserve full lines (i.e. split by lines), then this should work:

    split -n l/4 input output.
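
    The difference between the two modes is easy to see on a small file. The sketch below assumes GNU split 8.8 or later; the prefixes bytes. and lines. are made up for the comparison.

    ```shell
    cd "$(mktemp -d)"

    # Six short lines, 36 bytes in total.
    printf 'alpha\nbeta\ngamma\ndelta\nepsilon\nzeta\n' > input

    split -n 4 input bytes.      # four equal byte ranges, may cut inside a line
    split -n l/4 input lines.    # four files, whole lines only

    # Print each piece under a header to compare where the cuts fall.
    head -n 100 bytes.* lines.*

    # In both modes, concatenating the pieces in order gives back the input.
    cat bytes.* | cmp - input && echo "bytes.* reassemble to input"
    cat lines.* | cmp - input && echo "lines.* reassemble to input"
    ```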
    

    Related answer: https://stackoverflow.com/a/19031247
