Count lines in large files

挽巷 2020-12-02 08:53

I commonly work with text files of ~20 GB in size, and I very often find myself counting the number of lines in a given file.

The way I do it now, it's just cat fn

13 answers
  • 2020-12-02 09:26

    Try: sed -n '$=' filename

    Also, cat is unnecessary: wc -l filename on its own does the same job as your current approach.
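
    One difference worth knowing between the two commands: wc -l counts newline characters, while sed -n '$=' prints the number of the last line, so they disagree by one when the file's final line has no trailing newline. A minimal Python sketch of the two definitions (the function names are made up for illustration):

```python
def count_newlines(data: bytes) -> int:
    """Count newline bytes, which is what `wc -l` reports."""
    return data.count(b"\n")


def count_lines(data: bytes) -> int:
    """Count lines, including a final line that lacks a trailing newline."""
    n = data.count(b"\n")
    if data and not data.endswith(b"\n"):
        n += 1  # unterminated last line still counts as a line
    return n


terminated = b"a\nb\nc\n"
unterminated = b"a\nb\nc"
print(count_newlines(terminated), count_lines(terminated))      # same value
print(count_newlines(unterminated), count_lines(unterminated))  # differ by one
```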

  • 2020-12-02 09:26

    Let us assume:

    • Your file system is distributed
    • Your file system can easily fill the network connection to a single node
    • You access your files like normal files

    then you really want to chop the files into parts, count parts in parallel on multiple nodes and sum up the results from there (this is basically @Chris White's idea).

    Here is how you do that with GNU Parallel (version > 20161222). You need to list the nodes in ~/.parallel/my_cluster_hosts and you must have ssh access to all of them:

    parwc() {
        # Usage:
        #   parwc -l file                                                                
    
        # Give one chunk per host
        chunks=$(wc -l < ~/.parallel/my_cluster_hosts)
        # Build commands that take a chunk each and do 'wc' on that                    
        # ("map")                                                                      
        parallel -j $chunks --block -1 --pipepart -a "$2" -vv --dryrun wc "$1" |
            # For each command                                                         
            #   log into a cluster host                                                
            #   cd to current working dir                                              
            #   execute the command                                                    
            parallel -j0 --slf my_cluster_hosts --wd . |
            # Sum up the number of lines                                               
            # ("reduce")                                                               
            perl -ne '$sum += $_; END { print $sum,"\n" }'
    }
    

    Use as:

    parwc -l myfile
    parwc -w myfile
    parwc -c myfile
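
    The same split, count, and sum structure can be sketched on one machine in Python. This is not what GNU Parallel does internally: threads stand in for cluster hosts, and parallel_line_count, count_newlines_in_range, and the workers parameter are names made up for this illustration. Because it counts newline bytes, the byte ranges do not need to be aligned to line boundaries.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def count_newlines_in_range(job):
    """Count newline bytes in the byte range [start, end) of a file ("map")."""
    path, start, end = job
    count = 0
    with open(path, "rb") as f:
        f.seek(start)
        remaining = end - start
        while remaining > 0:
            buf = f.read(min(1 << 20, remaining))  # read at most 1 MiB at a time
            if not buf:
                break
            count += buf.count(b"\n")
            remaining -= len(buf)
    return count


def parallel_line_count(path, workers=4):
    """Split the file into byte ranges, count each concurrently, sum ("reduce")."""
    size = os.path.getsize(path)
    step = max(1, -(-size // workers))  # ceiling division, at least 1 byte
    jobs = [(path, s, min(s + step, size)) for s in range(0, size, step)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(count_newlines_in_range, jobs))
```

    Threads keep the sketch simple, but the counting itself may not speed up much under CPython's interpreter lock, so treat this as an illustration of the structure rather than a benchmark.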
    
  • 2020-12-02 09:28

    In my test, spark-shell (Scala-based) was considerably faster than the other tools (grep, sed, awk, perl, wc). Here are the results from a file with 23782409 lines:

    time grep -c $ my_file.txt;
    

    real 0m44.96s user 0m41.59s sys 0m3.09s

    time wc -l my_file.txt;
    

    real 0m37.57s user 0m33.48s sys 0m3.97s

    time sed -n '$=' my_file.txt;
    

    real 0m38.22s user 0m28.05s sys 0m10.14s

    time perl -ne 'END { $_=$.;if(!/^[0-9]+$/){$_=0;};print "$_" }' my_file.txt;

    real 0m23.38s user 0m20.19s sys 0m3.11s

    time awk 'END { print NR }' my_file.txt;
    

    real 0m19.90s user 0m16.76s sys 0m3.12s

    spark-shell
    import org.joda.time._
    val t_start = DateTime.now()
    sc.textFile("file://my_file.txt").count()
    val t_end = DateTime.now()
    new Period(t_start, t_end).toStandardSeconds()
    

    res1: org.joda.time.Seconds = PT15S

  • 2020-12-02 09:29
    find . -type f -name "filepattern_2015_07_*.txt" -exec wc -l {} +
    

    Output:

  • 2020-12-02 09:31

    If your data resides on HDFS, perhaps the fastest approach is to use Hadoop streaming. Apache Pig's COUNT UDF operates on a bag and therefore uses a single reducer to compute the number of rows. Instead, you can manually set the number of reducers in a simple Hadoop streaming script as follows:

    $HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar -Dmapred.reduce.tasks=100 -input <input_path> -output <output_path> -mapper /bin/cat -reducer "wc -l"
    

    Note that I manually set the number of reducers to 100, but you can tune this parameter. Once the map-reduce job is done, the result from each reducer is stored in a separate file. The final row count is the sum of the numbers returned by all reducers, which you can get as follows:

    $HADOOP_HOME/bin/hadoop fs -cat <output_path>/* | paste -sd+ | bc
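
    If bc is not available, the final summing step could also be done in Python. A minimal sketch, assuming each reducer output line holds a single integer (sum_counts is a made-up name, and the sample numbers are invented for the demo):

```python
def sum_counts(lines):
    """Add up one integer per non-blank line, like `paste -sd+ | bc`."""
    # int() tolerates surrounding whitespace, so padded or tab-terminated
    # numbers are accepted as well.
    return sum(int(line) for line in lines if line.strip())


# Invented per-reducer counts, one per line, as they might arrive on stdin.
sample = ["120\n", "98\n", "\n", "7\n"]
print(sum_counts(sample))  # 225
```

    To use it on the real output, pass sys.stdin to sum_counts and pipe the hadoop fs -cat output into the script.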
    
  • 2020-12-02 09:32

    I'm not sure that Python is quicker:

    [root@myserver scripts]# time python -c "print len(open('mybigfile.txt').read().split('\n'))"
    
    644306
    
    
    real    0m0.310s
    user    0m0.176s
    sys     0m0.132s
    
    [root@myserver scripts]# time  cat mybigfile.txt  | wc -l
    
    644305
    
    
    real    0m0.048s
    user    0m0.017s
    sys     0m0.074s
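
    For what it's worth, the extra line in the Python result is consistent with split('\n') returning one more element than there are newlines when the file ends with a newline. Also, read() loads the whole file into memory, which may not be practical for a ~20 GB file. A minimal Python 3 sketch that reads fixed-size chunks instead (count_newlines_in_file and chunk_size are illustrative names, not from the session above):

```python
def count_newlines_in_file(path, chunk_size=1 << 20):
    """Count newline bytes in a file, reading it in fixed-size chunks."""
    count = 0
    with open(path, "rb") as f:
        while True:
            buf = f.read(chunk_size)
            if not buf:  # empty read means end of file
                break
            count += buf.count(b"\n")
    return count
```

    This counts the same thing as wc -l (newline characters), so a final line without a trailing newline is not included.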
    