I commonly work with text files of ~20 GB in size, and I find myself counting the number of lines in a given file very often.
The way I do it now is just cat fn | wc -l.
Try: sed -n '$=' filename
Also, cat is unnecessary:
wc -l filename
is enough for what you are doing now.
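A small practical note to add here: wc -l file prints the filename after the count, so if you only want the number (e.g. inside a script), read the file on stdin instead:
wc -l filename      # prints: <count> filename
wc -l < filename    # prints only the count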
Let us assume your file sits on storage that several nodes can reach and that a single node cannot read it fast enough by itself; then you really want to chop the file into parts, count the parts in parallel on multiple nodes, and sum up the results (this is basically @Chris White's idea).
Here is how you do that with GNU Parallel (version > 20161222). You need to list the nodes in ~/.parallel/my_cluster_hosts and you must have ssh access to all of them.
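For reference, ~/.parallel/my_cluster_hosts is just a plain text file with one ssh login (hostname) per line; the names below are made up:
node1.example.com
node2.example.com
node3.example.com
With that in place, define the helper function: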
parwc() {
    # Usage:
    #   parwc -l file
    # Give one chunk per host
    chunks=$(cat ~/.parallel/my_cluster_hosts | wc -l)
    # Build commands that each take a chunk and run 'wc' on it
    # ("map")
    parallel -j $chunks --block -1 --pipepart -a "$2" -vv --dryrun wc "$1" |
        # For each command:
        #   log into a cluster host
        #   cd to the current working dir
        #   execute the command
        parallel -j0 --slf my_cluster_hosts --wd . |
        # Sum up the number of lines
        # ("reduce")
        perl -ne '$sum += $_; END { print $sum,"\n" }'
}
Use as:
parwc -l myfile
parwc -w myfile
parwc -c myfile
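If you only have one machine but many cores, the same map/reduce idea works locally. A minimal sketch, assuming GNU Parallel is installed and my_file.txt stands in for your file (the 100M block size is just a starting point):
parallel --pipepart -a my_file.txt --block 100M wc -l |
    awk '{ sum += $1 } END { print sum }'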
In my test, the Spark shell (Scala-based) was considerably faster than the other tools (grep, sed, awk, perl, wc). Here are the results of the test I ran on a file with 23,782,409 lines:
time grep -c $ my_file.txt;
real 0m44.96s user 0m41.59s sys 0m3.09s
time wc -l my_file.txt;
real 0m37.57s user 0m33.48s sys 0m3.97s
time sed -n '$=' my_file.txt;
real 0m38.22s user 0m28.05s sys 0m10.14s
time perl -ne 'END { $_=$.;if(!/^[0-9]+$/){$_=0;};print "$_" }' my_file.txt;
real 0m23.38s user 0m20.19s sys 0m3.11s
time awk 'END { print NR }' my_file.txt;
real 0m19.90s user 0m16.76s sys 0m3.12s
spark-shell
import org.joda.time._
val t_start = DateTime.now()
sc.textFile("file://my_file.txt").count()
val t_end = DateTime.now()
new Period(t_start, t_end).toStandardSeconds()
res1: org.joda.time.Seconds = PT15S
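One caveat worth adding to comparisons like this (it is not part of the original test): after the first run the file is probably sitting in the OS page cache, so later runs get an unfair advantage. On Linux you can drop the cache between runs, assuming root access:
sync
echo 3 | sudo tee /proc/sys/vm/drop_caches   # drop page cache, dentries and inodes
time wc -l my_file.txt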
find . -type f -name "filepattern_2015_07_*.txt" |
    awk '{ printf "%s ", $0; system("wc -l < " $0) }'
Output: each matching filename followed by its line count.
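A simpler variant of the same idea (my addition, not part of the answer above) is to let find hand the matching files straight to wc, which prints a per-file count and, when several files are passed in one batch, a total line as well:
find . -type f -name "filepattern_2015_07_*.txt" -exec wc -l {} +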
If your data resides on HDFS, perhaps the fastest approach is to use Hadoop Streaming. Apache Pig's COUNT UDF operates on a bag and therefore uses a single reducer to compute the number of rows. Instead, you can manually set the number of reducers in a simple Hadoop Streaming job as follows:
$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar -Dmapred.reduce.tasks=100 -input <input_path> -output <output_path> -mapper /bin/cat -reducer "wc -l"
Note that I manually set the number of reducers to 100 here, but you can tune this parameter. Once the MapReduce job is done, each reducer writes its result to a separate file, and the final row count is the sum of the numbers returned by all reducers. You can get it as follows:
$HADOOP_HOME/bin/hadoop fs -cat <output_path>/* | paste -sd+ | bc
I'm not sure that Python is quicker:
[root@myserver scripts]# time python -c "print len(open('mybigfile.txt').read().split('\n'))"
644306
real 0m0.310s
user 0m0.176s
sys 0m0.132s
[root@myserver scripts]# time cat mybigfile.txt | wc -l
644305
real 0m0.048s
user 0m0.017s
sys 0m0.074s
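For what it's worth, the Python one-liner above also reads the entire file into memory (a problem for 20 GB files) and over-counts by one when the file ends with a newline, which is why it reports 644306 against wc's 644305. A streaming variant (memory-safe, though not necessarily faster; the filename is just a placeholder) would be:
time python -c "import sys; print(sum(1 for _ in open(sys.argv[1], 'rb')))" mybigfile.txt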