I commonly work with text files of ~20 GB in size, and I find myself counting the number of lines in a given file very often.
The way I do it now is just cat fn | wc -l, and it takes very long.
Hadoop is essentially providing a mechanism to perform something similar to what @Ivella is suggesting.
Hadoop's HDFS (distributed file system) will take your 20 GB file and store it across the cluster in blocks of a fixed size. Let's say you configure the block size to be 128 MB; the file would then be split into 20 GB / 128 MB = 160 blocks.
You would then run a map reduce program over this data, essentially counting the lines for each block (in the map stage) and then reducing these block line counts into a final line count for the entire file.
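A rough sketch of that job using Hadoop Streaming (the jar path varies per installation, and the HDFS paths are placeholders); here the reduce step is replaced by a local sum of the per-split counts, to keep the command simple:
# each map task runs wc -l over its input split; no reduce step
hadoop jar "$HADOOP_HOME"/share/hadoop/tools/lib/hadoop-streaming-*.jar \
    -D mapreduce.job.reduces=0 \
    -input  /hdfs/path/to/bigfile \
    -output /hdfs/path/to/linecounts \
    -mapper 'wc -l'
# sum the per-split counts
hdfs dfs -cat /hdfs/path/to/linecounts/part-* | paste -sd+ - | bc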
As for performance: in general, the bigger your cluster, the better the performance (more wc's running in parallel, over more independent disks), but there is some overhead in job orchestration, which means that running the job on smaller files will not actually yield quicker throughput than running a local wc.
On a multi-core server, use GNU parallel to count file lines in parallel. After each file's line count is printed, bc sums all the line counts.
find . -name '*.txt' | parallel 'wc -l {}' 2>/dev/null | paste -sd+ - | bc
To save space, you can even keep all files compressed. The following line uncompresses each file and counts its lines in parallel, then sums all counts.
find . -name '*.xz' | parallel 'xzcat {} | wc -l' 2>/dev/null | paste -sd+ - | bc
If your bottleneck is the disk, it matters how you read from it. dd if=filename bs=128M | wc -l is a lot faster than wc -l filename or cat filename | wc -l on my machine, which has an HDD and a fast CPU and RAM. You can play around with the block size and see what dd reports as the throughput. I cranked it up to 1 GiB.
Note: there is some debate about whether cat or dd is faster. All I claim is that dd can be faster, depending on the system, and that it is for me. Try it for yourself.
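To compare the two on your own hardware, a minimal timing sketch (bigfile.txt is a placeholder name):
# time both approaches; adjust bs to suit your disk
time dd if=bigfile.txt bs=1G 2>/dev/null | wc -l
time wc -l bigfile.txt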
If your computer has python, you can try this from the shell:
python -c "print len(open('test.txt').read().split('\n'))"
This uses python -c to pass in a command, which basically reads the file and splits it on newlines to get the line count.
To verify, here is @BlueMoon's answer:
bash-3.2$ sed -n '$=' test.txt
519
Using the above:
bash-3.2$ python -c "print len(open('test.txt').read().split('\n'))"
519
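For a ~20 GB file, note that this one-liner reads the whole file into memory; a memory-friendlier sketch (assuming Python 3 is on the PATH) streams the file line by line instead:
# counts lines without loading the whole file into RAM
python3 -c "print(sum(1 for _ in open('test.txt', 'rb')))"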
I know the question is a few years old now, but expanding on Ivella's last idea, this bash script estimates the line count of a big file within seconds or less by measuring the size of one line and extrapolating from it:
#!/bin/bash
# Estimate the line count of "$1" from the size of a single sample line.
head -2 "$1" | tail -1 > "${1}_oneline"     # sample the second line (the first may be a header)
filesize=$(du -b "$1" | cut -f1)            # file size in bytes
linesize=$(du -b "${1}_oneline" | cut -f1)  # sample line size in bytes
rm "${1}_oneline"
echo $((filesize / linesize))
If you name this script lines.sh, you can call lines.sh bigfile.txt to get the estimated number of lines. In my case (about 6 GB, exported from a database), the deviation from the true line count was only 3%, and it ran about 1000 times faster. By the way, I used the second line, not the first, as the basis, because the first line had column names and the actual data started on the second line.
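If one line is not representative, a possible refinement is to average over a larger sample; a minimal sketch (bigfile.txt and the sample size of 1000 lines are assumptions, not from the original answer):
# estimate: file size divided by the mean size of the first 1000 lines
samplesize=$(head -1000 bigfile.txt | wc -c)   # bytes in the first 1000 lines
filesize=$(du -b bigfile.txt | cut -f1)        # total file size in bytes
echo $(( filesize * 1000 / samplesize ))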
Your limiting speed factor is the I/O speed of your storage device, so changing between simple newline/pattern-counting programs won't help, because the execution-speed difference between those programs is likely to be dwarfed by the much slower disk/storage/whatever you have.
But if you have the same file copied across disks/devices, or the file is distributed among those disks, you can certainly perform the operation in parallel. I don't know much about Hadoop specifically, but assuming you can read a 10 GB file from 4 different locations, you can run 4 different line-counting processes, each on one part of the file, and sum their results up:
$ dd bs=4k count=655360 if=/path/to/copy/on/disk/1/file | wc -l &
$ dd bs=4k skip=655360 count=655360 if=/path/to/copy/on/disk/2/file | wc -l &
$ dd bs=4k skip=1310720 count=655360 if=/path/to/copy/on/disk/3/file | wc -l &
$ dd bs=4k skip=1966080 if=/path/to/copy/on/disk/4/file | wc -l &
Notice the & at the end of each command line, so all of them run in parallel; dd works like cat here, but lets us specify how many bytes to read (count * bs bytes) and how many to skip at the beginning of the input (skip * bs bytes). It works in blocks, hence the need to specify bs as the block size. In this example, I've partitioned the 10 GB file into 4 equal chunks of 4 kB * 655360 = 2684354560 bytes = 2.5 GB, one given to each job; you may want to set up a script to do this for you based on the size of the file and the number of parallel jobs you will run. You also need to sum the results of the executions, which I haven't done, for lack of shell-scripting ability.
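A minimal sketch of that summing step, reusing the placeholder paths and offsets from the example above and the paste/bc trick from the GNU parallel answer:
#!/bin/bash
# run the four partial counts in parallel, then add the four numbers together
{
  dd bs=4k count=655360               if=/path/to/copy/on/disk/1/file 2>/dev/null | wc -l &
  dd bs=4k skip=655360  count=655360  if=/path/to/copy/on/disk/2/file 2>/dev/null | wc -l &
  dd bs=4k skip=1310720 count=655360  if=/path/to/copy/on/disk/3/file 2>/dev/null | wc -l &
  dd bs=4k skip=1966080               if=/path/to/copy/on/disk/4/file 2>/dev/null | wc -l &
  wait
} | paste -sd+ - | bc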
If your filesystem is smart enough to split a big file among many devices, like a RAID or a distributed filesystem, and to automatically parallelize I/O requests that can be parallelized, you can do such a split, running many parallel jobs but using the same file path, and you may still see some speed gain.
EDIT: Another idea that occurred to me: if the lines inside the file all have the same size, you can get the exact number of lines by dividing the size of the file by the size of one line, both in bytes. You can do it almost instantaneously in a single job. If you only have the mean line size and don't care about the exact line count, but want an estimate, you can do this same operation and get a satisfactory result much faster than the exact operation.
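A minimal sketch of that division, assuming GNU stat and a known (or assumed) line length of 80 bytes including the newline; the path and line size are placeholders:
filesize=$(stat -c%s /path/to/file)   # file size in bytes (GNU stat)
linesize=80                           # bytes per line, an assumption for illustration
echo $(( filesize / linesize ))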