What is a good/best way to count the number of characters, words, and lines of a text file using Perl (without using wc)?
Reading the file in fixed-size chunks may be more efficient than reading line-by-line. The wc
binary does this.
#!/usr/bin/env perl
use strict;
use warnings;
use constant BLOCK_SIZE => 16384;

for my $file (@ARGV) {
    open my $fh, '<', $file or do {
        warn "couldn't open $file: $!\n";
        next;    # skip to the next file
    };
    my ($chars, $words, $lines) = (0, 0, 0);
    my $in_word = 0;    # true if the previous chunk ended inside a word
    while (my $size = sysread $fh, my $buf, BLOCK_SIZE) {
        $chars += $size;                          # bytes, like wc -c
        $words += () = $buf =~ /\S+/g;            # runs of non-whitespace
        $words-- if $in_word && $buf =~ /\A\S/;   # word continued from previous chunk
        $lines += $buf =~ tr/\n//;                # newline characters, like wc -l
        $in_word = $buf =~ /\S\z/;
    }
    print "\t$lines\t$words\t$chars\t$file\n";
}
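The delicate part of reading in chunks is a word that straddles two chunks. Here is a minimal sketch of one way to handle that, run on an in-memory string so the block size can be varied freely. The helper name count_in_chunks and the sample text are made up for illustration; they are not part of the script above.

```perl
use strict;
use warnings;

# Hypothetical helper: count bytes, words, and newlines in $data,
# processing it $block bytes at a time to mimic chunked reads.
sub count_in_chunks {
    my ($data, $block) = @_;
    my ($chars, $words, $lines) = (0, 0, 0);
    my $in_word = 0;    # did the previous chunk end inside a word?
    for (my $pos = 0; $pos < length $data; $pos += $block) {
        my $chunk = substr $data, $pos, $block;
        $chars += length $chunk;
        $words += () = $chunk =~ /\S+/g;
        # A word straddling the boundary was counted in both chunks.
        $words-- if $in_word && $chunk =~ /\A\S/;
        $lines += $chunk =~ tr/\n//;
        $in_word = $chunk =~ /\S\z/ ? 1 : 0;
    }
    return ($chars, $words, $lines);
}

my $text = "The quick brown fox\njumps over the lazy dog\n";
for my $block (1, 4, 7, 16384) {
    my ($c, $w, $l) = count_in_chunks($text, $block);
    print "block=$block chars=$c words=$w lines=$l\n";
}
```

Every block size should report the same totals (44 bytes, 9 words, 2 newlines), since only the chunking changes, not the data.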
Here's the Perl code. Counting words can be somewhat subjective, but I just treat any run of non-whitespace characters as a word.
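use strict;
use warnings;

# Three-argument open with a lexical filehandle.
open(my $fh, '<', 'file.txt') or die "Could not open file: $!";
my ($lines, $words, $chars) = (0, 0, 0);
while (<$fh>) {
    $lines++;
    $chars += length($_);
    # split yields an empty leading field when the line starts with
    # whitespace, so count only the non-empty fields.
    $words += grep { length } split(/\s+/, $_);
}
close($fh);
print("lines=$lines words=$words chars=$chars\n");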
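One caveat about the character count: length counts bytes unless the filehandle decodes its input, so multi-byte text is over-counted. A small sketch of the difference, assuming UTF-8 input and using in-memory filehandles so it runs without a file on disk. The helper name count_handle and the sample string are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: count lines, words, and characters read from $fh.
sub count_handle {
    my ($fh) = @_;
    my ($lines, $words, $chars) = (0, 0, 0);
    while (my $line = <$fh>) {
        $lines++;
        $chars += length $line;
        $words += () = $line =~ /\S+/g;   # runs of non-whitespace
    }
    return ($lines, $words, $chars);
}

# "naïve café\n" encoded as UTF-8: 11 characters stored in 13 bytes.
my $bytes = "na\xc3\xafve caf\xc3\xa9\n";

open my $raw,  '<',                 \$bytes or die "open failed: $!";
open my $utf8, '<:encoding(UTF-8)', \$bytes or die "open failed: $!";

printf "raw:     lines=%d words=%d chars=%d\n", count_handle($raw);
printf "decoded: lines=%d words=%d chars=%d\n", count_handle($utf8);
```

The raw handle should report 13 for chars and the decoded handle 11. For a real file the same layer goes in the open mode, e.g. '<:encoding(UTF-8)', if the file is known to be UTF-8.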
A variation on bmdhacks' answer that will probably produce better results is to use \s+ (or, even better, \W+) as the delimiter. Consider the string "The  quick  brown fox" (with extra spaces between some of the words). Using a single whitespace character as the delimiter gives a word count of six, not four. So, try:
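use strict;
use warnings;

# Three-argument open with a lexical filehandle.
open(my $fh, '<', 'file.txt') or die "Could not open file: $!";
my ($lines, $words, $chars) = (0, 0, 0);
while (<$fh>) {
    $lines++;
    $chars += length($_);
    # A line that starts with a non-word character (space, quote, ...)
    # yields an empty leading field, so count only the non-empty fields.
    $words += grep { length } split(/\W+/, $_);
}
close($fh);
print("lines=$lines words=$words chars=$chars\n");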
Using \W+ as the delimiter will stop punctuation (amongst other things) from counting as words.
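To see how the two delimiters differ, here is a small sketch that counts the non-empty fields of one sample line under each. The helper name count_fields and the sample line are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: number of non-empty fields after splitting on $delim.
sub count_fields {
    my ($delim, $line) = @_;
    return scalar grep { length } split $delim, $line;
}

my $line = "  Wait - don't   stop , please!\n";

print "split on \\s+: ", count_fields(qr/\s+/, $line), "\n";
print "split on \\W+: ", count_fields(qr/\W+/, $line), "\n";
```

On this line \s+ gives 6, because the stray "-" and "," count as words, while \W+ gives 5, because the punctuation is dropped but "don't" is split into "don" and "t". Neither matches the four words a person would count, which is why the word count stays somewhat subjective.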
This may be helpful to Perl beginners. I tried to simulate MS Word's counting functionality and added one more feature that wc on Linux does not provide.
Here is the link: Counting words, characters and lines in a file