How do I count the characters, words, and lines in a file, using Perl?

前端 未结 10 1338
醉酒成梦
醉酒成梦 2020-12-31 03:21

What is a good/best way to count the number of characters, words, and lines of a text file using Perl (without using wc)?

相关标签:
10条回答
  • 2020-12-31 03:28

    There is the Perl Power Tools project whose goal is to reconstruct all the Unix bin utilities, primarily for those on operating systems deprived of Unix. Yes, they did wc. The implementation is overkill, but it is POSIX compliant.

    It gets a little ridiculous when you look at the GNU compliant implementation of true.

    0 讨论(0)
  • 2020-12-31 03:29

    Non-serious answer:

    system("wc foo");
    
    0 讨论(0)
  • 2020-12-31 03:30

    Here. Try this Unicode-savvy version of the wc program.

    • It skips non-file arguments (pipes, directories, sockets, etc).

    • It assumes UTF-8 text.

    • It counts any Unicode whitespace as a word separator.

    • It also accepts alternate encodings if there is a .ENCODING at the end of the filename, like foo.cp1252, foo.latin1, foo.utf16, etc.

    • It also work with files that have been compressed in a variety of formats.

    • It gives counts of Paragraphs, Lines, Words, Graphemes, Characters, and Bytes.

    • It understands all Unicode linebreak sequences.

    • It warns about corrupted textfiles with linebreak errors.

    Here’s an example of running it:

       Paras    Lines    Words   Graphs    Chars    Bytes File
           2     2270    82249   504169   504333   528663 /tmp/ap
           1     2404    11163    63164    63164    66336 /tmp/b3
        uwc: missing linebreak at end of corrupted textfiile /tmp/bad
          1*       2*        4       19       19       19 /tmp/bad
           1       14       52      273      273      293 /tmp/es
          57      383     1369    11997    11997    12001 /tmp/funny
           1   657068  3175429 31205970 31209138 32633834 /tmp/lw
           1        1        4       27       27       27 /tmp/nf.cp1252
           1        1        4       27       27       34 /tmp/nf.euc-jp
           1        1        4       27       27       27 /tmp/nf.latin1
           1        1        4       27       27       27 /tmp/nf.macroman
           1        1        4       27       27       54 /tmp/nf.ucs2
           1        1        4       27       27       56 /tmp/nf.utf16
           1        1        4       27       27       54 /tmp/nf.utf16be
           1        1        4       27       27       54 /tmp/nf.utf16le
           1        1        4       27       27      112 /tmp/nf.utf32
           1        1        4       27       27      108 /tmp/nf.utf32be
           1        1        4       27       27      108 /tmp/nf.utf32le
           1        1        4       27       27       39 /tmp/nf.utf7
           1        1        4       27       27       31 /tmp/nf.utf8
           1    26906   101528   635841   636026   661202 /tmp/o2
         131      346     1370     9590     9590     4486 /tmp/perl5122delta.pod.gz
         291      814     3941    25318    25318     9878 /tmp/perl51310delta.pod.bz2
           1     2551     5345   132655   132655   133178 /tmp/tailsort-pl.utf8
           1       89      334     1784     1784     2094 /tmp/til
           1        4       18       88       88      106 /tmp/w
         276     1736     5773    53782    53782    53804 /tmp/www
    

    Here ya go:

    #!/usr/bin/env perl 
    #########################################################################
    # uniwc - improved version of wc that works correctly with Unicode
    #
    # Tom Christiansen <tchrist@perl.com>
    # Mon Feb 28 15:59:01 MST 2011
    #########################################################################
    
    use 5.10.0;
    
    use strict;
    use warnings FATAL => "all";
    use sigtrap qw[ die untrapped normal-signals ];
    
    use Carp;
    
    $SIG{__WARN__}  = sub {
        confess("FATALIZED WARNING: @_")  unless $^S;
    };
    
    $SIG{__DIE__}  = sub {
        confess("UNCAUGHT EXCEPTION: @_")  unless $^S;
    };
    
    $| = 1;
    
    my $Errors = 0;
    my $Headers = 0;
    
    sub yuck($) {
        my $errmsg = $_[0];
        $errmsg =~ s/(?<=[^\n])\z/\n/;
        print STDERR "$0: $errmsg";
    }
    
    process_input(\&countem);
    
    sub countem { 
        my ($_, $file) = @_;
    
        my (
            @paras, @lines, @words,
            $paracount, $linecount, $wordcount, 
            $grafcount, $charcount, $bytecount,
        );
    
        if ($charcount = length($_)) {
            $wordcount = eval { @words = split m{ \p{Space}+  }x }; 
            yuck "error splitting words: $@" if $@;
    
            $linecount = eval { @lines = split m{ \R     }x }; 
            yuck "error splitting lines: $@" if $@;
    
            $grafcount = 0;
            $grafcount++ while /\X/g;
            #$grafcount = eval { @lines = split m{ \R     }x }; 
            yuck "error splitting lines: $@" if $@;
    
            $paracount = eval { @paras = split m{ \R{2,} }x }; 
            yuck "error splitting paras: $@" if $@;
    
            if ($linecount && !/\R\z/) {
                yuck("missing linebreak at end of corrupted textfiile $file");
                $linecount .= "*";
                $paracount .= "*";
            } 
        }
    
        $bytecount = tell;
        if (-e $file) {
            $bytecount = -s $file;
            if ($bytecount != -s $file) {
                yuck "filesize of $file differs from bytecount\n";
                $Errors++;
            }
        } 
        my $mask = "%8s " x 6 . "%s\n";
        printf  $mask => qw{ Paras Lines Words Graphs Chars Bytes File } unless $Headers++;
    
        printf $mask => map( { show_undef($_) } 
                                    $paracount, $linecount, 
                                    $wordcount, $grafcount, 
                                    $charcount, $bytecount,
                           ), $file;
    } 
    
    sub show_undef {
        my $value = shift;
        return defined($value)
                 ? $value
                 : "undef";
    } 
    
    END { 
        close(STDOUT) || die "$0: can't close STDOUT: $!";
        exit($Errors != 0);
    }
    
    sub process_input {
    
        my $function = shift();
    
        my $enc;
    
        if (@ARGV == 0 && -t) {
            warn "$0: reading from stdin, type ^D to end or ^C to kill.\n";
        }
    
        unshift(@ARGV, "-") if @ARGV == 0;
    
    FILE:
    
        for my $file (@ARGV) {
            # don't let magic open make an output handle
    
            next if -e $file && ! -f _;
    
            my $quasi_filename = fix_extension($file);
    
            $file = "standard input" if $file eq q(-);
            $quasi_filename =~ s/^(?=\s*[>|])/< /;
    
            no strict "refs";
            my $fh = $file;   # is *so* a lexical filehandle! ☺
            unless (open($fh, $quasi_filename)) {
                yuck("couldn't open $quasi_filename: $!");
                next FILE;
            }
            set_encoding($fh, $file) || next FILE;
    
            my $whole_file = eval {
                use warnings "FATAL" => "all";
                local $/;
                scalar <$fh>;
            };
    
            if ($@) {
                $@ =~ s/ at \K.*? line \d+.*/$file line $./;
                yuck($@);
                next FILE;
            }
    
            $function->($whole_file, $file);
    
            unless (close $fh) {
                yuck("couldn't close $quasi_filename at line $.: $!");
                next FILE;
            }
    
        } # foreach file
    
    }
    
    sub set_encoding(*$) {
        my ($handle, $path) = @_;
    
        my $enc_name = "utf8";
    
        if ($path && $path =~ m{ \. ([^\s.]+) \z }x) {
            my $ext = $1;
            die unless defined $ext;
            require Encode;
            if (my $enc_obj = Encode::find_encoding($ext)) {
                my $name = $enc_obj->name || $ext;
                $enc_name = "encoding($name)";
            }
        }
    
        return 1 if eval {
            use warnings FATAL => "all";
            no strict "refs";
            binmode($handle, ":$enc_name");
            1;
        };
    
        for ($@) {
            s/ at .* line \d+\.//;
            s/$/ for $path/;
        }
    
        yuck("set_encoding: $@");
    
        return undef;
    }
    
    sub fix_extension {
        my $path = shift();
        my %Compress = (
            Z       =>  "zcat",
            z       => "gzcat",            # for uncompressing
            gz      => "gzcat",
            bz      => "bzcat",
            bz2     => "bzcat",
            bzip    => "bzcat",
            bzip2   => "bzcat",
            lzma    => "lzcat",
        );
    
        if ($path =~ m{ \. ( [^.\s] +) \z }x) {
            if (my $prog = $Compress{$1}) {
                return "$prog $path |";
            } 
        } 
    
        return $path;
    

    }

    0 讨论(0)
  • 2020-12-31 03:31

    The Word Count tool counts characters, words and lines in text files

    0 讨论(0)
  • 2020-12-31 03:31

    To be able to count CHARS and not bytes, consider this:
    (Try it with Chinese or Cyrillic letters and file saved in utf8)

    use utf8;
    
    my $file='file.txt';
    my $LAYER = ':encoding(UTF-8)';
    open( my $fh, '<', $file )
      || die( "$file couldn't be opened: $!" );
    binmode( $fh, $LAYER );
    read $fh, my $txt, -s $file;
    close $fh;
    
    print length $txt,$/;
    use bytes;
    print length $txt,$/;
    
    0 讨论(0)
  • 2020-12-31 03:32

    I stumbled upon this while googling for a character count solution. Admittedly, I know next to nothing about perl so some of this may be off base, but here are my tweaks of newt's solution.

    First, there is a built-in line count variable anyway, so I just used that. This is probably a bit more efficient, I guess. As it is, the character count includes newline characters, which is probably not what you want, so I chomped $_. Perl also complained about the way the split() is done (implicit split, see: Why does Perl complain "Use of implicit split to @_ is deprecated"? ) so I tweaked that. My input files are UTF-8 so I opened them as such. That probably helps get the correct character count in the input file contains non-ASCII characters.

    Here's the code:

    open(FILE, "<:encoding(UTF-8)", "file.txt") or die "Could not open file: $!";
    
    my ($lines, $words, $chars) = (0,0,0);
    my @wordcounter;
    while (<FILE>) {
        chomp($_);
        $chars += length($_);
        @wordcounter = split(/\W+/, $_);
        $words += @wordcounter;
    }
    $lines = $.;
    close FILE;
    print "\nlines=$lines, words=$words, chars=$chars\n";
    
    0 讨论(0)
提交回复
热议问题