Command-line utility to print statistics of numbers in Linux

無奈伤痛 2020-11-30 18:46

I often find myself with a file that has one number per line. I end up importing it in excel to view things like median, standard deviation and so forth.

Is there a

16 Answers
  • 2020-11-30 19:01

    Yep, it's called Perl,
    and here is a concise one-liner:

    perl -e 'use List::Util qw(max min sum); @a=();while(<>){chomp; $sqsum+=$_*$_; push(@a,$_)}; $n=@a;$s=sum(@a);$avg=$s/$n;$m=max(@a);$mm=min(@a);$std=sqrt($sqsum/$n-($s/$n)*($s/$n));$mid=int(@a/2);@srtd=sort {$a <=> $b} @a;if(@a%2){$med=$srtd[$mid];}else{$med=($srtd[$mid-1]+$srtd[$mid])/2;};print "records:$n\nsum:$s\navg:$avg\nstd:$std\nmed:$med\nmax:$m\nmin:$mm\n";'
    

    Example

    $ cat tt
    1
    3
    4
    5
    6.5
    7.
    2
    3
    4
    

    And the command

    cat tt | perl -e 'use List::Util qw(max min sum); @a=();while(<>){chomp; $sqsum+=$_*$_; push(@a,$_)}; $n=@a;$s=sum(@a);$avg=$s/$n;$m=max(@a);$mm=min(@a);$std=sqrt($sqsum/$n-($s/$n)*($s/$n));$mid=int(@a/2);@srtd=sort {$a <=> $b} @a;if(@a%2){$med=$srtd[$mid];}else{$med=($srtd[$mid-1]+$srtd[$mid])/2;};print "records:$n\nsum:$s\navg:$avg\nstd:$std\nmed:$med\nmax:$m\nmin:$mm\n";'
    records:9
    sum:35.5
    avg:3.94444444444444
    std:1.86256162380447
    med:4
    max:7.
    min:1
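
    For comparison, here is a minimal Python sketch of the same summary. It is not part of the answer above; the function name `summarize` is made up for illustration, and it assumes the numbers have already been parsed into a non-empty list. It uses the population standard deviation (divide by n), which is what the `sqrt($sqsum/$n - mean^2)` expression in the one-liner computes.

```python
import statistics


def summarize(values):
    """Return the fields the one-liner prints, as a dict (hypothetical helper)."""
    n = len(values)
    total = sum(values)
    return {
        "records": n,
        "sum": total,
        "avg": total / n,
        # Population standard deviation: divides by n, not n - 1.
        "std": statistics.pstdev(values),
        # median() sorts numerically and averages the middle pair when n is even.
        "med": statistics.median(values),
        "max": max(values),
        "min": min(values),
    }


# The sample file `tt` from the example, inlined so the sketch runs on its own.
sample = [1, 3, 4, 5, 6.5, 7.0, 2, 3, 4]
for key, value in summarize(sample).items():
    print(f"{key}:{value}")
```

    To feed it a file with one number per line, something like `summarize([float(line) for line in sys.stdin if line.strip()])` should work after `import sys`.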
    
  • 2020-11-30 19:03

    You might also consider using clistats. It is a highly configurable command line interface tool to compute statistics for a stream of delimited input numbers.

    I/O options

    • Input data can be from a file, standard input, or a pipe
    • Output can be written to a file, standard output, or a pipe
    • Output uses headers that start with "#" to enable piping to gnuplot

    Parsing options

    • Signal, end-of-file, or blank line based detection to stop processing
    • Comment and delimiter character can be set
    • Columns can be filtered out from processing
    • Rows can be filtered out from processing based on numeric constraint
    • Rows can be filtered out from processing based on string constraint
    • Initial header rows can be skipped
    • Fixed number of rows can be processed
    • Duplicate delimiters can be ignored
    • Rows can be reshaped into columns
    • Strictly enforce that only rows of the same size are processed
    • A row containing column titles can be used to title output statistics

    Statistics options

    • Summary statistics (Count, Minimum, Mean, Maximum, Standard deviation)
    • Covariance
    • Correlation
    • Least squares offset
    • Least squares slope
    • Histogram
    • Raw data after filtering

    NOTE: I'm the author.

  • 2020-11-30 19:09

    Mean:

    awk '{sum += $1} END {print "mean = " sum/NR}' filename
    

    Median:

    gawk -v max=128 '
    
        function median(c,v,    j) { 
           asort(v,j) 
           if (c % 2) return j[(c+1)/2]
           else return (j[c/2+1]+j[c/2])/2.0
        }
    
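        { 
           count++
           values[count]=$1
           # every "max" lines, print the median of that chunk and start over
           if (count >= max) { 
             print  median(count,values); count=0
             delete values   # drop the finished chunk so stale entries are not sorted again
           } 
        } 
    
        END { 
           # median of the remaining lines (all of them if the file has fewer than "max")
           if (count > 0) print  "median = " median(count,values)
        }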
        ' filename
    

    Mode:

    awk '{c[$1]++} END {for (i in c) {if (c[i]>max) {max=c[i]; mode=i}} print "mode = " mode}' filename
    

    If several values tie for the highest count, this prints only one of them, but you see how it works...
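
    As a cross-check on the mode, here is a small Python sketch that reports every value sharing the highest count instead of just one. The function name `modes` is hypothetical, and the sketch assumes a non-empty list of already-parsed numbers.

```python
from collections import Counter


def modes(values):
    """Return (sorted list of the most frequent values, how often they occur)."""
    counts = Counter(values)
    top = max(counts.values())  # raises ValueError on empty input
    return sorted(v for v, c in counts.items() if c == top), top


# 3 and 4 each appear twice, so both are reported.
print(modes([1, 3, 4, 5, 6.5, 7, 2, 3, 4]))
```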

    Standard Deviation:

    awk '{sum+=$1; sumsq+=$1*$1} END {print "stdev = " sqrt(sumsq/NR - (sum/NR)**2)}' filename
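
    Note that this formula divides by NR, so it gives the population standard deviation. Some tools report the sample standard deviation (divide by NR - 1) instead, which is why results for the same input can differ. A short Python sketch of the difference, using the values 1 to 10 as an assumed example input:

```python
import statistics

data = list(range(1, 11))  # same values as `seq 10`

population_sd = statistics.pstdev(data)  # divides by n, like the awk formula
sample_sd = statistics.stdev(data)       # divides by n - 1

print(f"population stdev = {population_sd:.6f}")
print(f"sample stdev     = {sample_sd:.6f}")
```

    For this input the sample figure comes out near 3.0277, while the population figure is a bit smaller.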
    
  • 2020-11-30 19:09

    I found myself wanting to do this in a shell pipeline, and getting all the right arguments for R took a while. Here's what I came up with:

    $ seq 10 | R --slave -e 'x <- scan(file="stdin",quiet=TRUE); summary(x)'
       Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
       1.00    3.25    5.50    5.50    7.75   10.00 

    The --slave option "Make(s) R run as quietly as possible...It implies --quiet and --no-save." The -e option tells R to treat the following string as R code. The first statement reads from standard input and stores what is read in the variable x. The quiet=TRUE option to the scan function suppresses the line that reports how many items were read. The second statement applies the summary function to x, which produces the output.
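
    If R is not installed, a rough Python analogue of the same six-number summary might look like the sketch below. The function name `summary` mirrors the R call but is otherwise my own. The choice of `method="inclusive"` is an assumption: I believe it corresponds to R's default quantile type, and it does reproduce the quartiles shown above for 1 to 10.

```python
import statistics


def summary(values):
    """Return Min, 1st Qu., Median, Mean, 3rd Qu., Max as a dict."""
    values = list(values)
    # Quartiles with linear interpolation between the closest ranks.
    q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "Min.": min(values),
        "1st Qu.": q1,
        "Median": q2,
        "Mean": statistics.fmean(values),
        "3rd Qu.": q3,
        "Max.": max(values),
    }


for name, value in summary(range(1, 11)).items():
    print(f"{name:>8} {value:6.2f}")
```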

  • 2020-11-30 19:09
    #!/usr/bin/perl
    #
    # stdev - figure N, min, max, median, mode, mean, & std deviation
    #
    # pull out all the real numbers in the input
    # stream and run standard calculations on them.
    # they may be intermixed with other text, need
    # not be on the same or different lines, and 
    # can be in scientific notation (avogadro=6.02e23).
    # they also admit a leading + or -.
    #
    # Tom Christiansen
    # tchrist@perl.com
    
    use strict;
    use warnings;
    
    use List::Util qw< min max >;
    
    #
    my $number_rx = qr{
    
      # leading sign, positive or negative
        (?: [+-] ? )
    
      # mantissa
        (?= [0123456789.] )
        (?: 
            # "N" or "N." or "N.N"
            (?:
                (?: [0123456789] +     )
                (?:
                    (?: [.] )
                    (?: [0123456789] * )
                ) ?
          |
            # ".N", no leading digits
                (?:
                    (?: [.] )
                    (?: [0123456789] + )
                ) 
            )
        )
    
      # exponent (optional)
        (?:
            (?: [Ee] )
            (?:
                (?: [+-] ? )
                (?: [0123456789] + )
            )
            |
        )
    }x;
    
    my $n = 0;
    my $sum = 0;
    my @values = ();
    
    my %seen = ();
    
    while (<>) {
        while (/($number_rx)/g) {
            $n++;
            my $num = 0 + $1;  # 0+ is so numbers in alternate form count as same
            $sum += $num;
            push @values, $num;
            $seen{$num}++;
        } 
    } 
    
    die "no values" if $n == 0;
    
    my $mean = $sum / $n;
    
    my $sqsum = 0;
    for (@values) {
        $sqsum += ( $_ ** 2 );
    } 
    $sqsum /= $n;
    $sqsum -= ( $mean ** 2 );
    my $stdev = sqrt($sqsum);
    
    my $max_seen_count = max values %seen;
    my @modes = grep { $seen{$_} == $max_seen_count } keys %seen;
    
    my $mode = @modes == 1 
                ? $modes[0] 
                : "(" . join(", ", @modes) . ")";
    $mode .= ' @ ' . $max_seen_count;
    
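    my $median;
    # the median needs the values in numeric order, not input order
    my @sorted = sort { $a <=> $b } @values;
    my $mid = int(@sorted / 2);
    if (@sorted % 2) {
        $median = $sorted[ $mid ];
    } else {
        $median = ($sorted[$mid-1] + $sorted[$mid])/2;
    } 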
    
    my $min = min @values;
    my $max = max @values;
    
    printf "n is %d, min is %g, max is %g\n", $n, $min, $max;
    printf "mode is %s, median is %g, mean is %g, stdev is %g\n", 
        $mode, $median, $mean, $stdev;
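
    The number-matching idea in the script above (optional sign, mantissa in "N", "N.", "N.N" or ".N" form, optional exponent) can be sketched in Python as well. This is my own loose translation, not the script author's code; `NUMBER_RX` and `extract_numbers` are hypothetical names.

```python
import re

NUMBER_RX = re.compile(
    r"""
    [+-]?                          # optional leading sign
    (?: [0-9]+ (?: \. [0-9]* )?    # "N", "N." or "N.N"
      | \. [0-9]+                  # ".N", no leading digits
    )
    (?: [Ee] [+-]? [0-9]+ )?       # optional exponent
    """,
    re.VERBOSE,
)


def extract_numbers(text):
    """Return every number found in free-form text, converted to float."""
    return [float(match) for match in NUMBER_RX.findall(text)]


print(extract_numbers("avogadro=6.02e23, offset -3, ratio .5"))
```

    The resulting list can then be passed to whatever computes the statistics, so the input does not have to be one number per line.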
    
  • 2020-11-30 19:12

    Another tool: tsv-summarize, from eBay's tsv utilities. Min, max, mean, median, standard deviation are all supported. Intended for large data sets. Example:

    $ seq 10 | tsv-summarize --min 1 --max 1 --median 1 --stdev 1
    1    10    5.5    3.0276503541
    

    Disclaimer: I'm the author.
