Command line utility to print statistics of numbers in Linux

I often find myself with a file that has one number per line. I end up importing it into Excel to view things like the median, standard deviation and so forth.

Is there a

16 answers

囚心锁ツ

2020-11-30 19:01

Yep, it's called Perl, and here is a concise one-liner:

perl -e 'use List::Util qw(max min sum); @a=();while(<>){chomp; $sqsum+=$_*$_; push(@a,$_)}; $n=@a;$s=sum(@a);$avg=$s/$n;$m=max(@a);$mm=min(@a);$std=sqrt($sqsum/$n-($s/$n)*($s/$n));$mid=int @a/2;@srtd=sort {$a<=>$b} @a;if(@a%2){$med=$srtd[$mid];}else{$med=($srtd[$mid-1]+$srtd[$mid])/2;};print "records:$n\nsum:$s\navg:$avg\nstd:$std\nmed:$med\nmax:$m\nmin:$mm\n";'

Example

$ cat tt
1
3
4
5
6.5
7.
2
3
4

And the command

cat tt | perl -e 'use List::Util qw(max min sum); @a=();while(<>){chomp; $sqsum+=$_*$_; push(@a,$_)}; $n=@a;$s=sum(@a);$avg=$s/$n;$m=max(@a);$mm=min(@a);$std=sqrt($sqsum/$n-($s/$n)*($s/$n));$mid=int @a/2;@srtd=sort {$a<=>$b} @a;if(@a%2){$med=$srtd[$mid];}else{$med=($srtd[$mid-1]+$srtd[$mid])/2;};print "records:$n\nsum:$s\navg:$avg\nstd:$std\nmed:$med\nmax:$m\nmin:$mm\n";'
records:9
sum:35.5
avg:3.94444444444444
std:1.86256162380447
med:4
max:7.
min:1
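
For comparison, here is a rough Python sketch of the same calculations, assuming Python 3 is available. The function name `describe` is made up for illustration. Like the one-liner, it computes the population standard deviation (dividing by n, not n-1) and sorts numerically before taking the median.

```python
import math


def describe(numbers):
    """Return count, sum, mean, population std, median, max and min."""
    n = len(numbers)
    if n == 0:
        raise ValueError("no values")
    total = sum(numbers)
    mean = total / n
    # Same formula as the one-liner: sqrt(mean of squares - mean squared).
    # Clamp at 0 so rounding error cannot produce a negative variance.
    variance = max(sum(x * x for x in numbers) / n - mean * mean, 0.0)
    ordered = sorted(numbers)  # numeric sort, needed for the median
    mid = n // 2
    median = ordered[mid] if n % 2 else (ordered[mid - 1] + ordered[mid]) / 2
    return {
        "records": n,
        "sum": total,
        "avg": mean,
        "std": math.sqrt(variance),
        "med": median,
        "max": ordered[-1],
        "min": ordered[0],
    }


# The same nine values as the file tt in the example.
for name, value in describe([1, 3, 4, 5, 6.5, 7.0, 2, 3, 4]).items():
    print(f"{name}:{value}")
```

To use it on a file, the list could be built with something like `[float(line) for line in open("tt") if line.strip()]`.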


北恋

2020-11-30 19:03
You might also consider using clistats. It is a highly configurable command line interface tool to compute statistics for a stream of delimited input numbers.

I/O options
- Input data can be from a file, standard input, or a pipe
- Output can be written to a file, standard output, or a pipe
- Output uses headers that start with "#" to enable piping to gnuplot
Parsing options
- Signal, end-of-file, or blank line based detection to stop processing
- Comment and delimiter character can be set
- Columns can be filtered out from processing
- Rows can be filtered out from processing based on numeric constraint
- Rows can be filtered out from processing based on string constraint
- Initial header rows can be skipped
- Fixed number of rows can be processed
- Duplicate delimiters can be ignored
- Rows can be reshaped into columns
- Strictly enforce that only rows of the same size are processed
- A row containing column titles can be used to title output statistics
Statistics options
- Summary statistics (Count, Minimum, Mean, Maximum, Standard deviation)
- Covariance
- Correlation
- Least squares offset
- Least squares slope
- Histogram
- Raw data after filtering
NOTE: I'm the author.

面向向阳花

2020-11-30 19:09

Mean:

awk '{sum += $1} END {print "mean = " sum/NR}' filename

Median:

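gawk -v max=128 '

    # Return the median of the c values stored in v.
    function median(c,v,    j) { 
       asort(v,j) 
       if (c % 2) return j[(c+1)/2]
       else return (j[c/2+1]+j[c/2])/2.0
    }

    { 
       count++
       values[count]=$1
       # Once max values are buffered, print their median and start a new batch.
       if (count >= max) { 
         print  median(count,values); count=0
         delete values
       } 
    } 

    END { 
       # Median of the final, possibly partial, batch.
       if (count > 0) print  "median = " median(count,values)
    }
    ' filename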

Mode:

awk '{c[$1]++} END {for (i in c) {if (c[i]>max) {max=c[i]; mode=i}} print "mode = " mode}' filename

If several values are tied for the highest count, this prints only one of them (whichever the for-in loop visits first), but you see how it works...
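
Because several values can be equally frequent, a mode calculation may want to report all of them. Here is a minimal Python sketch of that idea. The helper name `modes` is hypothetical, and this is not a translation of the awk command above.

```python
from collections import Counter


def modes(values):
    """Return (list of most frequent values, their count), keeping ties."""
    counts = Counter(values)
    if not counts:
        raise ValueError("no values")
    top = max(counts.values())
    # Keep every value that reaches the highest count, in first-seen order.
    return [value for value, count in counts.items() if count == top], top


print(modes([1, 3, 4, 3, 4, 7]))  # 3 and 4 tie, each seen twice
```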

Standard Deviation:

awk '{sum+=$1; sumsq+=$1*$1} END {print "stdev = " sqrt(sumsq/NR - (sum/NR)^2)}' filename


日久生厌

2020-11-30 19:09

I found myself wanting to do this in a shell pipeline, and getting all the right arguments for R took a while. Here's what I came up with:

seq 10 | R --slave -e 'x <- scan(file="stdin",quiet=TRUE); summary(x)'
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   1.00    3.25    5.50    5.50    7.75   10.00 

The --slave option "Make(s) R run as quietly as possible... It implies --quiet and --no-save." The -e option tells R to treat the following string as R code. The first statement reads from standard input and stores what it reads in the variable x. The quiet=TRUE argument to scan suppresses the line reporting how many items were read. The second statement applies the summary function to x, which produces the output.
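
As a cross-check of the numbers in that output, a similar summary can be approximated in Python. This is only a sketch, assuming Python 3.8 or later for `statistics.quantiles` and `statistics.fmean`. The `inclusive` method is chosen here because it reproduces the quartiles shown above for this input, and the function name `summary` is illustrative.

```python
import statistics


def summary(values):
    """Return min, quartiles, mean and max, similar in spirit to R's summary()."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "Min.": min(values),
        "1st Qu.": q1,
        "Median": median,
        "Mean": statistics.fmean(values),
        "3rd Qu.": q3,
        "Max.": max(values),
    }


# Same input as `seq 10`.
print(summary(list(range(1, 11))))
```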


爱一瞬间的悲伤

2020-11-30 19:09

#!/usr/bin/perl
#
# stdev - figure N, min, max, median, mode, mean, & std deviation
#
# pull out all the real numbers in the input
# stream and run standard calculations on them.
# they may be intermixed with other text, need
# not be on the same or different lines, and 
# can be in scientific notation (avogadro=6.02e23).
# they also admit a leading + or -.
#
# Tom Christiansen
# tchrist@perl.com

use strict;
use warnings;

use List::Util qw< min max >;

#
my $number_rx = qr{

  # leading sign, positive or negative
    (?: [+-] ? )

  # mantissa
    (?= [0123456789.] )
    (?: 
        # "N" or "N." or "N.N"
        (?:
            (?: [0123456789] +     )
            (?:
                (?: [.] )
                (?: [0123456789] * )
            ) ?
      |
        # ".N", no leading digits
            (?:
                (?: [.] )
                (?: [0123456789] + )
            ) 
        )
    )

  # exponent
    (?:
        (?: [Ee] )
        (?:
            (?: [+-] ? )
            (?: [0123456789] + )
        )
        |
    )
}x;

my $n = 0;
my $sum = 0;
my @values = ();

my %seen = ();

while (<>) {
    while (/($number_rx)/g) {
        $n++;
        my $num = 0 + $1;  # 0+ is so numbers in alternate form count as same
        $sum += $num;
        push @values, $num;
        $seen{$num}++;
    } 
} 

die "no values" if $n == 0;

my $mean = $sum / $n;

my $sqsum = 0;
for (@values) {
    $sqsum += ( $_ ** 2 );
} 
$sqsum /= $n;
$sqsum -= ( $mean ** 2 );
my $stdev = sqrt($sqsum);

my $max_seen_count = max values %seen;
my @modes = grep { $seen{$_} == $max_seen_count } keys %seen;

my $mode = @modes == 1 
            ? $modes[0] 
            : "(" . join(", ", @modes) . ")";
$mode .= ' @ ' . $max_seen_count;

my $median;
my @sorted = sort { $a <=> $b } @values;  # the median needs sorted values
my $mid = int @sorted/2;
if (@sorted % 2) {
    $median = $sorted[ $mid ];
} else {
    $median = ($sorted[$mid-1] + $sorted[$mid])/2;
} 

my $min = min @values;
my $max = max @values;

printf "n is %d, min is %g, max is %g\n", $n, $min, $max;
printf "mode is %s, median is %g, mean is %g, stdev is %g\n", 
    $mode, $median, $mean, $stdev;
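
The core idea of the script, pulling every number out of free-form text before computing statistics, can be sketched in Python as well. The pattern below is a simplified stand-in rather than a port of the regex above, and `extract_numbers` is a hypothetical helper name.

```python
import re
import statistics

# Optional sign, then "N", "N.", "N.N" or ".N", then an optional exponent.
NUMBER_RX = re.compile(r"[+-]?(?:\d+(?:\.\d*)?|\.\d+)(?:[Ee][+-]?\d+)?")


def extract_numbers(text):
    """Return every number found in text as a float, in order of appearance."""
    return [float(match) for match in NUMBER_RX.findall(text)]


sample = "avogadro=6.02e23, offsets -1.5 and +2, then .25 and 7."
numbers = extract_numbers(sample)
print(numbers)
print("median:", statistics.median(numbers))  # median() sorts internally
```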


既然无缘

2020-11-30 19:12
Another tool: tsv-summarize, from eBay's tsv utilities. Min, max, mean, median, standard deviation are all supported. Intended for large data sets. Example:
```
$ seq 10 | tsv-summarize --min 1 --max 1 --median 1 --stdev 1
1    10    5.5    3.0276503541
```
Disclaimer: I'm the author.
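
The value 3.0276503541 in that output matches the sample standard deviation of 1..10 (dividing by n-1). A formula of the form sqrt(sumsq/n - mean^2) gives the population value (dividing by n), so the two approaches will not agree exactly. A short Python sketch showing both side by side, using only the standard `statistics` module:

```python
import statistics

data = list(range(1, 11))  # same values as `seq 10`

sample_sd = statistics.stdev(data)       # divides by n - 1
population_sd = statistics.pstdev(data)  # divides by n

print(f"sample:     {sample_sd:.10f}")
print(f"population: {population_sd:.10f}")
```

Which one is appropriate depends on whether the numbers are the whole population or a sample drawn from it.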