Command-line utility to print statistics of numbers in Linux

無奈伤痛 2020-11-30 18:46

I often find myself with a file that has one number per line. I end up importing it into Excel to view things like the median, standard deviation, and so forth.

Is there a command-line utility that will do this on Linux?

16 answers
  • 2020-11-30 19:13

    For the average, median and standard deviation you can use awk; this will generally be faster than R-based solutions. For instance, the following prints the average:

    awk '{a+=$1} END{print a/NR}' myfile
    

    (NR is a built-in awk variable holding the number of records read so far; $1 is the first whitespace-separated field of the line ($0 would be the whole line, which also works here but is in principle less robust, although for this computation awk would coerce it to its leading number anyway); and the END block runs after the whole file has been processed. One could also initialize a to 0 in a BEGIN{a=0} block, although uninitialized awk variables already default to 0.)

    Here is a simple awk script that prints more detailed statistics (it takes a CSV file as input; change FS for other delimiters):

    #!/usr/bin/awk -f
    
    BEGIN {
        FS=",";
    }
    {
       a += $1;
       b[++i] = $1;
    }
    END {
        m = a/NR; # mean
        for (i in b)
        {
            d += (b[i]-m)^2;
            e += (b[i]-m)^3;
            f += (b[i]-m)^4;
        }
        va = d/NR; # variance
        sd = sqrt(va); # standard deviation
        sk = (e/NR)/sd^3; # skewness
        ku = (f/NR)/sd^4-3; # standardized kurtosis
        print "N,sum,mean,variance,std,SEM,skewness,kurtosis"
        print NR "," a "," m "," va "," sd "," sd/sqrt(NR) "," sk "," ku
    }
    
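    Assuming the script above is saved as stats.awk (the filename is an arbitrary choice), it can be invoked with awk -f, which avoids depending on the shebang path:

    ```shell
    # Run the statistics script on a CSV whose first column holds the numbers.
    # stats.awk is assumed to contain the script shown above.
    awk -f stats.awk data.csv
    ```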

    It is straightforward to add min/max to this script, but it is just as easy to pipe through sort with head/tail:

    sort -n myfile | head -n1
    sort -n myfile | tail -n1
    
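    The median mentioned above is not actually computed by the script; a minimal sketch (assuming a non-empty file with one number per line) is to sort first and let awk pick the middle element:

    ```shell
    # Median: sort numerically, buffer the values, then take the middle one
    # (or the mean of the two middle ones for an even count).
    sort -n myfile | awk '
        { v[NR] = $1 }
        END {
            if (NR % 2) print v[(NR + 1) / 2]
            else        print (v[NR / 2] + v[NR / 2 + 1]) / 2
        }'
    ```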
  • 2020-11-30 19:13

    Yet another tool that can calculate statistics and display a distribution in ASCII mode is ministat. It comes from FreeBSD, but it is also packaged for popular Linux distributions such as Debian/Ubuntu. Alternatively, you can download and build it from source; it only requires a C compiler and the C standard library.

    Usage example:

    $ cat test.log 
    Handled 1000000 packets.Time elapsed: 7.575278
    Handled 1000000 packets.Time elapsed: 7.569267
    Handled 1000000 packets.Time elapsed: 7.540344
    Handled 1000000 packets.Time elapsed: 7.547680
    Handled 1000000 packets.Time elapsed: 7.692373
    Handled 1000000 packets.Time elapsed: 7.390200
    Handled 1000000 packets.Time elapsed: 7.391308
    Handled 1000000 packets.Time elapsed: 7.388075
    
    $ cat test.log | awk '{print $5}' | ministat -w 74
    x <stdin>
    +--------------------------------------------------------------------------+
    | x                                                                        |
    |xx                                   xx    x x                           x|
    |   |__________________________A_______M_________________|                 |
    +--------------------------------------------------------------------------+
        N           Min           Max        Median           Avg        Stddev
    x   8      7.388075      7.692373       7.54768     7.5118156    0.11126122
    
  • 2020-11-30 19:15

    This is a breeze with R. For a file that looks like this:

    1
    2
    3
    4
    5
    6
    7
    8
    9
    10
    

    Use this:

    R -q -e "x <- read.csv('nums.txt', header = F); summary(x); sd(x[ , 1])"
    

    To get this:

           V1       
     Min.   : 1.00  
     1st Qu.: 3.25  
     Median : 5.50  
     Mean   : 5.50  
     3rd Qu.: 7.75  
     Max.   :10.00  
    [1] 3.02765
    
    • The -q flag squelches R's startup licensing and help output
    • The -e flag tells R you'll be passing an expression from the terminal
    • x is a data.frame - a table, basically. It's a structure that accommodates multiple vectors/columns of data, which is a little peculiar if you're just reading in a single vector. This has an impact on which functions you can use.
    • Some functions, like summary(), naturally accommodate data.frames. If x had multiple fields, summary() would provide the above descriptive stats for each.
    • But sd() can only take one vector at a time, which is why I index x for that command (x[ , 1] returns the first column of x). You could use apply(x, MARGIN = 2, FUN = sd) to get the SDs for all columns.
  • 2020-11-30 19:15

    Using "st" (https://github.com/nferraz/st)

    $ st numbers.txt
    N    min   max   sum   mean  stddev
    10   1     10    55    5.5   3.02765
    

    Or:

    $ st numbers.txt --transpose
    N      10
    min    1
    max    10
    sum    55
    mean   5.5
    stddev 3.02765
    

    (DISCLAIMER: I wrote this tool :))

  • 2020-11-30 19:15

    Just in case, there is also datastat, a simple program that computes basic statistics from the Linux command line. For example,

    cat file.dat | datastat
    

    will output the average value across all rows for each column of file.dat. If you also need the standard deviation, min, or max, add the --dev, --min and --max options, respectively.

    datastat can also aggregate rows based on the value of one or more "key" columns. For example,

    cat file.dat | datastat -k 1
    

    will produce, for each distinct value found in the first column (the "key"), the average of every other column, aggregated over all rows sharing that key value. You can use more columns as key fields (e.g., -k 1-3, -k 2,4, etc.).

    It is written in C++, runs fast with a small memory footprint, and pipes nicely with other tools such as cut, grep, sed, sort, and awk.
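    If datastat happens not to be installed, the column-wise average/min/max it computes can be sketched in plain awk (this is an illustrative stand-in, not datastat itself; it assumes whitespace-separated numeric columns):

    ```shell
    # Per-column mean, min, and max for a whitespace-separated numeric file.
    awk '
        { if (NF > nc) nc = NF
          for (c = 1; c <= NF; c++) {
              s[c] += $c
              if (NR == 1 || $c < lo[c]) lo[c] = $c
              if (NR == 1 || $c > hi[c]) hi[c] = $c
          } }
        END { for (c = 1; c <= nc; c++)
                  printf "col%d: mean=%g min=%g max=%g\n", c, s[c] / NR, lo[c], hi[c] }
    ' file.dat
    ```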

  • 2020-11-30 19:18

    Using xsv:

    $ echo '3 1 4 1 5 9 2 6 5 3 5 9' |tr ' ' '\n' > numbers-one-per-line.csv
    
    $ xsv stats -n < numbers-one-per-line.csv 
    field,type,sum,min,max,min_length,max_length,mean,stddev
    0,Integer,53,1,9,1,1,4.416666666666667,2.5644470922381863
    
    # mode/median/cardinality not shown by default since it requires storing full file in memory:
    $ xsv stats -n --everything < numbers-one-per-line.csv | xsv table
    field  type     sum  min  max  min_length  max_length  mean               stddev              median  mode  cardinality
    0      Integer  53   1    9    1           1           4.416666666666667  2.5644470922381863  4.5     5     7
    