Create bins with awk histogram-like

Question

Here's my input file:

I would like to create bins of a size of my choice to get histogram-like output, e.g. something like this for 0.1 bins, starting from 0:

0 0.1 0
...
0.5 0.6 0
0.6 0.7 1
...
1.0 1.1 1
1.1 1.2 0
1.2 1.3 2
1.3 1.4 1
...

My file is too big for R, so I'm looking for an awk solution (also open to anything else that I can understand, as I'm still a Linux beginner).

This was more or less already answered in this post: awk histogram in buckets, but the solution is not working for me.

Answer 1:

This is also possible:

awk -v size=0.1 \
  '{ b=int($1/size); a[b]++; bmax=(b>bmax?b:bmax); bmin=(b<bmin?b:bmin) }
   END { for(i=bmin;i<=bmax;++i) print i*size,(i+1)*size,a[i]+0 }' file

It does essentially the same as Ed Morton's solution, but starts printing buckets from the smallest bucket index seen, which defaults to 0, so negative numbers are taken into account. Note that int() truncates toward zero, so all values between -size and size land in bucket 0.
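
Because int() truncates toward zero, a floor-based variant may be closer to what is wanted when the data contains negative values. The following is only a minimal sketch, not the code from this answer: bin_floor and floor are hypothetical names, and it assumes one number in the first field of each non-empty line.

```shell
# Hypothetical helper. Usage: bin_floor SIZE < numbers
bin_floor() {
  awk -v size="$1" '
    # int() truncates toward zero, so step down by one for negative non-integers
    function floor(x,   t) { t = int(x); return (t > x) ? t - 1 : t }
    BEGIN { bmin = 0; bmax = 0 }          # always include the bucket starting at 0
    NF {
      b = floor($1 / size)
      cnt[b]++
      if (b < bmin) bmin = b
      if (b > bmax) bmax = b
    }
    END {
      # print lower bound, upper bound, count; empty buckets print 0
      for (i = bmin; i <= bmax; i++)
        printf "%g %g %d\n", i * size, (i + 1) * size, cnt[i]
    }'
}
```

With this variant, -0.25 and 0.25 end up in different buckets for size=0.5, whereas the int()-based version counts both in bucket 0.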

Answer 2:

This should be very close if not exactly right. Consider it a starting point at least and verify/figure out the math yourself (in particular decide/verify which bucket(s) an exact boundary match like 0.2 should go into - 0.1 to 0.2 and/or 0.2 to 0.3?):

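$ cat tst.awk
BEGIN { delta = (delta == "" ? 0.1 : delta) }   # default bucket size unless set with -v
{
    # 1-based bucket number: values in [0, delta) go to bucket 1
    bucketNr = int(($0+delta) / delta)
    cnt[bucketNr]++
    numBuckets = (numBuckets > bucketNr ? numBuckets : bucketNr)
}
END {
    # beg is unset at this point, so the first bucket starts at 0
    for (bucketNr=1; bucketNr<=numBuckets; bucketNr++) {
        end = beg + delta
        printf "%0.1f %0.1f %d\n", beg, end, cnt[bucketNr]
        beg = end
    }
}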

$ awk -f tst.awk file
0.0 0.1 0
0.1 0.2 0
0.2 0.3 0
0.3 0.4 0
0.4 0.5 0
0.5 0.6 0
0.6 0.7 1
0.7 0.8 0
0.8 0.9 0
0.9 1.0 0
1.0 1.1 1
1.1 1.2 0
1.2 1.3 2
1.3 1.4 1
1.4 1.5 1
1.5 1.6 0
1.6 1.7 0
1.7 1.8 1

Note that you can set the bucket size delta on the command line; 0.1 is just the default value:

$ awk -v delta='0.3' -f tst.awk file
0.0 0.3 0
0.3 0.6 0
0.6 0.9 1
0.9 1.2 1
1.2 1.5 4
1.5 1.8 1

$ awk -v delta='0.5' -f tst.awk file
0.0 0.5 0
0.5 1.0 1
1.0 1.5 5
1.5 2.0 1
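
To check the boundary question for a particular value, the bucket formula from tst.awk can be evaluated on its own. This is a small sketch under the assumption that awk uses IEEE-754 double arithmetic (as common implementations do); bucket_of is a hypothetical helper name, not part of the answer.

```shell
# Hypothetical helper. Usage: bucket_of VALUE DELTA
# Prints the 1-based bucket number that the formula in tst.awk assigns to VALUE.
bucket_of() {
  awk -v v="$1" -v delta="$2" 'BEGIN { print int((v + delta) / delta) }'
}

bucket_of 0.2 0.1   # prints 3, i.e. the 0.2 to 0.3 bucket
bucket_of 0.7 0.1   # prints 7, i.e. the 0.6 to 0.7 bucket
```

The second result comes from binary rounding: 0.7 + 0.1 evaluates to slightly less than 0.8, so the quotient is just under 8 and int() drops it to 7. Exact boundary values can therefore fall on either side, which is worth verifying against your own data.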

Answer 3:

Here is my stab at solving this with Awk.

To run: awk -f belowscript.awk inputfile

BEGIN {
    PROCINFO["sorted_in"]="@ind_num_asc";
    delta = (delta == "") ? 0.1 : delta;
};

/^-?([0-9][0-9]*|[0-9]*(\.[0-9][0-9]*))/ {
    # Special case the [-delta, 0) bin so it doesn't merge with the [0, delta) bin
    fractBin=$1/delta
    if (fractBin < 0 && int(fractBin) == fractBin)
        fractBin = fractBin+1
    # Test $1 < 0 rather than fractBin <= 0, so an input of exactly 0 stays in the [0, delta) bin
    prefix = ($1 < 0 && int(fractBin) == 0) ? "-" : ""
    bins[prefix int(fractBin)]++
}

END {
    for (var in bins)
    {
        srange = sprintf("%0.2f",delta * ((var >= 0) ? var : var-1))
        erange = sprintf("%0.2f",delta * ((var >= 0) ? var+1 : var))
        print srange " " erange " " bins[var]
    }
}

Some notes:

I added support for providing the bin size on the command line, like Ed Morton did.
It only prints the bins that contain something.
Which bin an exact boundary match goes into (the smaller or the larger one) flips naturally with this approach when the values go negative, so it required tweaking to make it consistent.
The 0 boundary needed special casing for numbers in the first negative bin, since there is no such number as -0. Awk's associative arrays use strings for keys, so "-0" was possible, and with the @ind_num_asc sort order for the for loop it seems to sort "-0" properly, though this may not be portable (PROCINFO["sorted_in"] is a gawk extension).
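
If portability matters, one option is to avoid PROCINFO["sorted_in"] and the "-0" key altogether: compute a floor-based bin index and let sort order the output. The following is a rough sketch under those assumptions, not the script above; bin_sorted and floor are hypothetical names, and it assumes one number in the first field of each non-empty line.

```shell
# Hypothetical helper. Usage: bin_sorted DELTA < numbers
bin_sorted() {
  awk -v delta="$1" '
    # floor() gives negative values their own bin index, so no "-0" key is needed
    function floor(x,   t) { t = int(x); return (t > x) ? t - 1 : t }
    NF { bins[floor($1 / delta)]++ }
    END {
      # for-in order is unspecified in POSIX awk; sort restores numeric order below
      for (b in bins)
        printf "%.2f %.2f %d\n", b * delta, (b + 1) * delta, bins[b]
    }' | LC_ALL=C sort -k1,1n
}
```

Like the script above, this prints only the bins that contain something, with the lower bound of each bin inclusive.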

Source: https://stackoverflow.com/questions/49737975/create-bins-with-awk-histogram-like

Tags: bash, unix, dataframe, awk, grouping