Create bins with awk histogram-like

浪子不回头ぞ submitted on 2021-02-08 07:51:17

Question


Here's my input file:

1.37987
1.21448
0.624999
1.28966
1.77084
1.088
1.41667

I would like to create bins of a size of my choice to get histogram-like output, e.g. something like this for bins of width 0.1, starting from 0:

0 0.1 0
...
0.5 0.6 0
0.6 0.7 1
...
1.0 1.1 1
1.1 1.2 0
1.2 1.3 2
1.3 1.4 1
...

My file is too big for R, so I'm looking for an awk solution (also open to anything else that I can understand, as I'm still a Linux beginner).

This was more or less already answered in the post "awk histogram in buckets", but the solution there is not working for me.


Answer 1:


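This is also possible:

awk -v size=0.1 '
   { q=$1/size; b=int(q); if (b>q) b--   # floor: int() truncates toward zero
     a[b]++; bmax=b>bmax?b:bmax; bmin=b<bmin?b:bmin }
   END { for(i=bmin;i<=bmax;++i) print i*size,(i+1)*size,a[i]+0 }' file

It does essentially the same as Ed Morton's solution, but starts printing from the lowest bucket seen (0 by default), so negative numbers are covered as well. Because int() truncates toward zero, the bucket index is floored explicitly; otherwise values between -size and 0 would be counted in the 0 bucket. The a[i]+0 makes empty buckets print 0 instead of an empty string.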
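The difference between truncating and flooring only shows up for negative values. Here is a small sketch to make it visible. The helper names bin_trunc and bin_floor are made up for illustration and are not part of the answer above:

```shell
# Hypothetical helpers: print the bucket index for value $1 and bin size $2.
bin_trunc() {   # plain int(): truncates toward zero
    awk -v v="$1" -v size="$2" 'BEGIN { print int(v / size) }'
}
bin_floor() {   # explicit floor: step down one when int() rounded up
    awk -v v="$1" -v size="$2" 'BEGIN { q = v / size; b = int(q); if (b > q) b--; print b }'
}

bin_trunc -0.15 0.1   # prints -1, i.e. the bucket labelled -0.1 to 0
bin_floor -0.15 0.1   # prints -2, i.e. the bucket labelled -0.2 to -0.1
```

For positive input both helpers agree, so the extra comparison costs nothing when the data never goes below zero.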




Answer 2:


This should be very close if not exactly right. Consider it a starting point at least and verify/figure out the math yourself (in particular decide/verify which bucket(s) an exact boundary match like 0.2 should go into - 0.1 to 0.2 and/or 0.2 to 0.3?):

$ cat tst.awk
BEGIN { delta = (delta == "" ? 0.1 : delta) }
{
    bucketNr = int(($0+delta) / delta)
    cnt[bucketNr]++
    numBuckets = (numBuckets > bucketNr ? numBuckets : bucketNr)
}
END {
    for (bucketNr=1; bucketNr<=numBuckets; bucketNr++) {
        end = beg + delta
        printf "%0.1f %0.1f %d\n", beg, end, cnt[bucketNr]
        beg = end
    }
}

$ awk -f tst.awk file
0.0 0.1 0
0.1 0.2 0
0.2 0.3 0
0.3 0.4 0
0.4 0.5 0
0.5 0.6 0
0.6 0.7 1
0.7 0.8 0
0.8 0.9 0
0.9 1.0 0
1.0 1.1 1
1.1 1.2 0
1.2 1.3 2
1.3 1.4 1
1.4 1.5 1
1.5 1.6 0
1.6 1.7 0
1.7 1.8 1

Note that you can assign the bucket delta size on the command line, 0.1 is just the default value:

$ awk -v delta='0.3' -f tst.awk file
0.0 0.3 0
0.3 0.6 0
0.6 0.9 1
0.9 1.2 1
1.2 1.5 4
1.5 1.8 1

$ awk -v delta='0.5' -f tst.awk file
0.0 0.5 0
0.5 1.0 1
1.0 1.5 5
1.5 2.0 1
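On the boundary question raised at the top of this answer: decimal fractions such as 0.1 and 0.3 are not exact in binary floating point, so a value sitting right on a bucket edge can land on either side of it. Here is a small sketch for checking this with your own awk, using a plain int(v/delta) index (tst.awk uses a slightly different formula and may round differently). The helper names bucket_index and bucket_index_eps and the 1e-9 tolerance are assumptions for illustration only:

```shell
# Hypothetical helpers: print the zero-based bucket index of value $1 for bin size $2.
bucket_index() {
    awk -v v="$1" -v delta="$2" 'BEGIN { print int(v / delta) }'
}

# Same, but nudged by a small tolerance so edge values go into the upper bucket.
# 1e-9 is an arbitrary assumption; choose something small relative to your data.
bucket_index_eps() {
    awk -v v="$1" -v delta="$2" 'BEGIN { print int(v / delta + 1e-9) }'
}

bucket_index 0.3 0.1       # prints 2: 0.3/0.1 is slightly below 3 in double precision
bucket_index_eps 0.3 0.1   # prints 3
```

Whether the tolerance is appropriate depends on how many decimals the input really carries, so treat it as something to verify rather than a default.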



Answer 3:


Here is my stab at solving this with Awk.

To run: awk -f belowscript.awk inputfile (awk has to be GNU awk here, because PROCINFO["sorted_in"] is a gawk extension)

BEGIN {
    PROCINFO["sorted_in"]="@ind_num_asc";
    delta = (delta == "") ? 0.1 : delta;
};

/^-?([0-9][0-9]*|[0-9]*(\.[0-9][0-9]*))/ {
    # Special case the [-delta - 0] case so it doesn't bin in the [0-delta] bin
    fractBin=$1/delta
    if (fractBin < 0 && int(fractBin) == fractBin)
        fractBin = fractBin+1
    prefix = ($1+0 < 0 && int(fractBin) == 0) ? "-" : ""   # only negative input goes into the "-0" bin, an exact 0 stays in bin 0
    bins[prefix int(fractBin)]++
}

END {
    for (var in bins)
    {
        srange = sprintf("%0.2f",delta * ((var >= 0) ? var : var-1))
        erange = sprintf("%0.2f",delta * ((var >= 0) ? var+1 : var))
        print srange " " erange " " bins[var]
    }
}

Some notes:

  • I added support for providing the bin size on the command line, like Ed Morton did.
  • It only prints the bins that contain something.
  • Which bin an exact boundary match goes into (the smaller or the larger one) naturally flips with this approach once the values go negative, so it required tweaking to stay consistent.
  • The 0 boundary needed special casing for numbers in the first negative bin, since there is no such number as -0. Awk's associative arrays use strings for keys, so "-0" was possible, and with the @ind_num_asc sort order for the for loop it seems to sort "-0" properly, though this may not be portable.
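If portability matters, one alternative to PROCINFO["sorted_in"] is to floor the bucket index (so no "-0" key is needed), print the non-empty bins in whatever order awk yields them, and let sort -n do the ordering. This is a minimal sketch under those assumptions, not the script above. The function name sparse_hist and its argument order are hypothetical:

```shell
# Hypothetical helper: sparse_hist FILE [DELTA]
# Prints "start end count" for non-empty bins only, ordered by start.
sparse_hist() {
    awk -v delta="${2:-0.1}" '
        # Only accept lines that look like plain decimal numbers.
        /^-?([0-9]+\.?[0-9]*|\.[0-9]+)$/ {
            q = $1 / delta
            b = int(q)
            if (b > q) b--          # floor, so negative values need no "-0" key
            bins[b]++
        }
        END {
            # Iteration order is unspecified here; sort fixes it afterwards.
            for (b in bins)
                printf "%.2f %.2f %d\n", b * delta, (b + 1) * delta, bins[b]
        }' "$1" | LC_ALL=C sort -n
}
```

Called as sparse_hist file 0.1, it should give the same non-empty rows as the gawk version for the sample input, with exact boundary values always going into the upper bin (subject to floating-point rounding).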


Source: https://stackoverflow.com/questions/49737975/create-bins-with-awk-histogram-like
