Question
Here's my input file:
1.37987
1.21448
0.624999
1.28966
1.77084
1.088
1.41667
I would like to create bins of a size of my choice to get histogram-like output, e.g. something like this for bins of width 0.1, starting from 0:
0 0.1 0
...
0.5 0.6 0
0.6 0.7 1
...
1.0 1.1 1
1.1 1.2 0
1.2 1.3 2
1.3 1.4 1
...
My file is too big for R, so I'm looking for an awk solution (also open to anything else that I can understand, as I'm still a Linux beginner).
This was sort of already answered in this post: awk histogram in buckets, but that solution is not working for me.
Answer 1:
This is also possible:
awk -v size=0.1 \
  '{ b=int($1/size); a[b]++; bmax=b>bmax?b:bmax; bmin=b<bmin?b:bmin }
   END { for(i=bmin;i<=bmax;++i) print i*size,(i+1)*size,a[i]+0 }' <file>
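It essentially does the same as Ed Morton's solution, but starts printing buckets from the lowest bucket seen (which defaults to 0), so negative numbers are taken into account. Note that int() truncates toward zero, so values between -size and 0 are counted in the same bucket as values between 0 and size.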
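If negative values need their own buckets, one option is to round down instead of truncating. The following is only a sketch of that idea, not the answer's original code: `bin_floor` and `floor` are hypothetical names introduced here (awk has no built-in floor), and the `seen` flag is an added assumption so that the minimum and maximum start from the first value read rather than from 0.

```shell
# bin_floor SIZE: read one number per line on stdin, print "start end count".
# Hypothetical helper; floor() is defined below because awk only has int().
bin_floor() {
  awk -v size="$1" '
    # Round toward negative infinity (int() rounds toward zero).
    function floor(x,   t) { t = int(x); return (t > x) ? t - 1 : t }
    NF {
      b = floor($1 / size)
      a[b]++
      if (!seen || b < bmin) bmin = b
      if (!seen || b > bmax) bmax = b
      seen = 1
    }
    END {
      if (!seen) exit
      # a[i]+0 prints 0 for buckets that never received a value.
      for (i = bmin; i <= bmax; ++i) print i * size, (i + 1) * size, a[i] + 0
    }'
}

printf '%s\n' -0.05 0.05 0.15 | bin_floor 0.1
```

With this input, -0.05 is counted in the bucket from -0.1 to 0 instead of being merged into the bucket from 0 to 0.1.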
Answer 2:
This should be very close if not exactly right. Consider it a starting point at least and verify/figure out the math yourself (in particular decide/verify which bucket(s) an exact boundary match like 0.2 should go into: 0.1 to 0.2 and/or 0.2 to 0.3?):
$ cat tst.awk
BEGIN { delta = (delta == "" ? 0.1 : delta) }
{
    bucketNr = int(($0+delta) / delta)
    cnt[bucketNr]++
    numBuckets = (numBuckets > bucketNr ? numBuckets : bucketNr)
}
END {
    for (bucketNr=1; bucketNr<=numBuckets; bucketNr++) {
        end = beg + delta
        printf "%0.1f %0.1f %d\n", beg, end, cnt[bucketNr]
        beg = end
    }
}
$ awk -f tst.awk file
0.0 0.1 0
0.1 0.2 0
0.2 0.3 0
0.3 0.4 0
0.4 0.5 0
0.5 0.6 0
0.6 0.7 1
0.7 0.8 0
0.8 0.9 0
0.9 1.0 0
1.0 1.1 1
1.1 1.2 0
1.2 1.3 2
1.3 1.4 1
1.4 1.5 1
1.5 1.6 0
1.6 1.7 0
1.7 1.8 1
Note that you can set the bucket size (delta) on the command line; 0.1 is just the default value:
$ awk -v delta='0.3' -f tst.awk file
0.0 0.3 0
0.3 0.6 0
0.6 0.9 1
0.9 1.2 1
1.2 1.5 4
1.5 1.8 1
$ awk -v delta='0.5' -f tst.awk file
0.0 0.5 0
0.5 1.0 1
1.0 1.5 5
1.5 2.0 1
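To follow the advice about checking boundary values, a small test harness can help. The sketch below applies the same int(($0+delta)/delta) rule as tst.awk to a few hand-picked values and prints the bucket range each one is assigned to. `which_bucket` is a hypothetical helper name, and the NF guard that skips blank lines is an addition. Because the division is done in floating point, where an exact boundary value ends up is best checked by running it rather than assumed.

```shell
# which_bucket DELTA: for each number on stdin, print the value followed by
# the bucket range that the int(($0+delta)/delta) rule assigns it to.
# Hypothetical helper used only for checking; not part of tst.awk.
which_bucket() {
  awk -v delta="$1" '
    NF {
      bucketNr = int(($0 + delta) / delta)
      # Bucket n covers (n-1)*delta to n*delta, matching the END loop in tst.awk.
      printf "%s -> %g %g\n", $0, (bucketNr - 1) * delta, bucketNr * delta
    }'
}

printf '%s\n' 0.2 0.25 0.3 0.7 | which_bucket 0.1
```

Feeding it the boundary values you care about, for each delta you intend to use, shows directly whether the lower or the upper bucket receives them.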
Answer 3:
Here is my stab at solving this with Awk.
To run: awk -f belowscript.awk inputfile
BEGIN {
    PROCINFO["sorted_in"]="@ind_num_asc";
    delta = (delta == "") ? 0.1 : delta;
};
/^-?([0-9][0-9]*|[0-9]*(\.[0-9][0-9]*))/ {
    # Special case the [-delta - 0] case so it doesn't bin in the [0-delta] bin
    fractBin=$1/delta
    if (fractBin < 0 && int(fractBin) == fractBin)
        fractBin = fractBin+1
    prefix = (fractBin <= 0 && int(fractBin) == 0) ? "-" : ""
    bins[prefix int(fractBin)]++
}
END {
    for (var in bins)
    {
        srange = sprintf("%0.2f",delta * ((var >= 0) ? var : var-1))
        erange = sprintf("%0.2f",delta * ((var >= 0) ? var+1 : var))
        print srange " " erange " " bins[var]
    }
}
Some notes:
- I added support for providing the bin size on the command line like Ed Morton did.
- It only prints the bins that contain something.
- Which bin an exact boundary match goes into (the smaller or the larger one) naturally flips for negative values with this approach, so it required tweaking to make it consistent.
- The 0 boundary needed special casing for numbers in the first negative bin, since there is no such integer as -0. Awk's associative arrays use strings for keys, so "-0" was possible, and the @ind_num_asc sort order for the for loop seems to sort "-0" properly. PROCINFO["sorted_in"] is a GNU awk feature, though, so this may not be portable.
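If portability matters, one possible alternative (a different technique from the script above, not the author's method) is to drop PROCINFO["sorted_in"] and the "-0" key altogether: index bins by a rounded-down integer and let `sort -n` order the output. In this sketch `sparse_bins` and `floor` are hypothetical names, and boundary handling may differ from the script above (a value of exactly 0, for instance, lands in the 0 to delta bin here).

```shell
# sparse_bins DELTA: print "start end count" for non-empty bins only,
# in ascending order. Hypothetical helper; ordering is done by sort,
# so no gawk-specific features are needed.
sparse_bins() {
  awk -v delta="$1" '
    # Round toward negative infinity so negative values get their own bins.
    function floor(x,   t) { t = int(x); return (t > x) ? t - 1 : t }
    /^-?([0-9]+|[0-9]*\.[0-9]+)/ { bins[floor($1 / delta)]++ }
    END {
      # for-in order is unspecified in awk; sort fixes the order afterwards.
      for (b in bins)
        printf "%.2f %.2f %d\n", b * delta, (b + 1) * delta, bins[b]
    }' | LC_ALL=C sort -n
}

printf '%s\n' 1.25 -0.05 0.62 1.21 | sparse_bins 0.1
```

The trade-off is an extra process for sorting, in exchange for a script that should run on any POSIX awk.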
Source: https://stackoverflow.com/questions/49737975/create-bins-with-awk-histogram-like