Question
I am using Datamash 1.7 on a CentOS 7.7 Linux x86_64 machine to sort and bin data that is 24 GB in size. The input data looks as follows (only the first 50 samples):
Ind_poob
0.040618
0.006233
0.004652
0.003559
0.001752
0.001605
0.007701
0.004722
0.029899
0.00104
0.014031
6.1e-5
0.002144
0.002385
0.001145
0
0.001463
0
0.003414
0
0.001602
9.75e-4
0.007218
6.4e-5
0.006426
0
7.2e-5
1.13e-4
1.5e-4
0
4.19e-4
0.009325
7e-5
0.006592
0.01
0
0.001605
0.001924
0.003714
0.00335
0.001876
5.52e-4
0
0.019234
0.001415
1e-5
0
0.004304
2.15e-4
Desired Output (after scaling up)
#number bin_number
4061.8 4061.8
623.3 620.00
465.2 460.00
355.9 350.00
175.2 170.00
160.5 160.00
770.1 770.00
472.2 470.00
2989.9 2980.00
104 100.00
1403.1 1400.00
6.1 0.00
214.4 210.00
238.5 230.00
114.5 110.00
0 0.00
146.3 140.00
0 0.00
341.4 340.00
0 0.00
160.2 160.00
97.5 90.00
721.8 720.00
6.4 0.00
642.6 640.00
0 0.00
7.2 0.00
11.3 10.00
15 10.00
0 0.00
41.9 40.00
932.5 930.00
7 0.00
659.2 650.00
1000 1000.00
0 0.00
160.5 160.00
192.4 190.00
371.4 370.00
335 330.00
187.6 180.00
55.2 50.00
0 0.00
1923.4 1920.00
141.5 140.00
1 0.00
0 0.00
430.4 430.00
21.5 20.00
But with the Datamash command datamash -H --format=%.8f -s bin 1 < test_data.txt, I am getting:
bin(ind_poob)
0.00000000
0.00000000
0.00000000
0.00000000
0.00000000
0.00000000
0.00000000
0.00000000
0.00000000
0.00000000
0.00000000
0.00000000
0.00000000
0.00000000
0.00000000
0.00000000
0.00000000
0.00000000
0.00000000
0.00000000
0.00000000
0.00000000
0.00000000
0.00000000
0.00000000
0.00000000
0.00000000
0.00000000
0.00000000
0.00000000
0.00000000
0.00000000
0.00000000
0.00000000
0.00000000
0.00000000
0.00000000
0.00000000
0.00000000
0.00000000
0.00000000
0.00000000
0.00000000
0.00000000
0.00000000
0.00000000
0.00000000
0.00000000
0.00000000
0.00000000
0.00000000
How can I write the datamash command so that it sorts and bins the input data with the correct floating-point format? Secondly, will it be possible to plot the data with Gnuplot after binning, given that the original input is 24 GB in size?
Answer 1:
Looking at the source (since, unfortunately, binning isn't described very well in the documentation), numeric binning is done by this code:
const long double val = num_value / op->params.bin_bucket_size;
modfl (val, & op->value);
/* signbit will take care of negative-zero as well. */
if (signbit (op->value))
--op->value;
op->value *= op->params.bin_bucket_size;
Basically, it divides the number by the bucket size (100 by default), keeps the integer part, and multiplies that by the bucket size. Since all the numbers in your sample data are in the range [0,1), every one of them lands in the same 0 bucket.
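To make the arithmetic concrete, here is a minimal Python sketch that mirrors the quoted C snippet. The function name bin_value is made up for illustration and is not part of datamash; the logic is only a re-expression of the six lines above, not the library's actual code.

```python
import math

def bin_value(x: float, bucket_size: float = 100.0) -> float:
    """Hypothetical helper mirroring the quoted C snippet.

    Truncate x / bucket_size toward zero, step results with a negative
    sign (including -0.0) down by one, then scale back by the bucket size.
    """
    _, whole = math.modf(x / bucket_size)   # whole keeps the sign of the input
    if math.copysign(1.0, whole) < 0:       # plays the role of signbit()
        whole -= 1
    return whole * bucket_size

# Values from the question: with the default bucket size they all map to 0.
for v in (0.040618, 0.006233, 6.1e-5):
    print(v, bin_value(v))
```

Calling bin_value(4061.8, 10) instead puts the value in the 4060 bucket, which is why the input has to be scaled (or the bucket size reduced) before the bins become distinguishable.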
You might try scaling your data by multiplying it by 1e4 (or more) to see if that gives you better numbers. Also, there is no need to sort the data; you can leave off the -s option.
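As a rough sketch of the scaling idea, the following Python generator multiplies each value and bins it while reading the input line by line, so the whole 24 GB file never has to be held in memory. The scale factor 1e5 and the bucket size 10 are assumptions inferred from the desired output in the question (0.040618 becomes 4061.8, and 623.3 falls in bin 620.00); the name scale_and_bin is hypothetical. It uses math.floor, which matches the behaviour described above for the non-negative values in the sample but is not guaranteed to match datamash for negative input.

```python
import io
import math

def scale_and_bin(lines, scale=1e5, bucket_size=10.0):
    """Yield (scaled_value, bucket) pairs; the first line is a header.

    scale and bucket_size are assumed values taken from the question's
    desired output, not datamash defaults.
    """
    it = iter(lines)
    next(it, None)                      # skip the header line (Ind_poob)
    for line in it:
        line = line.strip()
        if not line:
            continue                    # ignore blank lines
        scaled = float(line) * scale
        bucket = math.floor(scaled / bucket_size) * bucket_size
        yield scaled, bucket

# Small in-memory stand-in for test_data.txt
sample = io.StringIO("Ind_poob\n0.040618\n0.006233\n6.1e-5\n0\n")
for scaled, bucket in scale_and_bin(sample):
    print(f"{scaled:g} {bucket:.2f}")
```

For the real file, passing an open file object (for example open("test_data.txt")) in place of sample would stream it one line at a time.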
Another approach is to treat the values as strings, not numbers, and use strbin, which uses a different algorithm that might work better for you:
$ datamash -H --full strbin:100 1 < test_data.txt
Ind_poob strbin(Ind_poob)
0.040618 60
0.006233 27
0.004652 70
0.003559 5
0.001752 30
0.001605 29
0.007701 37
0.004722 78
0.029899 25
0.00104 60
0.014031 17
6.1e-5 93
0.002144 84
0.002385 21
0.001145 57
...
Source: https://stackoverflow.com/questions/62970836/datamash-1-7-outputs-zero-on-floating-point-values-binning