Round to a multiple and filter in data.table

滥情空心 2020-12-21 05:36

I have a very interesting problem, though I'd rather not have one. I have to round a number to the closest multiple, so I followed the solution here. It used to work OK, until…

2 Answers
  • 2020-12-21 05:56

    Just to add to @Tens' great answer.

    Three things seem to be happening here:

    • You have a floating point issue (as mentioned already)
    • You are using an old data.table version
    • Secondary indices are kicking in while you aren't aware of it

    Using your setup

    library(data.table)
    options(digits = 20) # to see number representation
    
    mround <- function (number, multiple) {
      return(multiple * round(number / multiple))
    }
    
    DT = data.table(a = mround(112.3, 0.1), b = "B")
    

    So let's address the points above. You do have a floating-point issue; quoting ?setNumericRounding:

    Computers cannot represent some floating point numbers (such as 0.6) precisely, using base 2. This leads to unexpected behaviour when joining or grouping columns of type 'numeric'; i.e. 'double'.

    This led the data.table devs to implement setNumericRounding, which automatically rounded doubles so that the radix algorithm would behave as expected.
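    To see the issue concretely, here is a minimal standalone sketch (base R only, redefining mround as in the setup above):

```r
# mround as defined in the question
mround <- function(number, multiple) {
  multiple * round(number / multiple)
}

x <- mround(112.3, 0.1)
sprintf("%.17g", x)  # shows the stored double is slightly above 112.3
x == 112.3           # FALSE, so an exact == filter finds no rows
```

    The literal 112.3 and the computed mround(112.3, 0.1) are two different doubles, even though both print as 112.3 with default options.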

    Prior to v1.9.8, setNumericRounding(2) was the default (hence your first example works), but after some user complaints about inconsistency on GH (IIRC), since v1.9.8 the default was set back to setNumericRounding(0) in order to be consistent with data.frame behavior (see here). So if you update your data.table to the latest version, you will see that data.table and data.frame behave the same for both of your examples (and both of your examples will fail).

    Compare

    setNumericRounding(0)
    DT[a == 112.3]
    ## Empty data.table (0 rows) of 2 cols: a,b
    

    To

    setNumericRounding(1)
    DT[a == 112.3]
    #                     a b
    # 1: 112.30000000000001 B
    

    So you may ask, "what on earth does the radix algorithm have to do with anything here?". This brings us to the third point above: secondary indices (please read this). Let's see what actually happens when you run your code above

    options(datatable.verbose = TRUE)
    DT[a == 112.3] # works as expected, i.e returns one row
    # Creating new index 'a' <~~~~
    # forder took 0 sec
    # Starting bmerge ...done in 0 secs
    #                     a b
    # 1: 112.30000000000001 B
    

    Let's check the new secondary index

    indices(DT)
    #[1] "a"
    

    When you ran ==, data.table set a as your secondary index in order to perform future operations much more efficiently (this was introduced in v1.9.4, see here). In other words, you performed a binary join on a instead of the usual vector scan used prior to v1.9.4. (As a side note, this behaviour can be disabled with options(datatable.auto.index = FALSE); in that case, neither of your examples will work even with setNumericRounding(1), unless you explicitly specify a key using setkey or the on argument.)

    This probably also explains why

    DT[a == 112.30000 & b == 'B'] 
    

    doesn't work. You are subsetting here by two columns, and neither secondary indices nor binary joins (automatically) kick in for expressions such as == & == (yet). Hence you did a normal vector scan and setNumericRounding(1) didn't kick in

    Though, you can set the keys manually and make it work. For instance (as I commented under @Tens' answer), you can do

    setNumericRounding(1) # make sure auto-rounding is turned on
    DT[.(112.3, 'B'), nomatch = 0L, on = .(a, b)]
    # Calculated ad hoc index in 0 secs
    # Starting bmerge ...done in 0 secs
    #        a b
    # 1: 112.3 B
    

    Or using the old way

    setkey(DT, a, b)
    DT[.(112.3, 'B'), nomatch = 0L]
    # Starting bmerge ...done in 0 secs
    #        a b
    # 1: 112.3 B
    
  • 2020-12-21 06:02

    It's a problem of floating-point precision. See DT[abs(a - 112.3) < 1e-6 & b == 'B']; using an error margin of 0.000001 will give you the proper result.

    If you want more precision you can use .Machine$double.eps^0.5, as all.equal does.

    The general advice is to never compare floats for equality, but to compare the difference against a value close enough to machine precision (to get around the precision drift between 0 and 1); more details here
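    A small sketch of that advice (base R; the value mimics what mround(112.3, 0.1) stores in the a column):

```r
x <- 0.1 * round(112.3 / 0.1)  # what mround(112.3, 0.1) stores in DT$a

x == 112.3                                # FALSE: exact comparison fails
abs(x - 112.3) < .Machine$double.eps^0.5  # TRUE: tolerance comparison succeeds
isTRUE(all.equal(x, 112.3))               # TRUE: all.equal's default tolerance is about the same
```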

    One way to fix your problem could be to refactor your function to:

    mround <- function(number, multiple,
                       digits = nchar(strsplit(as.character(multiple), ".", fixed = TRUE)[[1]][2])) {
      round(multiple * round(number / multiple), digits)
    }
    

    I used a "convoluted" method to get the number of digits from the multiple passed in, as the default precision; adapt it to your needs (you may use 2 here for example, or force the precision when calling).
    I also removed the unnecessary return(), which just makes the interpreter look up one more function at the end of the call.
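    For instance, the default digits computation resolves like this (a sketch with hypothetical multiples; digits_of is just the default-argument expression pulled out as a helper):

```r
# the default-argument expression from mround, extracted as a helper
digits_of <- function(multiple) {
  # count the digits after the decimal point in the multiple's character form
  nchar(strsplit(as.character(multiple), ".", fixed = TRUE)[[1]][2])
}

digits_of(0.1)   # 1
digits_of(0.25)  # 2
```

    Note that an integer multiple has no fractional part, so in that case you would pass digits explicitly.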

    This way your output should be precise enough, but you'll still have corner cases:

    > mround(112.37,0.2)
    [1] 112.40000000000001
    

    To use floats in joins, you can use (courtesy of David Arenburg):

    setNumericRounding(1)
    DT[.(112.3, 'B'), nomatch = 0L, on = .(a, b)]
    