data.table join and j-expression unexpected behavior

无人及你 2020-12-17 03:41

In R 2.15.0 and data.table 1.8.9:

d = data.table(a = 1:5, value = 2:6, key = "a")

d[J(3), value]
#   a value
#   3     4

d[J(3)         


        
4 Answers
  • 2020-12-17 04:13

    As of data.table 1.9.3, the default behavior has been changed and the examples below produce the same result. To get the by-without-by result, one now has to specify an explicit by=.EACHI:

    d = data.table(a = 1:5, value = 2:6, key = "a")
    
    d[J(3), value]
    #[1] 4
    
    d[J(3), value, by = .EACHI]
    #   a value
    #1: 3     4
    

    And here's a slightly more complicated example, illustrating the difference:

    d = data.table(a = 1:2, b = 1:6, key = 'a')
    #   a b
    #1: 1 1
    #2: 1 3
    #3: 1 5
    #4: 2 2
    #5: 2 4
    #6: 2 6
    
    # normal join
    d[J(c(1,2)), sum(b)]
    #[1] 21
    
    # join with a by-without-by, or by-each-i
    d[J(c(1,2)), sum(b), by = .EACHI]
    #   a V1
    #1: 1  9
    #2: 2 12
    
    # and a more complicated example:
    d[J(c(1,2,1)), sum(b), by = .EACHI]
    #   a V1
    #1: 1  9
    #2: 2 12
    #3: 1  9
    
  • 2020-12-17 04:13

    This is not unexpected behaviour; it is documented behaviour. Arun has done a good job of explaining and demonstrating where in the FAQ this is documented.

    There is a feature request, FR 1757, that proposes the use of the drop argument in this case.

    When implemented, the behaviour you want might be coded as:

    d = data.table(a = 1:5, value = 2:6, key = "a")
    
    d[J(3), value, drop = TRUE]
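
Since `drop` is only a proposal here, the following is a minimal sketch of forms that already return a plain vector. The last line assumes a data.table version at or after the 1.9.3 change described in the first answer; the variable names are illustrative.

```r
library(data.table)

d <- data.table(a = 1:5, value = 2:6, key = "a")

# Join first, then extract the column from the one-row result.
v_chain <- d[J(3)][, value]

# Same idea, using list-style extraction on the join result.
v_list <- d[J(3)][["value"]]

# Assumption: from 1.9.3 on (per the first answer), the join itself
# returns a vector when j is a single column.
v_join <- d[J(3), value]
```

The first two forms do not depend on the version-specific default, which makes them the safer choice in code that must run on older installations.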
    
  • 2020-12-17 04:16

    Edit number Infinity: FAQ 1.12 answers your question exactly. (FAQ 1.13 is also useful and relevant, but is not pasted here.)

    1.12 What is the difference between X[Y] and merge(X,Y)?
    X[Y] is a join, looking up X's rows using Y (or Y's key if it has one) as an index. Y[X] is a join, looking up Y's rows using X (or X's key if it has one) as an index. merge(X,Y) does both ways at the same time. The number of rows of X[Y] and Y[X] usually differ; whereas the number of rows returned by merge(X,Y) and merge(Y,X) is the same. BUT that misses the main point. Most tasks require something to be done on the data after a join or merge. Why merge all the columns of data, only to use a small subset of them afterwards?
    You may suggest merge(X[,ColsNeeded1],Y[,ColsNeeded2]), but that takes copies of the subsets of data, and it requires the programmer to work out which columns are needed. X[Y,j] in data.table does all that in one step for you. When you write X[Y,sum(foo*bar)], data.table automatically inspects the j expression to see which columns it uses. It will only subset those columns only; the others are ignored. Memory is only created for the columns the j uses, and Y columns enjoy standard R recycling rules within the context of each group. Let's say foo is in X, and bar is in Y (along with 20 other columns in Y). Isn't X[Y,sum(foo*bar)] quicker to program and quicker to run than a merge followed by a subset?
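
The FAQ's point can be illustrated with a small sketch. The tables, column values, and the `unused` column below are hypothetical, and the last line assumes a data.table version that supports `by = .EACHI`.

```r
library(data.table)

# Hypothetical data: foo lives in X, bar in Y, and Y carries an unused column.
X <- data.table(id = c(1L, 1L, 2L), foo = c(10, 20, 30), key = "id")
Y <- data.table(id = c(1L, 2L), bar = c(2, 3), unused = c("u", "v"))

# Join and aggregate in one step; j only refers to foo and bar.
joined_sum <- X[Y, sum(foo * bar)]

# The merge-then-subset route reaches the same total via a full merged table.
merged_sum <- merge(X, Y, by = "id")[, sum(foo * bar)]

# One result per row of Y; bar is recycled within each group.
per_row <- X[Y, sum(foo * bar), by = .EACHI]
```

Both totals agree; the difference is that the merge route materialises every column of both tables before the sum is taken.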


    Old answer, which (according to the OP's comment) did nothing to answer the OP's question; retained here because I believe it does.

    When you give a value for j like d[, 4] or d[, value] in data.table, the j is evaluated as an expression. From the data.table FAQ 1.1 on accessing DT[, 5] (the very first FAQ) :

    Because, by default, unlike a data.frame, the 2nd argument is an expression which is evaluated within the scope of DT. 5 evaluates to 5.

    The first thing, therefore, to understand is, in your case:

    d[, value] # produces a "vector"
    # [1] 2 3 4 5 6
    

    This is not different when the query for i is a basic indexing like:

    d[3, value] # produces a vector of length 1
    # [1] 4
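
To make the "j is an expression" point concrete, here is a minimal sketch on the same table. The variable names are illustrative and not part of the original question.

```r
library(data.table)

d <- data.table(a = 1:5, value = 2:6, key = "a")

as_vector <- d[, value]        # j is a bare symbol: a plain integer vector
as_table  <- d[, list(value)]  # j wrapped in list(): a one-column data.table
total     <- d[, sum(value)]   # any expression over the columns works in j
third     <- d[3, value]       # row 3 by position, still a vector
```

Wrapping j in `list()` is what turns the result back into a data.table when no join is involved.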
    

    However, this is different when i is itself a data.table. From the data.table introduction (page 6):

    d[J(3)] # is equivalent to d[data.table(a = 3)]
    

    Here, you are performing a join. If you just do d[J(3)] then you'd get all columns corresponding to that join. If you do,

    d[J(3), value] # which is equivalent to d[J(3), list(value)]
    

    Since you say this answer does nothing to answer your question, I'll point out where I believe the answer to your "rephrased" question lies: with the query above you'd get just that column, but because you're performing a join, the key column is also output (it is a join between two tables based on the key column).


    Edit: Following your 2nd edit, if your question is "why so?", then I'd reluctantly (or rather ignorantly) answer that Matthew Dowle designed it this way to differentiate between a join-based subset and an index-based subsetting operation in data.table.

    Your second syntax is equivalent to:

    d[J(3)][, value] # is equivalent to:
    
    dd <- d[J(3)]
    dd[, value]
    

    where, again, in dd[, value], j is evaluated as an expression and therefore you get a vector.


    To answer your 3rd modified question: for the 3rd time, it's because it is a JOIN between two data.tables based on the key column. If I join two data.tables, I'd expect a data.table back.

    From data.table introduction, once again:

    Passing a data.table into a data.table subset is analogous to A[B] syntax in base R where A is a matrix and B is a 2-column matrix. In fact, the A[B] syntax in base R inspired the data.table package.
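
For readers unfamiliar with the base R idiom the quote refers to, a short sketch follows. The matrix contents are arbitrary illustration values.

```r
# A 3 x 4 matrix filled column by column with 1..12.
A <- matrix(1:12, nrow = 3)

# Each row of B is a (row, column) pair to look up in A.
B <- cbind(c(1, 3), c(2, 4))

# One element per row of B: A[1, 2] and A[3, 4].
picked <- A[B]
```

Each row of B acts as a lookup key into A, much as each row of the i table acts as a lookup into a keyed data.table.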

  • 2020-12-17 04:26

    I agree with Arun's answer. Here's another wording: After you do a join, you often will use the join column as a reference or as an input to further transformation. So you keep it, and you have an option to discard it with the (more roundabout) double [ syntax. From a design perspective, it is easier to keep frequently relevant information and then discard when desired, than to discard early and risk losing data that is difficult to reconstruct.

    Another reason that you'd want to keep the join column is that you can perform aggregate operations at the same time as you perform a join (the by-without-by). For example, the results here (from a data.table version before 1.9.3, where this grouping was implicit) are much clearer with the join column included:

    d <- data.table(a=rep.int(1:3,2),value=2:7,other=100:105,key="a")
    d[J(1:3),mean(value)]
    #   a  V1
    #1: 1 3.5
    #2: 2 4.5
    #3: 3 5.5
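
On data.table 1.9.3 or later, the first answer indicates that this grouped output needs an explicit `by = .EACHI`. A minimal sketch under that assumption, reusing the same table (the variable names are illustrative):

```r
library(data.table)

d <- data.table(a = rep.int(1:3, 2), value = 2:7, other = 100:105, key = "a")

# Without by, j is evaluated once over all matched rows.
overall <- d[J(1:3), mean(value)]

# With by = .EACHI, j is evaluated once per row of i and the join column is kept.
per_key <- d[J(1:3), mean(value), by = .EACHI]
```

Here `per_key` carries the `a` column alongside each mean, which is the clarity argument made above.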
    