Select / assign to data.table when variable names are stored in a character vector

后端 未结 5 1380
無奈伤痛
無奈伤痛 2020-11-22 03:54

How do you refer to variables in a data.table if the variable names are stored in a character vector? For instance, this works for a data.frame:

相关标签:
5条回答
  • 2020-11-22 04:18

    Retrieve multiple columns from data.table via variable or function:

    library(data.table)
    
    x <- data.table(this=1:2,that=1:2,whatever=1:2)
    
    # === explicit call
    x[, .(that, whatever)]
    x[, c('that', 'whatever')]
    
    # === indirect via  variable
    # ... direct assignment
    mycols <- c('that','whatever')
    # ... same as result of a function call
    mycols <- grep('a', colnames(x), value=TRUE)
    
    x[, ..mycols]
    x[, .SD, .SDcols=mycols]
    
    # === direct 1-liner usage
    x[, .SD, .SDcols=c('that','whatever')]
    x[, .SD, .SDcols=grep('a', colnames(x), value=TRUE)]
    

    which all yield

       that whatever
    1:    1        1
    2:    2        2
    

    I find the .SDcols way the most elegant.

    0 讨论(0)
  • 2020-11-22 04:22

    For multiple columns and a function applied on column values.

    When updating the values from a function, the RHS must be a list object, so using a loop on .SD with lapply will do the trick.

    The example below converts integer columns to numeric columns

    a1 <- data.table(a=1:5, b=6:10, c1=letters[1:5])
    sapply(a1, class)  # show classes of columns
    #         a           b          c1 
    # "integer"   "integer" "character" 
    
    # column name character vector
    nm <- c("a", "b")
    
    # Convert columns a and b to numeric type
    a1[, j = (nm) := lapply(.SD, as.numeric ), .SDcols = nm ]
    
    sapply(a1, class)
    #         a           b          c1 
    # "numeric"   "numeric" "character" 
    
    0 讨论(0)
  • 2020-11-22 04:27

    You could try this

    colname <- as.name("COL_NAME")

    DT2 <- DT[, list(COL_SUM=sum(eval(colname, .SD))), by = c(group)]

    0 讨论(0)
  • 2020-11-22 04:41

    *This is not an answer really, but I don't have enough street cred to post comments :/

    Anyway, for anyone who might be looking to actually create a new column in a data table with a name stored in a variable, I've got the following to work. I have no clue as to it's performance. Any suggestions for improvement? Is it safe to assume a nameless new column will always be given the name V1?

    colname <- as.name("users")
    # Google Analytics query is run with chosen metric and resulting data is assigned to DT
    DT2 <- DT[, sum(eval(colname, .SD)), by = country]
    setnames(DT2, "V1", as.character(colname))
    

    Notice I can reference it just fine in the sum() but can't seem to get it to assign in the same step. BTW, the reason I need to do this is colname will be based on user input in a Shiny app.

    0 讨论(0)
  • 2020-11-22 04:45

    Two ways to programmatically select variable(s):

    1. with = FALSE:

       DT = data.table(col1 = 1:3)
       colname = "col1"
       DT[, colname, with = FALSE] 
       #    col1
       # 1:    1
       # 2:    2
       # 3:    3
      
    2. 'dot dot' (..) prefix:

       DT[, ..colname]    
       #    col1
       # 1:    1
       # 2:    2
       # 3:    3
      

    For further description of the 'dot dot' (..) notation, see New Features in 1.10.2 (it is currently not described in help text).

    To assign to variable(s), wrap the LHS of := in parentheses:

    DT[, (colname) := 4:6]    
    #    col1
    # 1:    4
    # 2:    5
    # 3:    6
    

    The latter is known as a column plonk, because you replace the whole column vector by reference. If a subset i was present, it would subassign by reference. The parens around (colname) is a shorthand introduced in version v1.9.4 on CRAN Oct 2014. Here is the news item:

    Using with = FALSE with := is now deprecated in all cases, given that wrapping the LHS of := with parentheses has been preferred for some time.

    colVar = "col1"
    
    DT[, (colVar) := 1]                             # please change to this
    DT[, c("col1", "col2") := 1]                    # no change
    DT[, 2:4 := 1]                                  # no change
    DT[, c("col1","col2") := list(sum(a), mean(b))]  # no change
    DT[, `:=`(...), by = ...]                       # no change
    

    See also Details section in ?`:=`:

    DT[i, (colnamevector) := value]
    # [...] The parens are enough to stop the LHS being a symbol
    

    And to answer further question in comment, here's one way (as usual there are many ways) :

    DT[, colname := cumsum(get(colname)), with = FALSE]
    #    col1
    # 1:    4
    # 2:    9
    # 3:   15 
    

    or, you might find it easier to read, write and debug just to eval a paste, similar to constructing a dynamic SQL statement to send to a server :

    expr = paste0("DT[,",colname,":=cumsum(",colname,")]")
    expr
    # [1] "DT[,col1:=cumsum(col1)]"
    
    eval(parse(text=expr))
    #    col1
    # 1:    4
    # 2:   13
    # 3:   28
    

    If you do that a lot, you can define a helper function EVAL :

    EVAL = function(...)eval(parse(text=paste0(...)),envir=parent.frame(2))
    
    EVAL("DT[,",colname,":=cumsum(",colname,")]")
    #    col1
    # 1:    4
    # 2:   17
    # 3:   45
    

    Now that data.table 1.8.2 automatically optimizes j for efficiency, it may be preferable to use the eval method. The get() in j prevents some optimizations, for example.

    Or, there is set(). A low overhead, functional form of :=, which would be fine here. See ?set.

    set(DT, j = colname, value = cumsum(DT[[colname]]))
    DT
    #    col1
    # 1:    4
    # 2:   21
    # 3:   66
    
    0 讨论(0)
提交回复
热议问题