Loop through data.table and create new columns based on some condition

无人共我 2021-01-13 04:34

I have a data.table with quite a few columns. I need to loop through them and create new columns based on some condition. Currently I am writing a separate line of condition for

3 Answers
  • 2021-01-13 05:18

    For completeness, dplyr's mutate_each offers a compact way to tackle such problems (note that mutate_each and funs are deprecated in recent dplyr releases in favor of across):

    library(dplyr)
    
    result <- DT %>%
        group_by(town,tc) %>%
        mutate_each(funs(mean,sd,
                         uplimit = (mean(.) + 1.96*sd(.)),
                         lowlimit = (mean(.) - 1.96*sd(.)),
                         Aoutlier = as.integer(. >= mean(.) - 1.96*sd(.) &
                                                   . <= mean(.) + 1.96*sd(.))),
                    -town,-tc)
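
    As a hedged sketch of the same idea in the newer across() style (dplyr 1.0.0 or later), the block below uses hypothetical toy data; the columns v1 and v2 and their values are assumptions made for illustration, not the asker's data.

    ```r
    library(dplyr)

    # Hypothetical toy data: two grouping columns and two numeric columns
    set.seed(1)
    toy <- data.frame(
      town = rep(c("A", "B"), each = 10),
      tc   = rep(c("x", "y"), times = 10),
      v1   = rnorm(20),
      v2   = rnorm(20)
    )

    # One named lambda per derived column; across() names the results
    # <column>_<function>, e.g. v1_mean, v1_uplimit
    result <- toy %>%
      group_by(town, tc) %>%
      mutate(across(c(v1, v2), list(
        mean     = ~ mean(.x),
        sd       = ~ sd(.x),
        uplimit  = ~ mean(.x) + 1.96 * sd(.x),
        lowlimit = ~ mean(.x) - 1.96 * sd(.x),
        # 1 when the value lies within mean +/- 1.96 sd of its group
        Aoutlier = ~ as.integer(.x >= mean(.x) - 1.96 * sd(.x) &
                                .x <= mean(.x) + 1.96 * sd(.x))
      ))) %>%
      ungroup()
    ```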
    
  • 2021-01-13 05:23

    Your data should probably be in long format:

    m = melt(DT, id=c("town","tc"))
    

    Then just write your test once

    m[, 
      is_outlier := +(abs(value-mean(value)) > 1.96*sd(value))
    , by=.(town, tc, variable)]
    

    I see no outliers in this data (according to the given definition of outlier):

    m[, .N, by=is_outlier] # this is a handy alternative to table()
    
    #    is_outlier   N
    # 1:          0 160
    

    How it works

    • melt keeps the id columns and stacks all the rest into
      • variable (column names)
      • value (column contents)
    • +x does the same thing as as.integer(x), coercing TRUE/FALSE to 1/0
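
    The two points above can be checked on a small example. This is only a sketch on a hypothetical three-row table (the name toy and its values are made up for illustration):

    ```r
    library(data.table)

    # Hypothetical wide table: two id columns and two measure columns
    toy <- data.table(town = c("A", "A", "B"),
                      tc   = c("x", "x", "y"),
                      v1   = c(1, 2, 3),
                      v2   = c(10, 20, 30))

    # melt keeps the id columns and stacks v1 and v2 into variable/value
    m_toy <- melt(toy, id.vars = c("town", "tc"))
    names(m_toy)   # "town" "tc" "variable" "value"
    nrow(m_toy)    # 6: three rows for v1 followed by three rows for v2

    # Unary plus coerces logical to integer, like as.integer()
    flags <- +c(TRUE, FALSE)
    flags          # 1 0
    ```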

    If you really like your data in wide format, though:

    vjs = setdiff(names(DT), c("town","tc"))
    DT[, 
      paste0(vjs,".out") := lapply(.SD, function(x) +(abs(x-mean(x)) > 1.96*sd(x)))
    , by=.(town, tc), .SDcols=vjs]
    
  • 2021-01-13 05:25

    We can do this using :=. We subset the column names that are not the grouping variables ('nm'). Create a vector of names to assign for the new columns using outer ('nm1'). Then, we use the OP's code, unlist the output and assign (:=) it to 'nm1' to create the new columns.

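    nm <- names(DT)[-(1:2)]
    
    # c() flattens the outer() matrix column by column, giving
    # Mean_<col>, SD_<col>, uplimit_<col>, lowlimit_<col> for each column in turn,
    # which matches the order of the unlisted per-column results below
    nm1 <- c(outer(c("Mean", "SD", "uplimit", "lowlimit"), nm, paste, sep="_"))
    
    DT[, (nm1) := unlist(lapply(.SD, function(x) {
        Mean <- mean(x)
        SD <- sd(x)
        uplimit <- Mean + 1.96*SD
        lowlimit <- Mean - 1.96*SD
        list(Mean, SD, uplimit, lowlimit)
    }), recursive=FALSE), by = .(town, tc)]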
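
    Because the new columns are assigned by position, the name vector has to follow the same order as the unlisted results: all statistics for the first column, then all statistics for the second, and so on. A minimal sketch on hypothetical toy data (toy, cols, stats and new_names are illustrative names, not from the question) checks that ordering:

    ```r
    library(data.table)

    # Hypothetical toy data with known group means
    toy <- data.table(town = rep(c("A", "B"), each = 4),
                      tc   = "x",
                      v1   = c(1, 2, 3, 4, 5, 6, 7, 8),
                      v2   = c(10, 20, 30, 40, 50, 60, 70, 80))

    cols  <- c("v1", "v2")
    stats <- c("Mean", "SD")

    # Column-major flattening: Mean_v1, SD_v1, Mean_v2, SD_v2
    new_names <- c(outer(stats, cols, paste, sep = "_"))

    # unlist(..., recursive = FALSE) returns v1's mean, v1's sd, v2's mean, v2's sd
    toy[, (new_names) := unlist(lapply(.SD, function(x) list(mean(x), sd(x))),
                                recursive = FALSE),
        by = .(town, tc), .SDcols = cols]

    toy[town == "A", unique(Mean_v1)]   # 2.5
    toy[town == "A", unique(Mean_v2)]   # 25
    ```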
    

    The second part of the question involves doing a logical comparison between columns. One option would be to subset the initial columns, the 'lowlimit' and 'uplimit' columns separately and do the comparison (as these have the same dimensions) to get a logical output which can be coerced to binary with +. Then assign it to the original dataset to create the outlier columns.

    m1 <- +(DT[, nm, with = FALSE] >= DT[, paste("lowlimit", nm, sep="_"), with = FALSE] &
            DT[, nm, with = FALSE] <= DT[, paste("uplimit", nm, sep="_"), with = FALSE])
    DT[, paste(nm, "Aoutlier", sep=".") := as.data.frame(m1)]
    

    Or instead of comparing data.tables, we can also use a for loop with set (which would be more efficient)

    nm2 <- paste(nm, "Aoutlier", sep=".")
    DT[, (nm2) := NA_integer_]
    for(j in nm){
     set(DT, i = NULL, j = paste(j, "Aoutlier", sep="."), 
       value = as.integer(DT[[j]] >= DT[[paste("lowlimit", j, sep="_")]] & 
               DT[[j]] <= DT[[paste("uplimit", j, sep="_")]]))
     }
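
    To see what the loop produces, here is a small sketch on a hypothetical table with hand-picked limits (the values are assumptions chosen so the result is easy to verify by eye). With the comparison used in this answer, 1 marks a value inside the limits:

    ```r
    library(data.table)

    # Hypothetical data: one measure column with fixed limits of 2 and 8
    toy <- data.table(v1          = c(1, 5, 9),
                      lowlimit_v1 = c(2, 2, 2),
                      uplimit_v1  = c(8, 8, 8))

    cols <- "v1"
    flag_names <- paste(cols, "Aoutlier", sep = ".")
    toy[, (flag_names) := NA_integer_]   # pre-allocate the result columns

    for (j in cols) {
      # set() updates the column by reference, avoiding the overhead of [.data.table
      set(toy, i = NULL, j = paste(j, "Aoutlier", sep = "."),
          value = as.integer(toy[[j]] >= toy[[paste("lowlimit", j, sep = "_")]] &
                             toy[[j]] <= toy[[paste("uplimit", j, sep = "_")]]))
    }

    toy[["v1.Aoutlier"]]   # 0 1 0
    ```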
    

    The 'log' columns can also be created with :=

    DT[,paste(nm, "log", sep=".") := lapply(.SD,log),by = .(town,tc),.SDcols=nm]
    