Loop through data.table and create new columns based on some condition


I have a data.table with quite a few columns. I need to loop through them and create new columns based on some condition. Currently I am writing a separate line of condition for

3 answers

离开以前

2021-01-13 05:18

For completeness, dplyr's mutate_each provides a handy way of tackling such problems (note that mutate_each and funs have since been deprecated in dplyr in favour of across):

library(dplyr)

result <- DT %>%
    group_by(town,tc) %>%
    mutate_each(funs(mean,sd,
                     uplimit = (mean(.) + 1.96*sd(.)),
                     lowlimit = (mean(.) - 1.96*sd(.)),
                     Aoutlier = as.integer(. >= mean(.) - 1.96*sd(.) &
                                               . <= mean(.) + 1.96*sd(.))),
                -town,-tc)
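
With a recent dplyr the same idea can be written with across. The following is only a sketch, assuming dplyr 1.0.0 or later; the toy data frame and its column v1 are hypothetical stand-ins for DT, and the new columns get the default across names (v1_mean, v1_sd, and so on) rather than the names mutate_each would produce.

```r
library(dplyr)

# Hypothetical toy data standing in for DT
toy <- data.frame(
  town = c("A", "A", "B", "B"),
  tc   = c("x", "x", "y", "y"),
  v1   = c(1, 3, 5, 9)
)

result <- toy %>%
  group_by(town, tc) %>%
  # everything() skips the grouping columns inside a grouped mutate()
  mutate(across(
    everything(),
    list(
      mean     = ~ mean(.x),
      sd       = ~ sd(.x),
      uplimit  = ~ mean(.x) + 1.96 * sd(.x),
      lowlimit = ~ mean(.x) - 1.96 * sd(.x),
      # 1 when the value lies inside [lowlimit, uplimit], 0 otherwise
      Aoutlier = ~ as.integer(.x >= mean(.x) - 1.96 * sd(.x) &
                              .x <= mean(.x) + 1.96 * sd(.x))
    )
  )) %>%
  ungroup()
```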


抹茶落季

2021-01-13 05:23
Your data should probably be in long format:
```
m = melt(DT, id=c("town","tc"))
```
Then just write your test once
```
m[, 
  is_outlier := +(abs(value-mean(value)) > 1.96*sd(value))
, by=.(town, tc, variable)]
```
I see no outliers in this data (according to the given definition of outlier):
```
m[, .N, by=is_outlier] # this is a handy alternative to table()

#    is_outlier   N
# 1:          0 160
```
How it works
- melt keeps the id columns and stacks all the rest into
  - variable (column names)
  - value (column contents)
- +x does the same thing as as.integer(x), coercing TRUE/FALSE to 1/0
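The points above can be tried on a small example. This is a minimal sketch on hypothetical toy data (the table toy and its columns v1 and v2 are made up for illustration); it only shows the shape of the melted table and the 0/1 coercion:
```r
library(data.table)

# Hypothetical toy data: two id columns and two measurement columns
toy <- data.table(
  town = c("A", "A", "A", "B", "B", "B"),
  tc   = c("x", "x", "x", "y", "y", "y"),
  v1   = c(1, 2, 3, 4, 5, 6),
  v2   = c(10, 20, 30, 40, 50, 60)
)

# Stack v1 and v2 into a (variable, value) pair, keeping the id columns
long <- melt(toy, id.vars = c("town", "tc"))

# Unary plus coerces the logical test to 0/1
long[, is_outlier := +(abs(value - mean(value)) > 1.96 * sd(value)),
     by = .(town, tc, variable)]

dim(long)  # 6 original rows x 2 measure columns gives 12 rows
```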
If you really like your data in wide format, though:
```
vjs = setdiff(names(DT), c("town","tc"))
DT[, 
  paste0(vjs,".out") := lapply(.SD, function(x) +(abs(x-mean(x)) > 1.96*sd(x)))
, by=.(town, tc), .SDcols=vjs]
```

情歌与酒

2021-01-13 05:25

We can do this using :=. First, subset the column names that are not the grouping variables ('nm'). Then create the vector of names for the new columns using outer ('nm1'). Finally, run the OP's code inside lapply, unlist the output by one level, and assign (:=) it to 'nm1' to create the new columns.

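nm <- names(DT)[-(1:2)]

# The names must follow the order of the unlisted result: all four
# statistics for the first column, then all four for the second, and so on
nm1 <- c(outer(c("Mean", "SD", "uplimit", "lowlimit"), nm, paste, sep="_"))

DT[, (nm1) := unlist(lapply(.SD, function(x) {
         Mean <- mean(x)
         SD <- sd(x)
         uplimit <- Mean + 1.96*SD
         lowlimit <- Mean - 1.96*SD
         list(Mean, SD, uplimit, lowlimit)
       }), recursive=FALSE),
   by = .(town, tc)]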
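
Because the values are assigned to 'nm1' purely by position, it is worth checking that names and values line up. A minimal sketch on hypothetical toy data (toy, v1 and v2 are made up for illustration) compares the new columns with directly computed group statistics:

```r
library(data.table)

# Hypothetical toy data with the grouping columns first
toy <- data.table(
  town = c("A", "A", "B", "B"),
  tc   = c("x", "x", "y", "y"),
  v1   = c(1, 3, 5, 9),
  v2   = c(10, 30, 50, 90)
)

nm <- names(toy)[-(1:2)]
stats <- c("Mean", "SD", "uplimit", "lowlimit")

# Column-major order of the 4 x length(nm) matrix: every statistic for v1,
# then every statistic for v2, which is the order unlist() produces below
nm1 <- c(outer(stats, nm, paste, sep = "_"))

toy[, (nm1) := unlist(lapply(.SD, function(x) {
  m <- mean(x)
  s <- sd(x)
  list(m, s, m + 1.96 * s, m - 1.96 * s)
}), recursive = FALSE), by = .(town, tc)]

# Direct per-group computation to compare against Mean_v2
check <- toy[, .(direct = mean(v2)), by = .(town, tc)]
```

Here Mean_v2 should equal check$direct within each group; if it held the values of some other statistic, the name order would be wrong.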

The second part of the question involves doing a logical comparison between columns. One option would be to subset the initial columns, the 'lowlimit' and 'uplimit' columns separately and do the comparison (as these have the same dimensions) to get a logical output which can be coerced to binary with +. Then assign it to the original dataset to create the outlier columns.

m1 <- +(DT[, nm, with = FALSE] >= DT[, paste("lowlimit", nm, sep="_"), with = FALSE] &
        DT[, nm, with = FALSE] <= DT[, paste("uplimit", nm, sep="_"), with = FALSE])
DT[, paste(nm, "Aoutlier", sep=".") := as.data.frame(m1)]

Alternatively, instead of comparing data.tables, we can use a for loop with set, which would be more efficient:

nm2 <- paste(nm, "Aoutlier", sep=".")
DT[, (nm2) := NA_integer_]
for(j in nm){
 set(DT, i = NULL, j = paste(j, "Aoutlier", sep="."), 
   value = as.integer(DT[[j]] >= DT[[paste("lowlimit", j, sep="_")]] & 
           DT[[j]] <= DT[[paste("uplimit", j, sep="_")]]))
 }
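
The loop can be tried in isolation. This is a minimal sketch, assuming the limit columns already exist; the table toy and its values are hypothetical and chosen so that the result is easy to verify by eye:

```r
library(data.table)

# Hypothetical toy data: one value column plus precomputed limit columns
toy <- data.table(
  v1          = c(1, 5, 10),
  lowlimit_v1 = c(2, 2, 2),
  uplimit_v1  = c(8, 8, 8)
)

nm <- "v1"
for (j in nm) {
  # set() adds or overwrites the column by reference
  set(toy, i = NULL, j = paste(j, "Aoutlier", sep = "."),
      value = as.integer(toy[[j]] >= toy[[paste("lowlimit", j, sep = "_")]] &
                         toy[[j]] <= toy[[paste("uplimit", j, sep = "_")]]))
}

# 1 marks values inside the limits, following the comparison used above
toy[["v1.Aoutlier"]]  # 0 1 0
```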

The 'log' columns can also be created with :=

DT[,paste(nm, "log", sep=".") := lapply(.SD,log),by = .(town,tc),.SDcols=nm]
