Can I programmatically update the type of a set of columns (to factors) in data.table?

ⅰ亾dé卋堺 submitted on 2019-12-23 09:50:06

Question


I would like to modify a set of columns inside a data.table to be factors. If I knew the names of the columns in advance, I think this would be straightforward.

library(data.table)
dt1  <- data.table(a = 1:4, b = rep(c('a','b'), 2), c = rep(c(0,1), 2))
dt1[,class(b)]
dt1[,b:=factor(b)]
dt1[,class(b)]

But I don't; instead, I have a vector of the variable names:

vars.factors  <- c('b','c')

I can apply the factor function to them without a problem ...

lapply(vars.factors, function(x) dt1[,class(get(x))])
lapply(vars.factors, function(x) dt1[,factor(get(x))])

But I don't know how to re-assign or update the original column in the data table.

This fails ...

  lapply(vars.factors, function(x) dt1[,x:=factor(get(x))])
  # Error in get(x) : invalid first argument 

As does this ...

  lapply(vars.factors, function(x) dt1[,get(x):=factor(get(x))])
  # Error in get(x) : object 'b' not found 

NB. I tried the answer proposed here without any luck.


Answer 1:


Yes, this is fairly straightforward:

dt1[, (vars.factors) := lapply(.SD, as.factor), .SDcols=vars.factors]

On the LHS (of := in j), we specify the names of the columns. If a column already exists, it is updated; otherwise, a new column is created. On the RHS, we loop over all the columns in .SD (which stands for Subset of Data), and we choose which columns .SD contains with the .SDcols argument.
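As a minimal sketch of the one-liner above, the following rebuilds the question's example table and compares the column classes before and after the update. The names before and after are introduced here purely for illustration.

```r
library(data.table)

# Example data from the question: b is character, c is numeric.
dt1 <- data.table(a = 1:4, b = rep(c("a", "b"), 2), c = rep(c(0, 1), 2))
vars.factors <- c("b", "c")

# Column classes before the conversion.
before <- sapply(dt1, class)

# Update the listed columns by reference; column a is left untouched.
dt1[, (vars.factors) := lapply(.SD, as.factor), .SDcols = vars.factors]

# Column classes after the conversion.
after <- sapply(dt1, class)

print(before)
print(after)
```

Because := modifies dt1 by reference, no reassignment such as dt1 <- ... is needed.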

Following up on comment:

Note that the LHS must be wrapped in () so that it is evaluated and the column names stored in the vars.factors variable are used. This is because we allow the syntax

DT[, col := value]

when there's only one column to assign, by specifying the column name as a symbol (without quotes), purely for convenience. This creates a column named col and assigns value to it.

To tell these two cases apart, we need the (). Wrapping the variable in () is enough to indicate that we want the values stored in the variable rather than a column named after it.
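The difference between the two forms can be illustrated with a small sketch. The table DT and the variable new.col are hypothetical names used only for this example.

```r
library(data.table)

DT <- data.table(x = 1:3)
new.col <- "y"   # hypothetical variable holding the target column name

# Bare symbol on the LHS: creates a column literally named "new.col".
DT[, new.col := x * 2L]

# Parenthesised LHS: the variable is evaluated, so the column is named "y".
DT[, (new.col) := x * 10L]

print(names(DT))
```

After both assignments, DT should contain the columns x, new.col and y, which shows that only the parenthesised form looked inside the variable.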




Answer 2:


Using a data frame (the printed output appears to come from a dt1 whose columns differ from the question's example):

> df1 = data.frame(dt1)
> df1[vars.factors] = lapply(df1[vars.factors], factor)
> dt1 = data.table(df1)

> dt1
   a b c
1: 1 1 b
2: 2 2 c
3: 3 3 b
4: 4 4 c

> str(dt1)
Classes ‘data.table’ and 'data.frame':  4 obs. of  3 variables:
 $ a: int  1 2 3 4
 $ b: Factor w/ 4 levels "1","2","3","4": 1 2 3 4
 $ c: Factor w/ 2 levels "b","c": 1 2 1 2
 - attr(*, ".internal.selfref")=<externalptr> 


Source: https://stackoverflow.com/questions/26299159/can-i-programmatically-update-the-type-of-a-set-of-columns-to-factors-in-data
