R generate all possible interaction variables

既然无缘 2021-02-14 17:29

I have a dataframe with variables, say a,b,c,d

dat <- data.frame(a=runif(1e5), b=runif(1e5), c=runif(1e5), d=runif(1e5))

and would like to g

3 Answers
  •  渐次进展
    2021-02-14 18:06

    Assuming the expected output is the set of pairwise combinations of column names (from the comments, it should be a_b, a_c, etc.), we can use combn on the column names of the dataset with m set to 2.

    combn(colnames(dat), 2, FUN=paste, collapse='_')
    #[1] "a_b" "a_c" "a_d" "b_c" "b_d" "c_d"
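
    To see what combn hands to FUN, it can help to call it without FUN first. The following is only a small sketch on a made-up four-column data frame (the values are placeholders, not the OP's data): each column of the returned matrix is one pair of names, and pasting each column gives the same labels as above.

    ```r
    # Toy data frame; only the column names matter here
    dat <- data.frame(a = 1:2, b = 3:4, c = 5:6, d = 7:8)

    # Without FUN, combn returns a character matrix with one pair per column
    pairs <- combn(colnames(dat), 2)
    dim(pairs)    # 2 rows (m = 2), choose(4, 2) = 6 columns

    # Pasting each column reproduces the "a_b", "a_c", ... labels
    labels <- apply(pairs, 2, paste, collapse = "_")
    ```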
    

    To multiply each pair of columns in 'dat' instead, we subset the dataset with each element of the combn output (dat[,x[1]] and dat[,x[2]]), multiply (*) the two columns, convert the product to a 'data.frame' (data.frame()), and set its column name (setNames) by pasting the pair of column names together. Each result is returned in a list, and the list elements are then bound together with do.call(cbind, ...).

    do.call(cbind, combn(colnames(dat), 2, FUN = function(x)
      # one single-column data.frame per pair, named e.g. "a_b"
      list(setNames(data.frame(dat[, x[1]] * dat[, x[2]]),
                    paste(x, collapse = "_")))))
    #         a_b        a_c        a_d        b_c        b_d        c_d
    #1 0.26929788 0.17697473 0.26453066 0.55676619 0.83221898 0.54691008
    #2 0.06291005 0.08337501 0.04455453 0.10370775 0.05542008 0.07344851
    #3 0.53789990 0.47301970 0.03112880 0.51305076 0.03376319 0.02969076
    #4 0.41596384 0.34920860 0.25992717 0.53948322 0.40155468 0.33711187
    #5 0.16878584 0.21232357 0.09196025 0.08162171 0.03535148 0.04447027
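
    The same steps can also be spelled out one at a time. This is only a sketch of an equivalent formulation, not the code benchmarked below; the names pairs, products and interactions are introduced here for illustration. It uses simplify = FALSE so that combn returns a plain list of name pairs.

    ```r
    set.seed(24)
    dat <- data.frame(a = runif(5), b = runif(5), c = runif(5), d = runif(5))

    # All unordered pairs of column names, as a list of length-2 character vectors
    pairs <- combn(colnames(dat), 2, simplify = FALSE)

    # Multiply the two columns of each pair; each element is one numeric vector
    products <- lapply(pairs, function(p) dat[[p[1]]] * dat[[p[2]]])

    # Name each product after its source columns, then assemble a data.frame
    names(products) <- vapply(pairs, paste, character(1), collapse = "_")
    interactions <- as.data.frame(products)
    ```

    Keeping the pairs in a list avoids relying on how combn simplifies the value returned by FUN.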
    

    Benchmarks

    set.seed(494)
    dat <- data.frame(a=runif(1e6), b=runif(1e6), c=runif(1e6), d=runif(1e6))
    
    greg <- function()model.matrix( ~.^2, data=dat)
    akrun <- function() {do.call(cbind, combn(colnames(dat), 2, FUN= function(x) 
               list(setNames(data.frame(dat[,x[1]]*dat[,x[2]]), 
                paste(x, collapse="_")) )))}
    
    system.time(greg())
    #  user  system elapsed 
    #  1.159   0.024   1.182 
    
    system.time(akrun())
    #  user  system elapsed 
    #  0.013   0.000   0.013 
    
    library(microbenchmark)
    microbenchmark(greg(), akrun(), times=20L, unit='relative')
    # Unit: relative
    #   expr      min       lq     mean   median       uq      max neval cld
    # greg() 39.63122 38.53662 10.23198 18.81274 6.568741 4.642702    20   b
    # akrun()  1.00000  1.00000  1.00000  1.00000 1.000000 1.000000    20  a 
    

    NOTE: The benchmarks vary with the number of columns and the number of rows. Here, I am using the number of columns shown in the OP's post.
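
    When reading the timings, keep in mind that the two functions do not return the same object: model.matrix(~ .^2, ...) returns a matrix that also contains the intercept and the main effects, and it names the interaction columns a:b rather than a_b. A minimal sketch on the small example data, with mm and inter as illustrative names, shows how the interaction columns could be pulled out and compared:

    ```r
    set.seed(24)
    dat <- data.frame(a = runif(5), b = runif(5), c = runif(5), d = runif(5))

    # Full design matrix: intercept, main effects and two-way interactions
    mm <- model.matrix(~ .^2, data = dat)
    colnames(mm)

    # Keep only the interaction columns (their names contain ":")
    inter <- mm[, grepl(":", colnames(mm), fixed = TRUE), drop = FALSE]

    # Rename "a:b" to "a_b" to match the combn-based output
    colnames(inter) <- sub(":", "_", colnames(inter), fixed = TRUE)
    ```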

    Data

    set.seed(24)
    dat <- data.frame(a=runif(5), b=runif(5), c=runif(5), d=runif(5))
    
