R generate all possible interaction variables

既然无缘 2021-02-14 17:29

I have a data frame with variables, say a, b, c, d:

dat <- data.frame(a=runif(1e5), b=runif(1e5), c=runif(1e5), d=runif(1e5))

and would like to g

3 Answers
  • 2021-02-14 18:05

    What do you plan to do with all these interaction terms? There are several options; which one is best depends on what you are trying to do.

    If you want to pass the interactions to a modeling function like lm or aov, it is very simple: just use the .^2 syntax:

    fit <- lm( y ~ .^2, data=mydf )
    

    The above calls lm and tells it to fit all the main effects and all two-way interactions for the variables in mydf other than y.
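
    As a minimal sketch of what the .^2 formula expands to, the example below fits a model on simulated data and lists the coefficient names. The data frame mydf and its columns a, b, c and y are made up here for illustration; they are not from the question.

    ```r
    # Hypothetical data: three predictors and a response with one true interaction
    set.seed(1)
    mydf <- data.frame(a = runif(50), b = runif(50), c = runif(50))
    mydf$y <- 1 + 2 * mydf$a - mydf$b + 3 * mydf$a * mydf$b + rnorm(50, sd = 0.1)

    # .^2 expands to all main effects plus all two-way interactions of a, b, c
    fit <- lm(y ~ .^2, data = mydf)
    names(coef(fit))
    # [1] "(Intercept)" "a" "b" "c" "a:b" "a:c" "b:c"
    ```

    The interaction columns are never materialised in mydf itself; lm builds them internally from the formula.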

    If for some reason you really want to calculate all the interactions then you can use model.matrix:

    tmp <- model.matrix( ~.^2, data=iris)
    

    This will include a column for the intercept and columns for the main effects, but you can drop those if you don't want them.
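
    One possible way to drop the intercept and main-effect columns is to keep only the columns whose names contain a colon, since model.matrix names interaction columns like a:b. This is a sketch that assumes all columns are numeric and that no original column name contains ":"; the small data frame dat is generated here for illustration.

    ```r
    set.seed(1)
    dat <- data.frame(a = runif(5), b = runif(5), c = runif(5), d = runif(5))

    mm <- model.matrix(~ .^2, data = dat)

    # Keep only the interaction columns (their names contain ":")
    inter <- mm[, grepl(":", colnames(mm), fixed = TRUE), drop = FALSE]

    dim(inter)       # 5 rows, choose(4, 2) = 6 interaction columns
    colnames(inter)
    ```

    For numeric variables each retained column is simply the elementwise product of the two variables involved.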

    If you need something different from the modeling then you can use the combn function as @akrun mentions in the comments.

  • 2021-02-14 18:06

    Assuming the expected output is the pairwise combinations of column names (from the comments, it should be a_b, a_c, etc.), we can use combn on the column names of the dataset with m set to 2.

    combn(colnames(dat), 2, FUN=paste, collapse='_')
    #[1] "a_b" "a_c" "a_d" "b_c" "b_d" "c_d"
    

    If we need to multiply the combinations of columns in 'dat', we subset the dataset using each element of the combn output of column names (dat[,x[1]], dat[,x[2]]), multiply them (*), convert the result to a 'data.frame' (data.frame()), and set the column name (setNames) by pasting the pair of column names together. Each result is wrapped in a list, and the list elements are then combined with do.call(cbind, ...).

    do.call(cbind, combn(colnames(dat), 2, FUN= function(x) 
                    list(setNames(data.frame(dat[,x[1]]*dat[,x[2]]), 
                     paste(x, collapse="_")) )))
    #         a_b        a_c        a_d        b_c        b_d        c_d
    #1 0.26929788 0.17697473 0.26453066 0.55676619 0.83221898 0.54691008
    #2 0.06291005 0.08337501 0.04455453 0.10370775 0.05542008 0.07344851
    #3 0.53789990 0.47301970 0.03112880 0.51305076 0.03376319 0.02969076
    #4 0.41596384 0.34920860 0.25992717 0.53948322 0.40155468 0.33711187
    #5 0.16878584 0.21232357 0.09196025 0.08162171 0.03535148 0.04447027
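
    The same steps can also be written out one at a time, which may be easier to read. This is a variant sketch, not the code benchmarked below; the helper names pairs, prods and inter are introduced here for illustration.

    ```r
    set.seed(24)
    dat <- data.frame(a = runif(5), b = runif(5), c = runif(5), d = runif(5))

    # All pairs of column names, as a list of length-2 character vectors
    pairs <- combn(names(dat), 2, simplify = FALSE)

    # Elementwise product of the two columns in each pair
    prods <- lapply(pairs, function(p) dat[[p[1]]] * dat[[p[2]]])

    # Name each product after its pair, e.g. "a_b"
    names(prods) <- vapply(pairs, paste, character(1), collapse = "_")

    inter <- as.data.frame(prods)
    names(inter)
    # [1] "a_b" "a_c" "a_d" "b_c" "b_d" "c_d"
    ```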
    

    Benchmarks

    set.seed(494)
    dat <- data.frame(a=runif(1e6), b=runif(1e6), c=runif(1e6), d=runif(1e6))
    
    greg <- function()model.matrix( ~.^2, data=dat)
    akrun <- function() {do.call(cbind, combn(colnames(dat), 2, FUN= function(x) 
               list(setNames(data.frame(dat[,x[1]]*dat[,x[2]]), 
                paste(x, collapse="_")) )))}
    
    system.time(greg())
    #  user  system elapsed 
    #  1.159   0.024   1.182 
    
    system.time(akrun())
    #  user  system elapsed 
    #  0.013   0.000   0.013 
    
    library(microbenchmark)
    microbenchmark(greg(), akrun(), times=20L, unit='relative')
    # Unit: relative
    #   expr      min       lq     mean   median       uq      max neval cld
    # greg() 39.63122 38.53662 10.23198 18.81274 6.568741 4.642702    20   b
    # akrun()  1.00000  1.00000  1.00000  1.00000 1.000000 1.000000    20  a 
    

    NOTE: The benchmark results vary with the number of columns and rows. Here, I am using the number of columns shown in the OP's post.

    data

    set.seed(24)
    dat <- data.frame(a=runif(5), b=runif(5), c=runif(5), d=runif(5))
    
  • 2021-02-14 18:06

    Since model.matrix complains about factors with just one level, you might alternatively want to use stats::terms, which returns the term labels without building the design matrix:

    labels(terms(~.^2, data = iris[, 1:3]))
    # [1] "Sepal.Length"              "Sepal.Width"               "Petal.Length"             
    # [4] "Sepal.Length:Sepal.Width"  "Sepal.Length:Petal.Length" "Sepal.Width:Petal.Length"
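
    To illustrate the difference, here is a small sketch with a made-up data frame df that contains a single-level factor f. The expectation is that model.matrix fails when it tries to set contrasts for f, while terms still returns the labels.

    ```r
    # Hypothetical data: f is a factor with only one level
    df <- data.frame(x = 1:3, y = c(2, 4, 6), f = factor(rep("only", 3)))

    # model.matrix cannot build contrasts for f; capture the error message
    res <- tryCatch(model.matrix(~ .^2, data = df),
                    error = function(e) conditionMessage(e))
    is.character(res)   # TRUE if an error was caught

    # terms only expands the formula, so it does not look at factor levels
    labs <- labels(terms(~ .^2, data = df))
    labs
    ```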
    