I have a dataframe with variables, say a,b,c,d
dat <- data.frame(a=runif(1e5), b=runif(1e5), c=runif(1e5), d=runif(1e5))
and would like to g
What do you plan to do with all these interaction terms? There are several options; which is best will depend on what you are trying to do.
If you want to pass the interactions to a modeling function like lm or aov then it is very simple, just use the .^2 syntax:
fit <- lm( y ~ .^2, data=mydf )
The above will call lm and tell it to fit all the main effects and all 2-way interactions for the variables in mydf, excluding y.
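As a quick check of what .^2 expands to, here is a minimal sketch on made-up data (the data frame mydf and its columns y, a, b, c are assumptions for illustration, not from the question):

```r
# Toy data: a response y and three predictors (all names are illustrative)
set.seed(1)
mydf <- data.frame(y = rnorm(50), a = runif(50), b = runif(50), c = runif(50))

# '.' stands for every column of mydf except the response
fit <- lm(y ~ .^2, data = mydf)

# Spelling the predictors out should give the same set of terms
fit_explicit <- lm(y ~ (a + b + c)^2, data = mydf)

coef_names <- names(coef(fit))
print(coef_names)
```

The coefficient names list the intercept, the three main effects and the three pairwise interactions (a:b, a:c, b:c).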
If for some reason you really want to calculate all the interactions then you can use model.matrix:
tmp <- model.matrix( ~.^2, data=iris)
This will include a column for the intercept and columns for the main effects, but you can drop those if you don't want them.
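One way to drop the intercept and main-effect columns is to keep only the columns whose names contain a colon. This is a sketch that assumes none of the original column names contains ":" itself; the small dat below is a stand-in for the question's data:

```r
# Small stand-in for the question's data
set.seed(1)
dat <- data.frame(a = runif(5), b = runif(5), c = runif(5), d = runif(5))

mm <- model.matrix(~ .^2, data = dat)

# Interaction columns are named like "a:b"; the intercept and main effects are not
inter <- mm[, grepl(":", colnames(mm), fixed = TRUE), drop = FALSE]

print(colnames(inter))
```

For numeric columns, each interaction column is simply the elementwise product of the two variables.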
If you need something different from the modeling then you can use the combn function as @akrun mentions in the comments.
Assuming that the expected output is the combinations of column names (from the comments it should be a_b, a_c, etc.), we can use combn on the column names of the dataset and specify m as 2.
combn(colnames(dat), 2, FUN=paste, collapse='_')
#[1] "a_b" "a_c" "a_d" "b_c" "b_d" "c_d"
If we need to multiply the combinations of columns in 'dat', we subset the dataset using each element of the combn output of column names (dat[,x[1]], dat[,x[2]]), multiply (*) them, convert the result to a 'data.frame' (data.frame(...)), and set the column names (setNames) by pasting the combination of column names. We create the output in a list and cbind the list elements with do.call(cbind, ...).
do.call(cbind, combn(colnames(dat), 2, FUN= function(x)
list(setNames(data.frame(dat[,x[1]]*dat[,x[2]]),
paste(x, collapse="_")) )))
# a_b a_c a_d b_c b_d c_d
#1 0.26929788 0.17697473 0.26453066 0.55676619 0.83221898 0.54691008
#2 0.06291005 0.08337501 0.04455453 0.10370775 0.05542008 0.07344851
#3 0.53789990 0.47301970 0.03112880 0.51305076 0.03376319 0.02969076
#4 0.41596384 0.34920860 0.25992717 0.53948322 0.40155468 0.33711187
#5 0.16878584 0.21232357 0.09196025 0.08162171 0.03535148 0.04447027
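A somewhat shorter variant, offered as a sketch rather than a replacement for the code above: when FUN returns a vector, combn collects the results into a matrix with one column per combination, so the list and do.call(cbind, ...) steps can be skipped. It uses the same 5-row dat as the "data" section of this answer.

```r
set.seed(24)
dat <- data.frame(a = runif(5), b = runif(5), c = runif(5), d = runif(5))

# Each call to FUN returns a length-5 product; combn binds them into a 5 x 6 matrix
prods <- combn(names(dat), 2, FUN = function(x) dat[[x[1]]] * dat[[x[2]]])

# combn visits the pairs in the same order here, so the names line up
colnames(prods) <- combn(names(dat), 2, FUN = paste, collapse = "_")

prods <- as.data.frame(prods)
print(names(prods))
```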
set.seed(494)
dat <- data.frame(a=runif(1e6), b=runif(1e6), c=runif(1e6), d=runif(1e6))
greg <- function()model.matrix( ~.^2, data=dat)
akrun <- function() {do.call(cbind, combn(colnames(dat), 2, FUN= function(x)
list(setNames(data.frame(dat[,x[1]]*dat[,x[2]]),
paste(x, collapse="_")) )))}
system.time(greg())
# user system elapsed
# 1.159 0.024 1.182
system.time(akrun())
# user system elapsed
# 0.013 0.000 0.013
library(microbenchmark)
microbenchmark(greg(), akrun(), times=20L, unit='relative')
# Unit: relative
# expr min lq mean median uq max neval cld
# greg() 39.63122 38.53662 10.23198 18.81274 6.568741 4.642702 20 b
# akrun() 1.00000 1.00000 1.00000 1.00000 1.000000 1.000000 20 a
NOTE: The benchmarks vary with the number of columns and the number of rows. Here, I am using the number of columns shown in the OP's post.
set.seed(24)
dat <- data.frame(a=runif(5), b=runif(5), c=runif(5), d=runif(5))
Since model.matrix complains about factors with just one level, you might alternatively want to use stats::terms:
labels(terms(~.^2, data = iris[, 1:3]))
# [1] "Sepal.Length" "Sepal.Width" "Petal.Length"
# [4] "Sepal.Length:Sepal.Width" "Sepal.Length:Petal.Length" "Sepal.Width:Petal.Length"
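To illustrate the one-level case, here is a small sketch on a made-up data frame (df and its columns x, g, z are assumptions for illustration). model.matrix stops with an error because no contrasts can be built for g, while terms only expands the formula and never touches the factor levels:

```r
# g is a factor with a single level
df <- data.frame(x = c(1, 2, 3), g = factor(c("u", "u", "u")), z = c(4, 5, 6))

# Capture the error message instead of stopping the script
mm_result <- tryCatch(
  model.matrix(~ .^2, data = df),
  error = function(e) conditionMessage(e)
)

# terms() only needs the column names, so it works regardless of the levels
labs <- labels(terms(~ .^2, data = df))
print(labs)

# Optional: switch to the a_b naming used elsewhere in this thread
labs_underscore <- sub(":", "_", labs, fixed = TRUE)
```

Note that this gives the term labels only, not the computed interaction columns.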