Let me show an example. Suppose we have 3 tables (focus on column N):
Table 1        Table 2        Table 3
-------------  -------------  -------------
 N  Values      N  Values      N  Values
 5       1      5      -1      5       1
10       2      6      -2      6      21
15       3     10      -3     10       5
               15      -4     12       6
                              15       3
I would like to propose a generic approach which works for an arbitrary number of dataframes as well as for multiple id columns.
The dataframes may differ in structure, i.e., in the number and types of their other columns. The only requirement is that every dataframe contains all id columns, with the same names and types. In addition, the code detects when there are no common combinations of id values between the dataframes.
Suppose we have a list of dataframes dfl and a vector of column names cn which should be checked for common value combinations across all dataframes in the list:
# name the list elements so that the result keeps the table names
dfl <- list(Table1 = Table1, Table2 = Table2, Table3 = Table3)
cn <- "N"
library(data.table)
# determine common combinations of id values
common <- rbindlist(lapply(dfl, function(x) setDT(x)[, .SD, .SDcols = cn]))[
, .(.cnt = .N), by = cn][.cnt == length(dfl)][, -".cnt"]
# stop if there are no common id values
stopifnot(nrow(common) > 0L)
# join with all data tables in dfl, keeping only rows which have common id values
result <- lapply(dfl, function(x) x[common, on = cn, nomatch = 0L])
result
$Table1
    N Values
1:  5      1
2: 10      2
3: 15      3

$Table2
    N Values
1:  5     -1
2: 10     -3
3: 15     -4

$Table3
    N Values
1:  5      1
2: 10      5
3: 15      3
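To illustrate the check for missing common id values mentioned above, here is a minimal sketch. The list dfl_disjoint and its contents are hypothetical toy data, not part of the original question, and as.data.table() is used instead of setDT() so that the toy data frames are copied rather than converted by reference.

```r
library(data.table)

# hypothetical toy data: the two tables share no value of N
dfl_disjoint <- list(
  T1 = data.frame(N = c(1L, 2L), Values = 1:2),
  T2 = data.frame(N = c(3L, 4L), Values = 3:4)
)
cn <- "N"

# same counting step as above, applied to the disjoint tables
common <- rbindlist(
  lapply(dfl_disjoint, function(x) as.data.table(x)[, .SD, .SDcols = cn])
)[, .(.cnt = .N), by = cn][.cnt == length(dfl_disjoint)][, -".cnt"]

# no value of N occurs in both tables, so common has no rows
nrow(common)

# stopifnot() signals an error here; try() captures it so the script continues
check <- try(stopifnot(nrow(common) > 0L), silent = TRUE)
inherits(check, "try-error")
```

In a script, the bare stopifnot() call would halt execution at this point, which is the intended behaviour when there is nothing to join on.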
Sample data for the first example (Table1, Table2, Table3):
dfl <- list(
  Table1 = data.frame(N = c(5L, 10L, 15L),
                      Values = 1:3),
  Table2 = data.frame(N = c(5L, 6L, 10L, 15L),
                      Values = c(-1L, -2L, -3L, -4L)),
  Table3 = data.frame(N = c(5L, 6L, 10L, 12L, 15L),
                      Values = c(1L, 21L, 5L, 6L, 3L))
)
# create sample data: 5 dataframes with 100 rows each and 3 id columns
set.seed(123L)
ndf <- 5L
dfl <- lapply(seq_len(ndf), function(i) {
  nr <- 100L
  nseq <- 1:6
  data.frame(A = sample(LETTERS[nseq], nr, replace = TRUE),
             b = sample(letters[nseq], nr, replace = TRUE),
             i = sample(nseq, nr, replace = TRUE),
             val = sample.int(nr, nr),
             # keep A and b as factors (the default before R 4.0), as in the str() output
             stringsAsFactors = TRUE)
})
dfl <- setNames(dfl, paste0("df", seq_along(dfl)))
str(dfl)
List of 5
 $ df1:'data.frame':	100 obs. of  4 variables:
  ..$ A  : Factor w/ 6 levels "A","B","C","D",..: 2 5 3 6 6 1 4 6 4 3 ...
  ..$ b  : Factor w/ 6 levels "a","b","c","d",..: 4 2 3 6 3 6 6 4 3 1 ...
  ..$ i  : int [1:100] 2 6 4 4 3 6 3 2 2 2 ...
  ..$ val: int [1:100] 79 1 77 71 61 46 15 99 42 45 ...
 $ df2:'data.frame':	100 obs. of  4 variables:
  ..$ A  : Factor w/ 6 levels "A","B","C","D",..: 6 1 6 4 3 3 5 1 3 5 ...
  ..$ b  : Factor w/ 6 levels "a","b","c","d",..: 3 3 2 1 3 2 4 4 6 3 ...
  ..$ i  : int [1:100] 2 5 2 2 2 5 1 5 2 3 ...
  ..$ val: int [1:100] 85 26 3 84 33 61 52 36 18 40 ...
 $ df3:'data.frame':	100 obs. of  4 variables:
  ..$ A  : Factor w/ 6 levels "A","B","C","D",..: 3 3 1 1 2 6 3 3 5 5 ...
  ..$ b  : Factor w/ 6 levels "a","b","c","d",..: 6 4 6 4 5 4 5 6 5 1 ...
  ..$ i  : int [1:100] 2 4 1 6 6 3 5 2 1 3 ...
  ..$ val: int [1:100] 81 73 22 99 84 51 57 88 93 61 ...
 $ df4:'data.frame':	100 obs. of  4 variables:
  ..$ A  : Factor w/ 6 levels "A","B","C","D",..: 6 6 3 5 3 6 1 1 5 4 ...
  ..$ b  : Factor w/ 6 levels "a","b","c","d",..: 1 3 4 6 5 4 1 1 5 1 ...
  ..$ i  : int [1:100] 2 2 1 3 2 5 4 6 1 6 ...
  ..$ val: int [1:100] 94 98 45 23 67 53 55 41 40 100 ...
 $ df5:'data.frame':	100 obs. of  4 variables:
  ..$ A  : Factor w/ 6 levels "A","B","C","D",..: 4 1 2 5 5 1 6 1 4 3 ...
  ..$ b  : Factor w/ 6 levels "a","b","c","d",..: 5 1 3 6 6 5 1 4 6 4 ...
  ..$ i  : int [1:100] 1 6 2 5 4 1 6 4 6 4 ...
  ..$ val: int [1:100] 45 28 16 85 54 53 56 68 59 94 ...
# define id columns
cn <- c("i", "A", "b")
common <- rbindlist(lapply(dfl, function(x) setDT(x)[, .SD, .SDcols = cn]))[
, .(.cnt = .N), by = cn][.cnt == length(dfl)][, -".cnt"]
stopifnot(nrow(common) > 0L)
result <- lapply(dfl, function(x) x[common, on = cn, nomatch = 0L])
str(result)
List of 5
 $ df1:Classes ‘data.table’ and 'data.frame':	10 obs. of  4 variables:
  ..$ A  : Factor w/ 6 levels "A","B","C","D",..: 6 6 6 6 6 6 4 2 1 5
  ..$ b  : Factor w/ 6 levels "a","b","c","d",..: 4 4 4 6 6 3 2 3 4 2
  ..$ i  : int [1:10] 2 2 2 3 3 6 5 6 4 1
  ..$ val: int [1:10] 99 85 4 36 83 70 12 52 53 58
  ..- attr(*, ".internal.selfref")=<externalptr>
 $ df2:Classes ‘data.table’ and 'data.frame':	11 obs. of  4 variables:
  ..$ A  : Factor w/ 6 levels "A","B","C","D",..: 6 6 4 4 2 1 5 5 4 1 ...
  ..$ b  : Factor w/ 6 levels "a","b","c","d",..: 4 3 2 2 3 4 4 4 1 1 ...
  ..$ i  : int [1:11] 2 6 5 5 6 4 1 1 5 3 ...
  ..$ val: int [1:11] 11 1 58 14 5 71 52 39 81 88 ...
  ..- attr(*, ".internal.selfref")=<externalptr>
 $ df3:Classes ‘data.table’ and 'data.frame':	14 obs. of  4 variables:
  ..$ A  : Factor w/ 6 levels "A","B","C","D",..: 6 4 2 1 1 5 5 5 5 5 ...
  ..$ b  : Factor w/ 6 levels "a","b","c","d",..: 6 2 3 4 4 2 2 4 4 4 ...
  ..$ i  : int [1:14] 3 5 6 4 4 1 1 1 1 1 ...
  ..$ val: int [1:14] 25 60 18 78 59 26 32 39 77 28 ...
  ..- attr(*, ".internal.selfref")=<externalptr>
 $ df4:Classes ‘data.table’ and 'data.frame':	14 obs. of  4 variables:
  ..$ A  : Factor w/ 6 levels "A","B","C","D",..: 6 6 6 4 2 2 5 5 4 4 ...
  ..$ b  : Factor w/ 6 levels "a","b","c","d",..: 6 3 3 2 3 3 2 2 1 1 ...
  ..$ i  : int [1:14] 3 6 6 5 6 6 1 1 5 5 ...
  ..$ val: int [1:14] 56 86 34 70 31 12 72 1 5 64 ...
  ..- attr(*, ".internal.selfref")=<externalptr>
 $ df5:Classes ‘data.table’ and 'data.frame':	6 obs. of  4 variables:
  ..$ A  : Factor w/ 6 levels "A","B","C","D",..: 6 6 6 1 1 2
  ..$ b  : Factor w/ 6 levels "a","b","c","d",..: 4 6 3 4 1 4
  ..$ i  : int [1:6] 2 3 6 4 3 4
  ..$ val: int [1:6] 11 48 1 68 32 46
  ..- attr(*, ".internal.selfref")=<externalptr>
In each dataframe, there are only a few rows left over which share common combinations of id values:
unlist(lapply(result, nrow))
df1 df2 df3 df4 df5 
 10  11  14  14   6 
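One caveat with the counting step: it counts rows, so a count of length(dfl) identifies a common combination only if each combination of id values occurs at most once per dataframe. With randomly sampled ids this does not hold (the filtered df1 above starts with three rows for the same combination of i, A and b), so a combination can reach the required count without being present in every dataframe, or exceed it although it is present in all of them. Below is a minimal sketch of a variant that deduplicates the id columns per dataframe before counting. The function name filter_common and the toy data are hypothetical, and as.data.table() is an assumption on my part to avoid modifying the input by reference.

```r
library(data.table)

# hypothetical helper: keep only rows whose id combination occurs in every table
filter_common <- function(dfl, cn) {
  # copy to data.table so the caller's data frames are left untouched
  dtl <- lapply(dfl, as.data.table)
  # one row per distinct id combination and table
  ids <- rbindlist(lapply(dtl, function(x) unique(x[, cn, with = FALSE])))
  # a combination is common if it was found in each of the tables
  common <- ids[, .(.cnt = .N), by = cn][.cnt == length(dtl), cn, with = FALSE]
  stopifnot(nrow(common) > 0L)
  # inner join of each table with the common combinations
  lapply(dtl, function(x) x[common, on = cn, nomatch = 0L])
}

# toy data with duplicated ids:
# id 1 occurs in d1 (twice) and d2 only, id 2 and id 3 occur in all three tables
toy <- list(
  d1 = data.frame(id = c(1L, 1L, 2L, 2L, 3L), val = 1:5),
  d2 = data.frame(id = c(1L, 2L, 3L),         val = 6:8),
  d3 = data.frame(id = c(2L, 3L, 4L),         val = 9:11)
)

res <- filter_common(toy, "id")
sapply(res, nrow)
# d1 d2 d3
#  3  2  2
```

On the toy data, counting rows without unique() would treat id 1 as common (three rows in total) and drop id 2 (four rows in total), whereas the deduplicated count keeps ids 2 and 3. All rows of a common combination are retained, including duplicates within one table.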