Keep only those rows in matrices which have equal elements in a certain column

一向 2021-01-28 00:56

Let me show an example. Suppose we have 3 tables (focus on column N):

   Table 1         Table 2         Table 3
-------------   -------------   -------------
  N   Values      N   Values      N   Values
  5        1      5       -1      5        1
 10        2      6       -2      6       21
 15        3     10       -3     10        5
                 15       -4     12        6
                                 15        3

4 Answers
  •  轻奢々
    轻奢々 (original poster)
    2021-01-28 01:23

    I would like to propose a generic approach which works for an arbitrary number of dataframes as well as for multiple id columns.

    The dataframes may differ in structure, i.e., in the number and types of their columns. The only requirement is that every dataframe contains all id columns, with the same names and types. In addition, the code detects the case where the dataframes have no combination of id values in common.
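
    The requirement on the id columns can be checked up front. Below is a minimal sketch in base R, assuming a list of dataframes `dfl` and a character vector of id column names `cn`; the helper name `check_id_cols` and the toy data are made up for illustration and are not part of the approach itself.

```r
# Hypothetical helper: verify that every dataframe contains all id columns
# and that each id column has the same class in all dataframes.
check_id_cols <- function(dfl, cn) {
  # every dataframe must contain all id columns
  has_all <- vapply(dfl, function(x) all(cn %in% names(x)), logical(1))
  if (!all(has_all)) {
    stop("id columns missing in element(s): ",
         paste(which(!has_all), collapse = ", "))
  }
  # each id column must have one and the same class everywhere
  for (col in cn) {
    classes <- unique(vapply(dfl, function(x) class(x[[col]])[1L], character(1)))
    if (length(classes) > 1L) {
      stop("column '", col, "' has differing classes: ",
           paste(classes, collapse = ", "))
    }
  }
  invisible(TRUE)
}

# toy data (made up for illustration)
ok  <- list(data.frame(N = 1:3, x = letters[1:3]),
            data.frame(N = 2:4, y = c(TRUE, FALSE, TRUE)))
bad <- list(data.frame(N = 1:3),
            data.frame(N = c("1", "2")))

check_id_cols(ok, "N")        # passes silently
try(check_id_cols(bad, "N"))  # error: integer vs. character
```

    Comparing only the first class of each column is a simplification; stricter checks (e.g., on factor levels) may be needed depending on the data.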

    Suppose we have a list of dataframes dfl and a vector of column names cn which should be checked for value combinations common to all dataframes in the list:

    dfl <- list(Table1 = Table1, Table2 = Table2, Table3 = Table3)
    cn <- "N"
    
    library(data.table)
    # determine common combinations of id values
    common <- rbindlist(lapply(dfl, function(x) setDT(x)[, .SD, .SDcols = cn]))[
      , .(.cnt = .N), by = cn][.cnt == length(dfl)][, -".cnt"]
    # stop if there are no common id values
    stopifnot(nrow(common) > 0L)
    # join with all data tables in dfl, keeping only rows which have common id values
    result <- lapply(dfl, function(x) x[common, on = cn, nomatch = 0L])
    
    result
    
    $Table1
        N Values
    1:  5      1
    2: 10      2
    3: 15      3
    
    $Table2
        N Values
    1:  5     -1
    2: 10     -3
    3: 15     -4
    
    $Table3
        N Values
    1:  5      1
    2: 10      5
    3: 15      3
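
    To see how the chained expression arrives at `common`, the steps can be run one at a time. This is a sketch using the sample data from the Data section below, retyped so that the block is self-contained; the intermediate names `ids` and `counts` are made up for illustration, and `as.data.table()` is used instead of `setDT()` so that the inputs are copied rather than converted in place.

```r
library(data.table)

dfl <- list(
  Table1 = data.frame(N = c(5L, 10L, 15L), Values = 1:3),
  Table2 = data.frame(N = c(5L, 6L, 10L, 15L), Values = c(-1L, -2L, -3L, -4L)),
  Table3 = data.frame(N = c(5L, 6L, 10L, 12L, 15L), Values = c(1L, 21L, 5L, 6L, 3L))
)
cn <- "N"

# step 1: stack the id columns of all tables into one long table
ids <- rbindlist(lapply(dfl, function(x) as.data.table(x)[, .SD, .SDcols = cn]))
# step 2: count how often each id value occurs overall
counts <- ids[, .(.cnt = .N), by = cn]
# step 3: keep the id values which occur length(dfl) times, then drop the counter
common <- counts[.cnt == length(dfl)][, -".cnt"]

counts    # 5, 10, and 15 occur 3 times each; 6 occurs twice; 12 occurs once
common$N  # 5 10 15
```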
    

    Data

    dfl <- list(
      Table1 = data.frame(N = c(5L, 10L, 15L), Values = 1:3),
      Table2 = data.frame(N = c(5L, 6L, 10L, 15L), Values = c(-1L, -2L, -3L, -4L)),
      Table3 = data.frame(N = c(5L, 6L, 10L, 12L, 15L), Values = c(1L, 21L, 5L, 6L, 3L))
    )
    

    Example with multiple id columns

    # create sample data: 5 dataframes with 100 rows each and 3 id columns
    set.seed(123L)
    ndf <- 5L
    dfl <- lapply(seq_len(ndf), function(i) {
      nr <- 100L
      nseq <- 1:6
      data.frame(A = sample(LETTERS[nseq], nr, replace = TRUE),
                 b = sample(letters[nseq], nr, replace = TRUE),
                 i = sample(nseq, nr, replace = TRUE),
                 val = sample.int(nr, nr),
                 stringsAsFactors = TRUE)  # factors, as in the str() output below
      })
    dfl <- setNames(dfl, paste0("df", seq_along(dfl)))
    str(dfl)
    
    List of 5
     $ df1:'data.frame':  100 obs. of  4 variables:
      ..$ A  : Factor w/ 6 levels "A","B","C","D",..: 2 5 3 6 6 1 4 6 4 3 ...
      ..$ b  : Factor w/ 6 levels "a","b","c","d",..: 4 2 3 6 3 6 6 4 3 1 ...
      ..$ i  : int [1:100] 2 6 4 4 3 6 3 2 2 2 ...
      ..$ val: int [1:100] 79 1 77 71 61 46 15 99 42 45 ...
     $ df2:'data.frame':  100 obs. of  4 variables:
      ..$ A  : Factor w/ 6 levels "A","B","C","D",..: 6 1 6 4 3 3 5 1 3 5 ...
      ..$ b  : Factor w/ 6 levels "a","b","c","d",..: 3 3 2 1 3 2 4 4 6 3 ...
      ..$ i  : int [1:100] 2 5 2 2 2 5 1 5 2 3 ...
      ..$ val: int [1:100] 85 26 3 84 33 61 52 36 18 40 ...
     $ df3:'data.frame':  100 obs. of  4 variables:
      ..$ A  : Factor w/ 6 levels "A","B","C","D",..: 3 3 1 1 2 6 3 3 5 5 ...
      ..$ b  : Factor w/ 6 levels "a","b","c","d",..: 6 4 6 4 5 4 5 6 5 1 ...
      ..$ i  : int [1:100] 2 4 1 6 6 3 5 2 1 3 ...
      ..$ val: int [1:100] 81 73 22 99 84 51 57 88 93 61 ...
     $ df4:'data.frame':  100 obs. of  4 variables:
      ..$ A  : Factor w/ 6 levels "A","B","C","D",..: 6 6 3 5 3 6 1 1 5 4 ...
      ..$ b  : Factor w/ 6 levels "a","b","c","d",..: 1 3 4 6 5 4 1 1 5 1 ...
      ..$ i  : int [1:100] 2 2 1 3 2 5 4 6 1 6 ...
      ..$ val: int [1:100] 94 98 45 23 67 53 55 41 40 100 ...
     $ df5:'data.frame':  100 obs. of  4 variables:
      ..$ A  : Factor w/ 6 levels "A","B","C","D",..: 4 1 2 5 5 1 6 1 4 3 ...
      ..$ b  : Factor w/ 6 levels "a","b","c","d",..: 5 1 3 6 6 5 1 4 6 4 ...
      ..$ i  : int [1:100] 1 6 2 5 4 1 6 4 6 4 ...
      ..$ val: int [1:100] 45 28 16 85 54 53 56 68 59 94 ...
    
    # define id columns
    cn <- c("i", "A", "b")
    
    common <- rbindlist(lapply(dfl, function(x) setDT(x)[, .SD, .SDcols = cn]))[
      , .(.cnt = .N), by = cn][.cnt == length(dfl)][, -".cnt"]
    stopifnot(nrow(common) > 0L)
    result <- lapply(dfl, function(x) x[common, on = cn, nomatch = 0L])
    
    str(result)
    
    List of 5
     $ df1:Classes ‘data.table’ and 'data.frame': 10 obs. of  4 variables:
      ..$ A  : Factor w/ 6 levels "A","B","C","D",..: 6 6 6 6 6 6 4 2 1 5
      ..$ b  : Factor w/ 6 levels "a","b","c","d",..: 4 4 4 6 6 3 2 3 4 2
      ..$ i  : int [1:10] 2 2 2 3 3 6 5 6 4 1
      ..$ val: int [1:10] 99 85 4 36 83 70 12 52 53 58
      ..- attr(*, ".internal.selfref")=<externalptr> 
     $ df2:Classes ‘data.table’ and 'data.frame': 11 obs. of  4 variables:
      ..$ A  : Factor w/ 6 levels "A","B","C","D",..: 6 6 4 4 2 1 5 5 4 1 ...
      ..$ b  : Factor w/ 6 levels "a","b","c","d",..: 4 3 2 2 3 4 4 4 1 1 ...
      ..$ i  : int [1:11] 2 6 5 5 6 4 1 1 5 3 ...
      ..$ val: int [1:11] 11 1 58 14 5 71 52 39 81 88 ...
      ..- attr(*, ".internal.selfref")=<externalptr> 
     $ df3:Classes ‘data.table’ and 'data.frame': 14 obs. of  4 variables:
      ..$ A  : Factor w/ 6 levels "A","B","C","D",..: 6 4 2 1 1 5 5 5 5 5 ...
      ..$ b  : Factor w/ 6 levels "a","b","c","d",..: 6 2 3 4 4 2 2 4 4 4 ...
      ..$ i  : int [1:14] 3 5 6 4 4 1 1 1 1 1 ...
      ..$ val: int [1:14] 25 60 18 78 59 26 32 39 77 28 ...
      ..- attr(*, ".internal.selfref")=<externalptr> 
     $ df4:Classes ‘data.table’ and 'data.frame': 14 obs. of  4 variables:
      ..$ A  : Factor w/ 6 levels "A","B","C","D",..: 6 6 6 4 2 2 5 5 4 4 ...
      ..$ b  : Factor w/ 6 levels "a","b","c","d",..: 6 3 3 2 3 3 2 2 1 1 ...
      ..$ i  : int [1:14] 3 6 6 5 6 6 1 1 5 5 ...
      ..$ val: int [1:14] 56 86 34 70 31 12 72 1 5 64 ...
      ..- attr(*, ".internal.selfref")=<externalptr> 
     $ df5:Classes ‘data.table’ and 'data.frame': 6 obs. of  4 variables:
      ..$ A  : Factor w/ 6 levels "A","B","C","D",..: 6 6 6 1 1 2
      ..$ b  : Factor w/ 6 levels "a","b","c","d",..: 4 6 3 4 1 4
      ..$ i  : int [1:6] 2 3 6 4 3 4
      ..$ val: int [1:6] 11 48 1 68 32 46
      ..- attr(*, ".internal.selfref")=<externalptr> 
    

    After filtering, only a few rows remain in each dataframe:

    unlist(lapply(result, nrow))
    
    df1 df2 df3 df4 df5 
     10  11  14  14   6
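
    One caveat: counting with `.N` identifies truly common combinations only if each combination occurs at most once per dataframe. In the sample data above, combinations repeat within a dataframe, and the output reflects that: the combination i = 1, A = "E", b = "b" is kept in df1, yet the filtered df5 (shown in full) contains no such row. A possible refinement, sketched here under the assumption that repeated id combinations are legitimate in the data, deduplicates the id columns of each dataframe before counting. The function name `filter_common` and the toy tables are made up for illustration.

```r
library(data.table)

# Hypothetical wrapper around the approach above, with one change:
# unique() removes repeated id combinations within each table before counting.
filter_common <- function(dfl, cn) {
  dts <- lapply(dfl, as.data.table)  # copies, the inputs stay untouched
  ids <- rbindlist(lapply(dts, function(x) unique(x[, .SD, .SDcols = cn])))
  common <- ids[, .(.cnt = .N), by = cn][.cnt == length(dts)][, -".cnt"]
  stopifnot(nrow(common) > 0L)
  lapply(dts, function(x) x[common, on = cn, nomatch = 0L])
}

# toy data (made up): N = 1 occurs three times overall, but only in a and b
toy <- list(
  a = data.frame(N = c(1L, 1L, 2L), v = 1:3),
  b = data.frame(N = c(1L, 2L),     v = 4:5),
  c = data.frame(N = c(2L, 3L),     v = 6:7)
)

res <- filter_common(toy, "N")
sapply(res, nrow)  # one row per table: only N = 2 occurs in all three
```

    Without the call to `unique()`, N = 1 would be counted three times and would pass the `.cnt == length(dts)` test although table c does not contain it.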
    
