How can I subset rows in a data frame in R based on a vector of values?

前端 未结 4 1870
挽巷
挽巷 2020-11-28 04:48

I have two data sets that are supposed to be the same size but aren\'t. I need to trim the values from A that are not in B and vice versa in order to eliminate noise from a

相关标签:
4条回答
  • 2020-11-28 04:49

    Really human comprehensible example (as this is the first time I am using %in%), how to compare two data frames and keep only rows containing the equal values in specific column:

    # Set seed for reproducibility.
    set.seed(1)
    
    # Create two sample data frames.
    data_A <- data.frame(id=c(1,2,3), value=c(1,2,3))
    data_B <- data.frame(id=c(1,2,3,4), value=c(5,6,7,8))
    
    # compare data frames by specific columns and keep only 
    # the rows with equal values 
    data_A[data_A$id %in% data_B$id,]   # will keep data in data_A
    data_B[data_B$id %in% data_A$id,]   # will keep data in data_b
    

    Results:

    > data_A[data_A$id %in% data_B$id,]
      id value
    1  1     1
    2  2     2
    3  3     3
    
    > data_B[data_B$id %in% data_A$id,]
      id value
    1  1     5
    2  2     6
    3  3     7
    
    0 讨论(0)
  • 2020-11-28 04:52

    Per the comments to the original post, merges / joins are well-suited for this problem. In particular, an inner join will return only values that are present in both dataframes, making thesetdiff statement unnecessary.

    Using the data from Dinre's example:

    In base R:

    cleanedA <- merge(data_A, data_B[, "index"], by = 1, sort = FALSE)
    cleanedB <- merge(data_B, data_A[, "index"], by = 1, sort = FALSE)
    

    Using the dplyr package:

    library(dplyr)
    cleanedA <- inner_join(data_A, data_B %>% select(index))
    cleanedB <- inner_join(data_B, data_A %>% select(index))
    

    To keep the data as two separate tables, each containing only its own variables, this subsets the unwanted table to only its index variable before joining. Then no new variables are added to the resulting table.

    0 讨论(0)
  • 2020-11-28 05:03

    This will give you what you want:

    eg2011cleaned <- eg2011[!eg2011$ID %in% bg2011missingFromBeg, ]
    

    The error in your second attempt is because you forgot the ,

    In general, for convenience, the specification object[index] subsets columns for a 2d object. If you want to subset rows and keep all columns you have to use the specification object[index_rows, index_columns], while index_cols can be left blank, which will use all columns by default.

    However, you still need to include the , to indicate that you want to get a subset of rows instead of a subset of columns.

    0 讨论(0)
  • 2020-11-28 05:05

    If you really just want to subset each data frame by an index that exists in both data frames, you can do this with the 'match' function, like so:

    data_A[match(data_B$index, data_A$index, nomatch=0),]
    data_B[match(data_A$index, data_B$index, nomatch=0),]
    

    This is, though, the same as:

    data_A[data_A$index %in% data_B$index,]
    data_B[data_B$index %in% data_A$index,]
    

    Here is a demo:

    # Set seed for reproducibility.
    set.seed(1)
    
    # Create two sample data sets.
    data_A <- data.frame(index=sample(1:200, 90, rep=FALSE), value=runif(90))
    data_B <- data.frame(index=sample(1:200, 120, rep=FALSE), value=runif(120))
    
    # Subset data of each data frame by the index in the other.
    t_A <- data_A[match(data_B$index, data_A$index, nomatch=0),]
    t_B <- data_B[match(data_A$index, data_B$index, nomatch=0),]
    
    # Make sure they match.
    data.frame(t_A[order(t_A$index),], t_B[order(t_B$index),])[1:20,]
    
    #    index     value index.1    value.1
    # 27     3 0.7155661       3 0.65887761
    # 10    12 0.6049333      12 0.14362694
    # 88    14 0.7410786      14 0.42021589
    # 56    15 0.4525708      15 0.78101754
    # 38    18 0.2075451      18 0.70277874
    # 24    23 0.4314737      23 0.78218212
    # 34    32 0.1734423      32 0.85508236
    # 22    38 0.7317925      38 0.56426384
    # 84    39 0.3913593      39 0.09485786
    # 5     40 0.7789147      40 0.31248966
    # 74    43 0.7799849      43 0.10910096
    # 71    45 0.2847905      45 0.26787813
    # 57    46 0.1751268      46 0.17719454
    # 25    48 0.1482116      48 0.99607737
    # 81    53 0.6304141      53 0.26721208
    # 60    58 0.8645449      58 0.96920881
    # 30    59 0.6401010      59 0.67371223
    # 75    61 0.8806190      61 0.69882454
    # 63    64 0.3287773      64 0.36918946
    # 19    70 0.9240745      70 0.11350771
    
    0 讨论(0)
提交回复
热议问题