Editing a dataframe to create a paired sample; removing records without a matching date in another group

后端 未结 1 1462
猫巷女王i
猫巷女王i 2021-01-21 17:55

I have done a bunch of searching for a solution to this and either can\'t find one or don\'t know it when I see it. I\'ve seen some topics that are close to this but deal with

1条回答
  •  天涯浪人
    2021-01-21 18:15

    Here are a few alternatives:

    1) ave

    > subset(my.df, ave(col3, col2, FUN = length) > 1)
      col1       col2 col3
    1    A 2001-01-01    3
    2    A 2001-01-02    2
    3    A 2001-01-03    6
    5    B 2001-01-01    7
    6    B 2001-01-02    4
    7    B 2001-01-03    1
    

    2) split / Filter / do.call

    > do.call("rbind", Filter(function(x) nrow(x) > 1, split(my.df, my.df$col2)))
                 col1       col2 col3
    2001-01-01.1    A 2001-01-01    3
    2001-01-01.5    B 2001-01-01    7
    2001-01-02.2    A 2001-01-02    2
    2001-01-02.6    B 2001-01-02    4
    2001-01-03.3    A 2001-01-03    6
    2001-01-03.7    B 2001-01-03    1
    

    3) dplyr (2) translates nearly directly into a dplyr solution:

    > library(dplyr)
    > my.df %>% group_by(col2) %>% filter(n() > 1)
    Source: local data frame [6 x 3]
    Groups: col2
    
      col1       col2 col3
    1    A 2001-01-01    5
    2    A 2001-01-02    1
    3    A 2001-01-03    7
    4    B 2001-01-01    2
    5    B 2001-01-02    4
    6    B 2001-01-03    6
    

    4) data.table The last two solutions can also be translated to data.table

    > data.table(my.df)[, if (.N > 1) .SD, by = col2]
             col2 col1 col3
    1: 2001-01-01    A    5
    2: 2001-01-01    B    2
    3: 2001-01-02    A    1
    4: 2001-01-02    B    4
    5: 2001-01-03    A    7
    6: 2001-01-03    B    6
    

    5) tapply

    > na.omit(tapply(my.df$col3, my.df[c('col2', 'col1')], identity))
                col1
    col2         A B
      2001-01-01 3 7
      2001-01-02 2 4
      2001-01-03 6 1
    attr(,"na.action")
    2001-02-03 2001-01-04 
             5          4 
    

    6) merge

    > merge(subset(my.df, col1 == 'A'), subset(my.df, col1 == 'B'), by = 2)
            col2 col1.x col3.x col1.y col3.y
    1 2001-01-01      A      3      B      7
    2 2001-01-02      A      2      B      4
    3 2001-01-03      A      6      B      1
    

    7) sqldf (6) is similar to the following sqldf solution:

    > sqldf("select * from `my.df` A join `my.df` B 
    +    on A.col2 = B.col2 and A.col1 = 'A' and B.col1 = 'B'")
      col1       col2 col3 col1       col2 col3
    1    A 2001-01-01    5    B 2001-01-01    2
    2    A 2001-01-02    1    B 2001-01-02    4
    3    A 2001-01-03    7    B 2001-01-03    6
    

    0 讨论(0)
提交回复
热议问题