Duplicated rows: select rows based on criteria and store duplicated values

有刺的猬 2021-01-23 19:17

I am working on a raw dataset that looks something like this:

df <- data.frame("ID" = c("Alpha", "Alpha", "Alpha", "Alpha", 

2 Answers
  • 2021-01-23 19:32

    Here is one option with dplyr. After grouping by 'ID' and 'Year', create a logical column ('ind') that flags the rows where 'Val2' equals the group maximum. Use it to create 'del' columns ('Val_del', 'Val2_del') holding the values of the rows that will be eliminated, plus 'del_treat' for the eliminated 'treatment'. Then filter the rows on 'ind', ungroup, and drop 'ind'.

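    library(dplyr)
    df %>% 
       group_by(ID, Year) %>% 
       # flag the row(s) holding the group's maximum Val2
       mutate(ind = Val2 == max(Val2) & !is.na(Val2)) %>% 
       # copy the values of the unflagged rows into Val_del and Val2_del
       mutate_at(vars(matches('Val')), 
            list(del = ~ if(any(!ind)) .[!ind] else NA_real_)) %>% 
       # same for the treatment that is dropped
       mutate(del_treat = if(any(!ind)) treatment[!ind] else NA_character_) %>% 
       # keep only the flagged rows, then drop the helper column
       filter(ind) %>%
       ungroup %>%
       select(-ind)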
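
    For readers who want to check the intended result without any packages, the same keep-the-maximum logic can be sketched in base R. This is only an illustration: the toy data is reconstructed from the printed results in the data.table answer (the data frame in the question is truncated), and `keep_max` is a hypothetical helper name. Unlike the pipeline above, it keeps a single row per group even when 'Val2' is tied.

    ```r
    # Toy data reconstructed from the printed results in the other answer
    # (an assumption; the original data frame in the question is truncated).
    df <- data.frame(
      ID        = c("Alpha", "Alpha", "Alpha", "Alpha", "Beta", "Beta", "Beta", "Beta"),
      Year      = c(1970, 1970, 1980, 1990, 1970, 1980, 1980, 1990),
      treatment = c("A", "B", "C", "D", "E", "F", "G", "H"),
      Val       = c(0, 0, 0, 1, 0, 1, 0, 1),
      Val2      = c(0, 2.34, 1.3, 0, 0, 2.34, 3.2, 1.3),
      stringsAsFactors = FALSE
    )

    # keep_max() is a hypothetical helper: for one ID/Year group, keep the row
    # with the largest Val2 and store the first discarded row in del_* columns.
    keep_max <- function(g) {
      g <- g[order(-g$Val2), ]   # largest Val2 first
      keep <- g[1, ]
      drop <- g[-1, ]            # zero rows when the group has a single row
      keep$del_treat <- if (nrow(drop)) drop$treatment[1] else NA_character_
      keep$del_Val   <- if (nrow(drop)) drop$Val[1]       else NA_real_
      keep$del_Val2  <- if (nrow(drop)) drop$Val2[1]      else NA_real_
      keep
    }

    # Split by ID/Year, reduce each group to one row, and stack the results.
    groups <- split(df, list(df$ID, df$Year), drop = TRUE)
    res <- do.call(rbind, lapply(groups, keep_max))
    res <- res[order(res$ID, res$Year), ]
    rownames(res) <- NULL
    res
    ```

    Groups with a single row get NA in all three del_* columns, which matches the NA fallbacks in the dplyr code.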
    
  • 2021-01-23 19:50

    Using data.table, a dcast on rowid(ID, Year) after ordering by Val2 descending gets you there, except for the column names. The "_1" columns are the "keep" columns, and the "_2" columns are the "del" columns.

    library(data.table)
    setDT(df)
    
    setorder(df, ID, Year, -Val2)
    
    out <- 
      dcast(df, ID + Year ~ rowid(ID, Year), value.var = c('treatment', 'Val', 'Val2'))
    out
    #       ID Year treatment_1 treatment_2 Val_1 Val_2 Val2_1 Val2_2
    # 1: Alpha 1970           B           A     0     0   2.34   0.00
    # 2: Alpha 1980           C        <NA>     0    NA   1.30     NA
    # 3: Alpha 1990           D        <NA>     1    NA   0.00     NA
    # 4:  Beta 1970           E        <NA>     0    NA   0.00     NA
    # 5:  Beta 1980           G           F     0     1   3.20   2.34
    # 6:  Beta 1990           H        <NA>     1    NA   1.30     NA
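
    To make the role of the within-group counter clearer, here is a rough base R stand-in for what rowid(ID, Year) contributes. It is a sketch under assumptions: the rows are reconstructed from the output above and are already sorted by ID, Year and descending Val2, and `ave()` plus `reshape()` only imitate the data.table calls, they are not what dcast does internally.

    ```r
    # Rows reconstructed from the printed output, already sorted by
    # ID, Year, -Val2 (an assumption, not the asker's original data).
    df <- data.frame(
      ID   = c("Alpha", "Alpha", "Alpha", "Alpha", "Beta", "Beta", "Beta", "Beta"),
      Year = c(1970, 1970, 1980, 1990, 1970, 1980, 1980, 1990),
      Val2 = c(2.34, 0, 1.3, 0, 0, 3.2, 2.34, 1.3)
    )

    # Base R stand-in for rowid(ID, Year): number the rows within each group.
    df$rid <- ave(seq_len(nrow(df)), df$ID, df$Year, FUN = seq_along)
    df$rid
    # 1 2 1 1 1 1 2 1

    # Spreading Val2 by that counter yields Val2_1 (keep) and Val2_2 (del).
    wide <- reshape(df, idvar = c("ID", "Year"), timevar = "rid",
                    direction = "wide", sep = "_")
    wide
    ```

    Because the rows are sorted by descending Val2 first, counter value 1 always lands on the row to keep.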
    

    We can rename the columns to match yours; the only difference is that the del columns keep a number at the end, which would be useful if a group can have more than 2 rows.

    setnames(out, function(x) gsub('(.*)_1', '\\1', x))
    setnames(out, function(x) gsub('(.*_\\d+)', 'del_\\1', x))
    out
    #       ID Year treatment del_treatment_2 Val del_Val_2 Val2 del_Val2_2
    # 1: Alpha 1970         B               A   0         0 2.34       0.00
    # 2: Alpha 1980         C            <NA>   0        NA 1.30         NA
    # 3: Alpha 1990         D            <NA>   1        NA 0.00         NA
    # 4:  Beta 1970         E            <NA>   0        NA 0.00         NA
    # 5:  Beta 1980         G               F   0         1 3.20       2.34
    # 6:  Beta 1990         H            <NA>   1        NA 1.30         NA
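
    The two renaming patterns can be tried on a plain character vector to see what each step does. This is a small sketch: the name vector is copied from the dcast output above, while "Val_10" is a hypothetical column name used only to show why anchoring the first pattern with `$` may be safer once a group has ten or more rows.

    ```r
    # Column names as produced by dcast in the answer above.
    nms <- c("ID", "Year", "treatment_1", "treatment_2",
             "Val_1", "Val_2", "Val2_1", "Val2_2")

    # Step 1: strip the "_1" suffix from the "keep" columns.
    step1 <- gsub('(.*)_1', '\\1', nms)

    # Step 2: prefix whatever still ends in "_<number>" with "del_".
    step2 <- gsub('(.*_\\d+)', 'del_\\1', step1)
    step2

    # Hypothetical edge case: the unanchored pattern also matches inside "_10",
    # while the anchored variant leaves that name alone.
    loose    <- gsub('(.*)_1', '\\1', "Val_10")    # "Val0"
    anchored <- gsub('(.*)_1$', '\\1', "Val_10")   # "Val_10"
    ```

    With only two rows per group, as in the output shown, both variants give the same names.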
    