Find duplicated rows (based on 2 columns) in Data Frame in R

后端 未结 6 670
独厮守ぢ
独厮守ぢ 2020-11-27 06:10

I have a data frame in R which looks like:

| RIC    | Date                | Open   |
|--------|---------------------|--------|
| S1A.PA | 2011-06-30 20:00:00         


        
相关标签:
6条回答
  • 2020-11-27 06:35

    Easy way to get the information you want is to use dplyr.

    yourDF %>% 
      group_by(RIC, Date) %>% 
      mutate(num_dups = n(), 
             dup_id = row_number()) %>% 
      ungroup() %>% 
      mutate(is_duplicated = dup_id > 1)
    

    Using this:

    • num_dups tells you how many times that particular combo is duplicated
    • dup_id tells you which duplicate number that particular row is (e.g. 1st, 2nd, or 3rd, etc)
    • is_duplicated gives you an easy condition you can filter on later to remove all the duplicate rows (e.g. filter(!is_duplicated)), though you could also use dup_id for this (e.g. filter(dup_id == 1))
    0 讨论(0)
  • 2020-11-27 06:39

    If you want to remove duplicate records based on values of Columns Date and State in dataset data.frame:

    #Indexes of the duplicate rows that will be removed: 
    duplicate_indexes <- which(duplicated(dataset[c('Date', 'State')]),) 
    duplicate_indexes 
    
    #new_uniq will contain unique dataset without the duplicates. 
    new_uniq <- dataset[!duplicated(dataset[c('Date', 'State')]),] 
    View(new_uniq) 
    
    0 讨论(0)
  • 2020-11-27 06:44

    Here's a dplyr option for tagging duplicates based on two (or more) columns. In this case ric and date:

    df <- data_frame(ric = c('S1A.PA', 'ABC.PA', 'EFG.PA', 'S1A.PA', 'ABC.PA', 'EFG.PA'),
                     date = c('2011-06-30 20:00:00', '2011-07-03 20:00:00', '2011-07-04 20:00:00', '2011-07-05 20:00:00', '2011-07-03 20:00:00', '2011-07-04 20:00:00'),
                     open = c(23.7, 24.31, 24.495, 24.23, 24.31, 24.495))
    
    df %>% 
      group_by(ric, date) %>% 
      mutate(dupe = n()>1)
    # A tibble: 6 x 4
    # Groups:   ric, date [4]
      ric    date                 open dupe 
      <chr>  <chr>               <dbl> <lgl>
    1 S1A.PA 2011-06-30 20:00:00  23.7 FALSE
    2 ABC.PA 2011-07-03 20:00:00  24.3 TRUE 
    3 EFG.PA 2011-07-04 20:00:00  24.5 TRUE 
    4 S1A.PA 2011-07-05 20:00:00  24.2 FALSE
    5 ABC.PA 2011-07-03 20:00:00  24.3 TRUE 
    6 EFG.PA 2011-07-04 20:00:00  24.5 TRUE 
    
    0 讨论(0)
  • 2020-11-27 06:45

    You can always try simply passing those first two columns to the function duplicated:

    duplicated(dat[,1:2])
    

    assuming your data frame is called dat. For more information, we can consult the help files for the duplicated function by typing ?duplicated at the console. This will provide the following sentences:

    Determines which elements of a vector or data frame are duplicates of elements with smaller subscripts, and returns a logical vector indicating which elements (rows) are duplicates.

    So duplicated returns a logical vector, which we can then use to extract a subset of dat:

    ind <- duplicated(dat[,1:2])
    dat[ind,]
    

    or you can skip the separate assignment step and simply use:

    dat[duplicated(dat[,1:2]),]
    
    0 讨论(0)
  • 2020-11-27 06:48

    dplyr is so much nicer for this sort of thing:

    library(dplyr)
    yourDataFrame %>%
        distinct(RIC, Date, .keep_all = TRUE)
    

    (the ".keep_all is optional. if not used, it will return only the deduped 2 columns. when used, it returns the deduped whole data frame)

    0 讨论(0)
  • 2020-11-27 06:52

    I think what you're looking for is a way to return a data frame of the duplicated rows in the same format as your original data. There is probably a more elegant way to do this but this works:

    dup <- data.frame(as.numeric(duplicated(df$var))) #creates df with binary var for duplicated rows
    colnames(dup) <- c("dup") #renames column for simplicity
    df2 <- cbind(df, dup) #bind to original df
    df3 <- subset(df2, dup == 1) #subsets df using binary var for duplicated`
    
    0 讨论(0)
提交回复
热议问题