Find duplicated elements with dplyr

花落未央 2020-12-04 11:06

I tried using the code presented here to find ALL duplicated elements with dplyr like this:

library(dplyr)

mtcars %>%
mutate(cyl.dup = cyl[duplicated(cyl         


        
5 Answers
  • 2020-12-04 11:33

    I guess you could use filter for this purpose:

    mtcars %>% 
      group_by(carb) %>% 
      filter(n()>1)
    

    Small example (note that I added summarize() to show that the resulting data set contains only rows whose 'carb' value occurs more than once. I used 'carb' instead of 'cyl' because 'carb' has values that occur only once, whereas 'cyl' does not):

    mtcars %>% group_by(carb) %>% summarize(n=n())
    #Source: local data frame [6 x 2]
    #
    #  carb  n
    #1    1  7
    #2    2 10
    #3    3  3
    #4    4 10
    #5    6  1
    #6    8  1
    
    mtcars %>% group_by(carb) %>% filter(n()>1) %>% summarize(n=n())
    #Source: local data frame [4 x 2]
    #
    #  carb  n
    #1    1  7
    #2    2 10
    #3    3  3
    #4    4 10
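
    As a quick check of the grouped filter above, here is a minimal sketch (assuming dplyr is installed; the name dups is only illustrative). It compares row counts before and after filtering and adds ungroup() so that later steps are not computed per group:

    ```r
    library(dplyr)

    # Keep every row whose carb value occurs more than once.
    dups <- mtcars %>%
      group_by(carb) %>%
      filter(n() > 1) %>%
      ungroup()  # drop the grouping so later verbs see one table

    # The two single-row groups (carb 6 and carb 8) are dropped.
    nrow(mtcars)             # 32
    nrow(dups)               # 30
    sort(unique(dups$carb))  # 1 2 3 4
    ```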
    
  • 2020-12-04 11:39

    Another solution is to use the janitor package:

    library(janitor)  # provides get_dupes() and re-exports %>%
    mtcars %>% get_dupes(wt)
    
  • 2020-12-04 11:44

    We can find duplicated elements with dplyr as follows.

    library(dplyr)
    
    # Only the second and later occurrences of each duplicated value
    mtcars %>%
      filter(duplicated(.[["carb"]]))
    
    # All rows whose value is duplicated, including first occurrences
    mtcars %>%
      filter(carb %in% unique(.[["carb"]][duplicated(.[["carb"]])]))
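
    To see how the two filters above differ, a small sketch (assuming dplyr is installed; later_only and all_dups are illustrative names) compares their row counts. It uses the bare column name instead of .[["carb"]], which should behave the same inside filter:

    ```r
    library(dplyr)

    # duplicated() marks only the second and later occurrences of a value.
    later_only <- mtcars %>%
      filter(duplicated(carb))

    # Matching against the duplicated values also keeps first occurrences.
    all_dups <- mtcars %>%
      filter(carb %in% carb[duplicated(carb)])

    nrow(later_only)  # 26 (32 rows minus the 6 first occurrences)
    nrow(all_dups)    # 30 (32 rows minus the 2 values that occur once)
    ```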
    
  • 2020-12-04 11:48

    The original post misapplies the solution from the related answer. Inside mutate, that expression subsets the cyl vector, and the subset is not guaranteed to have the same length as the mtcars data frame, which mutate requires of a new column.

    Instead, use filter to return all duplicated elements, or mutate with ifelse to create a dummy variable that can be filtered on later:

    library(dplyr)
    
    # Return all duplicated elements
    mtcars %>%
      filter(duplicated(cyl) | duplicated(cyl, fromLast = TRUE))
    # Or create a dummy variable flagging all duplicated elements
    mtcars %>%
      mutate(cyl.dup = ifelse(duplicated(cyl) | duplicated(cyl, fromLast = TRUE), 1, 0))
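
    The length problem described above can be made visible with a short sketch (assuming dplyr is installed; mask, result and flagged are illustrative names). It uses carb rather than cyl because carb has values that occur only once, so the subset is visibly shorter than the data frame; every cyl value in mtcars occurs more than once:

    ```r
    library(dplyr)

    # TRUE for every row whose carb value occurs more than once.
    mask <- duplicated(mtcars$carb) | duplicated(mtcars$carb, fromLast = TRUE)

    length(mtcars$carb[mask])  # 30
    nrow(mtcars)               # 32

    # mutate() expects a column of length nrow(.) or 1, so a 30-element
    # subset is rejected; tryCatch turns the error into a marker value.
    result <- tryCatch(
      mtcars %>%
        mutate(carb.dup = carb[duplicated(carb) | duplicated(carb, fromLast = TRUE)]),
      error = function(e) "error"
    )
    identical(result, "error")  # TRUE

    # Storing the logical mask itself has the right length and works.
    flagged <- mtcars %>%
      mutate(carb.dup = duplicated(carb) | duplicated(carb, fromLast = TRUE))
    sum(flagged$carb.dup)  # 30
    ```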
    
  • 2020-12-04 11:55
    # Adding a shortcut to the answer above
    library(dplyr)
    #> 
    #> Attaching package: 'dplyr'
    #> The following objects are masked from 'package:stats':
    #> 
    #>     filter, lag
    #> The following objects are masked from 'package:base':
    #> 
    #>     intersect, setdiff, setequal, union
    mtcars %>% count(carb)
    #> # A tibble: 6 x 2
    #>    carb     n
    #>   <dbl> <int>
    #> 1    1.     7
    #> 2    2.    10
    #> 3    3.     3
    #> 4    4.    10
    #> 5    6.     1
    #> 6    8.     1
    mtcars %>% count(carb) %>% filter(n > 1)
    #> # A tibble: 4 x 2
    #>    carb     n
    #>   <dbl> <int>
    #> 1    1.     7
    #> 2    2.    10
    #> 3    3.     3
    #> 4    4.    10
    
    # Showing an alternative that follows the apparent intention of the asker
    duplicated_carb <- mtcars %>% 
      mutate(dup_carb = duplicated(carb)) %>% 
      filter(dup_carb)
    duplicated_carb
    #>     mpg cyl  disp  hp drat    wt  qsec vs am gear carb dup_carb
    #> 1  21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4     TRUE
    #> 2  21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1     TRUE
    #> 3  18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1     TRUE
    #> 4  14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4     TRUE
    #> 5  24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2     TRUE
    #> 6  22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2     TRUE
    #> 7  19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4     TRUE
    #> 8  17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4     TRUE
    #> 9  17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3     TRUE
    #> 10 15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3     TRUE
    #> 11 10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4     TRUE
    #> 12 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4     TRUE
    #> 13 14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4     TRUE
    #> 14 32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1     TRUE
    #> 15 30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2     TRUE
    #> 16 33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1     TRUE
    #> 17 21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1     TRUE
    #> 18 15.5   8 318.0 150 2.76 3.520 16.87  0  0    3    2     TRUE
    #> 19 15.2   8 304.0 150 3.15 3.435 17.30  0  0    3    2     TRUE
    #> 20 13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4     TRUE
    #> 21 19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2     TRUE
    #> 22 27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1     TRUE
    #> 23 26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2     TRUE
    #> 24 30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2     TRUE
    #> 25 15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4     TRUE
    #> 26 21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2     TRUE
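
    If the full rows are needed rather than the per-value counts, one possible follow-up (a sketch, assuming dplyr is installed; dup_values and dup_rows are illustrative names) is to match the filtered counts back against the data with semi_join:

    ```r
    library(dplyr)

    # carb values that occur more than once, with their counts.
    dup_values <- mtcars %>%
      count(carb) %>%
      filter(n > 1)

    # Keep the mtcars rows whose carb appears in dup_values.
    # semi_join() filters rows and does not add the n column.
    dup_rows <- mtcars %>%
      semi_join(dup_values, by = "carb")

    nrow(dup_rows)  # 30
    ```

    Unlike the duplicated(carb) filter above, which returns 26 rows, this also keeps the first occurrence of each repeated value.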
    