Counting the number of elements with the values of x in a vector

闹比i 2020-11-22 02:44

I have a vector of numbers:

numbers <- c(4,23,4,23,5,43,54,56,657,67,67,435,
         453,435,324,34,456,56,567,65,34,435)

How can I hav

19 Answers
  •  梦毁少年i
    2020-11-22 03:27

    A method that is relatively fast on long vectors and returns a convenient result is lengths(split(numbers, numbers)) (note the s at the end of lengths; it is a different function from length):

    # Make some integer vectors of different sizes
    set.seed(123)
    x <- sample.int(1e3, 1e4, replace = TRUE)
    xl <- sample.int(1e3, 1e6, replace = TRUE)
    xxl <- sample.int(1e3, 1e7, replace = TRUE)
    
    # Number of times each value appears in x:
    a <- lengths(split(x,x))
    
    # Number of times the value 64 appears:
    a["64"]
    #~ 64
    #~ 15
    
    # Occurrences of the first 10 values
    a[1:10]
    #~ 1  2  3  4  5  6  7  8  9 10 
    #~ 13 12  6 14 12  5 13 14 11 14 
    

    The output is simply a named vector.
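
    Applied to the vector from the question, a minimal sketch could look like this (the variable name counts is chosen here purely for illustration):

```r
numbers <- c(4,23,4,23,5,43,54,56,657,67,67,435,
             453,435,324,34,456,56,567,65,34,435)

# Named integer vector: the names are the distinct values (as strings),
# the elements are how often each value occurs
counts <- lengths(split(numbers, numbers))

# Occurrences of a single value, looked up by name
counts[["435"]]
#~ [1] 3

# Values that occur more than once
names(counts)[counts > 1]
#~ [1] "4"   "23"  "34"  "56"  "67"  "435"
```

    Because the names are character strings, a value is looked up with "435" rather than 435; a numeric index would select by position instead.
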
    Its speed is comparable to the rle approach proposed by JBecker: slightly slower on short and medium vectors, and somewhat faster on very long ones (1e7 elements). Here is a microbenchmark run in R 3.6.2 comparing some of the proposed functions:

    library(microbenchmark)
    
    f1 <- function(vec) lengths(split(vec,vec))
    f2 <- function(vec) table(vec)
    f3 <- function(vec) rle(sort(vec))
    f4 <- function(vec) plyr::count(vec)
    
    microbenchmark(split = f1(x),
                   table = f2(x),
                   rle = f3(x),
                   plyr = f4(x))
    #~ Unit: microseconds
    #~   expr      min        lq      mean    median        uq      max neval  cld
    #~  split  402.024  423.2445  492.3400  446.7695  484.3560 2970.107   100  b  
    #~  table 1234.888 1290.0150 1378.8902 1333.2445 1382.2005 3203.332   100    d
    #~    rle  227.685  238.3845  264.2269  245.7935  279.5435  378.514   100 a   
    #~   plyr  758.866  793.0020  866.9325  843.2290  894.5620 2346.407   100   c 
    
    microbenchmark(split = f1(xl),
                   table = f2(xl),
                   rle = f3(xl),
                   plyr = f4(xl))
    #~ Unit: milliseconds
    #~   expr       min        lq      mean    median        uq       max neval cld
    #~  split  21.96075  22.42355  26.39247  23.24847  24.60674  82.88853   100 ab 
    #~  table 100.30543 104.05397 111.62963 105.54308 110.28732 168.27695   100   c
    #~    rle  19.07365  20.64686  23.71367  21.30467  23.22815  78.67523   100 a  
    #~   plyr  24.33968  25.21049  29.71205  26.50363  27.75960  92.02273   100  b 
    
    microbenchmark(split = f1(xxl),
                   table = f2(xxl),
                   rle = f3(xxl),
                   plyr = f4(xxl))
    #~ Unit: milliseconds
    #~   expr       min        lq      mean    median        uq       max neval  cld
    #~  split  296.4496  310.9702  342.6766  332.5098  374.6485  421.1348   100 a   
    #~  table 1151.4551 1239.9688 1283.8998 1288.0994 1323.1833 1385.3040   100    d
    #~    rle  399.9442  430.8396  464.2605  471.4376  483.2439  555.9278   100   c 
    #~   plyr  350.0607  373.1603  414.3596  425.1436  437.8395  506.0169   100  b  
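
    The benchmarked functions return differently shaped results, so it can help to confirm that they agree before comparing timings. Below is a minimal sketch using only the base R variants (plyr::count is left out so that the example has no package dependency; small_vec and the other variable names are illustrative):

```r
set.seed(123)
small_vec <- sample.int(10, 100, replace = TRUE)

by_split <- lengths(split(small_vec, small_vec))  # named integer vector
by_table <- table(small_vec)                      # object of class "table"
runs     <- rle(sort(small_vec))                  # list with $lengths and $values
by_rle   <- setNames(runs$lengths, runs$values)   # reshaped into a named vector

# All three should hold the same counts under the same names
stopifnot(
  identical(unname(by_split), as.vector(by_table)),
  identical(unname(by_split), unname(by_rle)),
  identical(names(by_split), names(by_table)),
  identical(names(by_split), names(by_rle))
)
```

    Note that rle(sort(vec)) on its own returns the counts and the values as two separate components, so an extra step such as the setNames call above is needed to get a lookup by value.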
    

    Importantly, among these functions only plyr::count also counts missing values (NA) when called with its default arguments. The number of NAs can also be obtained separately with sum(is.na(vec)).
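
    As an illustration of the NA behaviour, here is a small sketch with a made-up vector (with_na is a hypothetical name). It also shows that base R's table() can include NA when asked to, through its useNA argument:

```r
with_na <- c(4, 23, NA, 4, NA, 5)

# Default behaviour: NA is silently dropped by both split() and table()
lengths(split(with_na, with_na))
table(with_na)

# table() can report NA as its own category
tab <- table(with_na, useNA = "ifany")
tab[is.na(names(tab))]   # the entry holding the NA count

# Counting NA separately
sum(is.na(with_na))
#~ [1] 2
```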
