How to create a consecutive group number

深忆病人 2020-11-21 16:13

I have a data frame (all_data) containing a list of sites (1 to n) and their scores, e.g.:

  site  score
     1    10
     1    11  
              


        
8 Answers
  • 2020-11-21 16:38

    Another way to do it, which I think is easy to follow even if you know little about R:

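    library(dplyr)
    df <- data.frame(site = c(1, 1, 1, 4, 4, 4, 8, 8, 8))
    # Increment the counter whenever site differs from the previous row
    # (assumes rows belonging to the same site are adjacent)
    df <- mutate(df, number = cumsum(site != lag(site, default = -1)))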
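
    For readers without dplyr, the same run-counting idea can be sketched in base R. This is a minimal sketch rather than the answerer's code; the names site, is_new_run and number are illustrative only, and like the lag() version it assumes rows of the same site are adjacent.

```r
site <- c(1, 1, 1, 4, 4, 4, 8, 8, 8)

# TRUE for the first row and wherever the value differs from the previous row
is_new_run <- c(TRUE, site[-1] != site[-length(site)])

# A running count of run starts gives the consecutive group number
number <- cumsum(is_new_run)
print(number)  # 1 1 1 2 2 2 3 3 3
```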
    
  • 2020-11-21 16:48

    You can turn site into a factor and then return the numeric or integer values of that factor:

    dat <- data.frame(site = rep(c(1,4,8), each = 3), score = runif(9))
    dat$number <- as.integer(factor(dat$site))
    dat
    
      site     score number
    1    1 0.5305773      1
    2    1 0.9367732      1
    3    1 0.1831554      1
    4    4 0.4068128      2
    5    4 0.3438962      2
    6    4 0.8123883      2
    7    8 0.9122846      3
    8    8 0.2949260      3
    9    8 0.6771526      3
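
    One point worth keeping in mind with this approach: factor() sorts the distinct values into levels, so the numbers follow the sorted site values rather than the order in which sites appear. The sketch below uses a made-up, unsorted site vector to show the difference; passing levels = unique(site) is one possible way to number by first appearance instead.

```r
# Hypothetical unsorted input, chosen only to illustrate level ordering
site <- c(1, 1, 8, 8, 4, 4)

# Default: levels are the sorted distinct values 1, 4, 8
by_value <- as.integer(factor(site))
print(by_value)       # 1 1 3 3 2 2

# Levels in order of first appearance: 1, 8, 4
by_appearance <- as.integer(factor(site, levels = unique(site)))
print(by_appearance)  # 1 1 2 2 3 3
```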
    
  • 2020-11-21 16:56

    As of dplyr 1.0.0, we can use cur_group_id(), which returns a unique numeric identifier for the current group.

    library(dplyr)
    df %>% group_by(site) %>% mutate(number = cur_group_id())
    
    #  site score number
    #  <int> <int>  <int>
    #1     1    10      1
    #2     1    11      1
    #3     1    12      1
    #4     4    10      2
    #5     4    11      2
    #6     4    11      2
    #7     8     9      3
    #8     8     8      3
    #9     8     7      3
    

    data

    df <- structure(list(site = c(1L, 1L, 1L, 4L, 4L, 4L, 8L, 8L, 8L), 
    score = c(10L, 11L, 12L, 10L, 11L, 11L, 9L, 8L, 7L)), 
    class = "data.frame", row.names = c(NA, -9L))
    
  • 2020-11-21 16:57

    Using the data from @Jaap, a different dplyr possibility using dense_rank() could be:

    dat %>%
     mutate(ID = dense_rank(site))
    
       site     score ID
    1     1 0.1884490  1
    2     1 0.1087422  1
    3     1 0.7438149  1
    4     8 0.1150771  3
    5     8 0.9978203  3
    6     8 0.7781222  3
    7     4 0.4081830  2
    8     4 0.2782333  2
    9     4 0.9566959  2
    10    8 0.2545320  3
    11    8 0.1201062  3
    12    8 0.5449901  3
    

    Or a rleid()-like dplyr approach, with the data arranged first:

    dat %>%
     arrange(site) %>%
     mutate(ID = with(rle(site), rep(seq_along(lengths), lengths)))
    
       site     score ID
    1     1 0.1884490  1
    2     1 0.1087422  1
    3     1 0.7438149  1
    4     4 0.4081830  2
    5     4 0.2782333  2
    6     4 0.9566959  2
    7     8 0.1150771  3
    8     8 0.9978203  3
    9     8 0.7781222  3
    10    8 0.2545320  3
    11    8 0.1201062  3
    12    8 0.5449901  3
    

    Or using duplicated() and cumsum():

    df %>%
     mutate(ID = cumsum(!duplicated(site)))
    

    The same with base R:

    df$ID <- with(rle(df$site), rep(seq_along(lengths), lengths))
    

    Or:

    df$ID <- cumsum(!duplicated(df$site))
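
    Both base R one-liners above rely on rows of the same site being adjacent, which is presumably why the data is arranged first in the rle() example. The sketch below uses a made-up vector in which site 8 appears in two separate places to show what happens otherwise; match() against unique() is included as one option that does not depend on row order.

```r
# Hypothetical input where site 8 is not contiguous
site <- c(1, 1, 8, 4, 8)

# rle() numbers runs, so the second block of 8 starts a new ID
id_rle <- with(rle(site), rep(seq_along(lengths), lengths))
print(id_rle)    # 1 1 2 3 4

# cumsum(!duplicated()) carries the latest counter forward,
# so the trailing 8 receives the ID that belongs to site 4
id_dup <- cumsum(!duplicated(site))
print(id_dup)    # 1 1 2 3 3

# match() looks each value up among the distinct values,
# so every 8 gets the same ID wherever it appears
id_match <- match(site, unique(site))
print(id_match)  # 1 1 2 3 2
```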
    
  • 2020-11-21 16:58

    Another solution using the data.table package.

    Example with the more complete dataset provided by Jaap:

    library(data.table)
    setDT(dat)[, number := frank(site, ties.method = "dense")]
    dat
        site     score number
     1:    1 0.3107920      1
     2:    1 0.3640102      1
     3:    1 0.1715318      1
     4:    8 0.7247535      3
     5:    8 0.1263025      3
     6:    8 0.4657868      3
     7:    4 0.6915818      2
     8:    4 0.3558270      2
     9:    4 0.3376173      2
    10:    8 0.7934963      3
    11:    8 0.9641918      3
    12:    8 0.9832120      3
    
  • 2020-11-21 16:59

    This should be fairly efficient and understandable:

    Dat$sitenum <- match(Dat$site, unique(Dat$site))  
    