I have a data frame (all_data) in which I have a list of sites (1 to n) and their scores, e.g.:
site score
1 10
1 11
Another way to do it, which I think is easy to follow even if you know little about R:
library(dplyr)
df <- data.frame(site = c(1, 1, 1, 4, 4, 4, 8, 8, 8))
# Increment the counter each time site differs from the previous row
df <- mutate(df, number = cumsum(site != lag(site, default = -1)))
You can turn site into a factor and then return the numeric or integer values of that factor:
dat <- data.frame(site = rep(c(1,4,8), each = 3), score = runif(9))
dat$number <- as.integer(factor(dat$site))
dat
site score number
1 1 0.5305773 1
2 1 0.9367732 1
3 1 0.1831554 1
4 4 0.4068128 2
5 4 0.3438962 2
6 4 0.8123883 2
7 8 0.9122846 3
8 8 0.2949260 3
9 8 0.6771526 3
In dplyr 1.0.0 and later we can use cur_group_id(), which gives a unique numeric identifier to each group.
library(dplyr)
df %>% group_by(site) %>% mutate(number = cur_group_id())
# site score number
# <int> <int> <int>
#1 1 10 1
#2 1 11 1
#3 1 12 1
#4 4 10 2
#5 4 11 2
#6 4 11 2
#7 8 9 3
#8 8 8 3
#9 8 7 3
data
df <- structure(list(site = c(1L, 1L, 1L, 4L, 4L, 4L, 8L, 8L, 8L),
score = c(10L, 11L, 12L, 10L, 11L, 11L, 9L, 8L, 7L)),
class = "data.frame", row.names = c(NA, -9L))
Using the data from @Jaap, a different dplyr possibility using dense_rank() could be:
dat %>%
mutate(ID = dense_rank(site))
site score ID
1 1 0.1884490 1
2 1 0.1087422 1
3 1 0.7438149 1
4 8 0.1150771 3
5 8 0.9978203 3
6 8 0.7781222 3
7 4 0.4081830 2
8 4 0.2782333 2
9 4 0.9566959 2
10 8 0.2545320 3
11 8 0.1201062 3
12 8 0.5449901 3
Or a rleid()-like dplyr approach, with the data arranged first:
dat %>%
arrange(site) %>%
mutate(ID = with(rle(site), rep(seq_along(lengths), lengths)))
site score ID
1 1 0.1884490 1
2 1 0.1087422 1
3 1 0.7438149 1
4 4 0.4081830 2
5 4 0.2782333 2
6 4 0.9566959 2
7 8 0.1150771 3
8 8 0.9978203 3
9 8 0.7781222 3
10 8 0.2545320 3
11 8 0.1201062 3
12 8 0.5449901 3
Or using duplicated() and cumsum():
df %>%
mutate(ID = cumsum(!duplicated(site)))
The same with base R:
df$ID <- with(rle(df$site), rep(seq_along(lengths), lengths))
Or:
df$ID <- cumsum(!duplicated(df$site))
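As a rough illustration (the site vector below is made up, not taken from the question), note that both the rle() and the duplicated() variants depend on row order when the same site shows up in more than one block of rows:

```r
# Hypothetical, unsorted site vector in which site 8 appears in two separate blocks
site <- c(1, 1, 8, 8, 4, 4, 8)

# duplicated()/cumsum(): the trailing 8 is labelled 3, although the earlier 8s were labelled 2
ids_dup <- cumsum(!duplicated(site))
ids_dup
# 1 1 2 2 3 3 3

# rle(): every run gets its own number, so the trailing 8 becomes a fourth ID
ids_rle <- with(rle(site), rep(seq_along(lengths), lengths))
ids_rle
# 1 1 2 2 3 3 4

# After sorting, both variants agree and give one ID per site
sorted_site <- sort(site)
ids_sorted <- cumsum(!duplicated(sorted_site))
ids_sorted
# 1 1 2 2 3 3 3
```

So these two are safe when the data are already ordered by site (as in df above); otherwise sort first or use one of the value-based approaches.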
Another solution using the data.table package.
Example with the more complete dataset provided by Jaap:
library(data.table)
setDT(dat)[, number := frank(site, ties.method = "dense")]
dat
site score number
1: 1 0.3107920 1
2: 1 0.3640102 1
3: 1 0.1715318 1
4: 8 0.7247535 3
5: 8 0.1263025 3
6: 8 0.4657868 3
7: 4 0.6915818 2
8: 4 0.3558270 2
9: 4 0.3376173 2
10: 8 0.7934963 3
11: 8 0.9641918 3
12: 8 0.9832120 3
This should be fairly efficient and understandable:
Dat$sitenum <- match(Dat$site, unique(Dat$site))
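A small sketch of how this behaves on unsorted data (the Dat below is hypothetical, invented only for illustration): match() against unique() numbers the sites in order of first appearance, whereas converting to a factor numbers them by sorted value.

```r
# Hypothetical unsorted data, for illustration only
Dat <- data.frame(site = c(1, 1, 8, 8, 4, 4, 8))

# IDs in order of first appearance: 1 -> 1, 8 -> 2, 4 -> 3
Dat$sitenum <- match(Dat$site, unique(Dat$site))

# IDs in sorted order of the site values: 1 -> 1, 4 -> 2, 8 -> 3
Dat$sitenum_sorted <- as.integer(factor(Dat$site))

Dat
#   site sitenum sitenum_sorted
# 1    1       1              1
# 2    1       1              1
# 3    8       2              3
# 4    8       2              3
# 5    4       3              2
# 6    4       3              2
# 7    8       2              3
```

Either way, each site gets one consistent ID regardless of row order; which numbering is preferable depends on whether the IDs should follow the data order or the site values.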