问题
This kind of thing has been asked before, but not quite in this way that I can find.
Thread about creating sequential IDs, with several additional links
It's not hard to create identifiers in a sequence, but my data includes a temporal element that has thrown me for a loop. The following data is an imaginary dataset just to illustrate the problem in something tractable:
dput(walking_dat)
structure(list(neighborhood = structure(c(3L, 3L, 3L, 3L, 3L,
2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L), .Label = c("Dinkytown", "Downtown",
"Uptown"), class = "factor"), street = structure(c(4L, 3L, 3L,
5L, 3L, 4L, 6L, 7L, 4L, 4L, 1L, 2L, 1L), .Label = c("12thAve",
"14thAve", "Dupont", "Hennepin", "Lyndale", "Marquette", "Nicolette"
), class = "factor"), sequence = c(1, 2, 3, 4, 5, 1, 2, 3, 4,
5, 1, 2, 3), visit = c(1, 1, 1, 1, 2, 1, 1, 1, 2, 2, 1, 1, 2)), .Names = c("neighborhood",
"street", "sequence", "visit"), row.names = c(NA, -13L), class = "data.frame")
neighborhood street sequence visit
1 Uptown Hennepin 1 1
2 Uptown Dupont 2 1
3 Uptown Dupont 3 1
4 Uptown Lyndale 4 1
5 Uptown Dupont 5 2
6 Downtown Hennepin 1 1
7 Downtown Marquette 2 1
8 Downtown Nicolette 3 1
9 Downtown Hennepin 4 2
10 Downtown Hennepin 5 2
11 Dinkytown 12thAve 1 1
12 Dinkytown 14thAve 2 1
13 Dinkytown 12thAve 3 2
All data are, for the sake of imagination, from three individuals walking east in three neighborhoods of Minneapolis. Each row represents a time when their location was recorded. The first column is the neighborhood they are walking through. The second column is the intersection where they were located at each time-point. The third column is the sequence that these data occurred.
I want to create the visit
column that records sequential time-points at the same street, in the same neighborhood, as a single visit, and later return visits as the next visit. How do I create this sort of sequential identifier?
I was thinking this ave()
with FUN=seq_along
trick might work, but I can't find a way of combining the factors that gets me where I want to be.
Create a sequential number (counter) for rows within each group of a dataframe [duplicate]
Update: Uwe's solution works, but is broken if someone decides to stay at one intersection for all the measurements, which is what happened when I tried to put this to real data. If this happens, then the original number of rows are not returned to the final data.table. See what happens here:
dput(walking_dat_2)
structure(list(neighborhood = structure(c(3L, 3L, 3L, 3L, 3L,
2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L), .Label = c("Dinkytown", "Downtown",
"Uptown"), class = "factor"), street2 = structure(c(2L, 2L, 2L,
2L, 2L, 2L, 3L, 4L, 2L, 2L, 1L, 1L, 1L), .Label = c("12thAve",
"Hennepin", "Marquette", "Nicolette"), class = "factor"), sequence = c(1,
2, 3, 4, 5, 1, 2, 3, 4, 5, 1, 2, 3), visit_2 = c(1, 1, 1, 1,
1, 1, 1, 1, 2, 2, 1, 1, 1)), .Names = c("neighborhood", "street2",
"sequence", "visit_2"), row.names = c(NA, -13L), class = "data.frame")
neighborhood street2 sequence visit_2
1 Uptown Hennepin 1 1
2 Uptown Hennepin 2 1
3 Uptown Hennepin 3 1
4 Uptown Hennepin 4 1
5 Uptown Hennepin 5 1
6 Downtown Hennepin 1 1
7 Downtown Marquette 2 1
8 Downtown Nicolette 3 1
9 Downtown Hennepin 4 2
10 Downtown Hennepin 5 2
11 Dinkytown 12thAve 1 1
12 Dinkytown 12thAve 2 1
13 Dinkytown 12thAve 3 1
in this case, running Uwe's solution returns only 6 rows.
library(data.table)
setDT(walking_dat)[, visit_2 := rleid(neighborhood, street2)][
, unique(.SD, by = "visit_2")][
, visit_2 := rowid(neighborhood, street2)][
walking_dat, on = .(neighborhood, street2, sequence), roll = TRUE, visit_2 := x.visit_2][]
neighborhood street2 sequence visit visit_2
1: Uptown Hennepin 1 1 1
2: Downtown Hennepin 1 2 1
3: Downtown Marquette 2 3 1
4: Downtown Nicolette 3 4 1
5: Downtown Hennepin 4 5 2
6: Dinkytown 12thAve 1 6 1
回答1:
The difficulty here is that subsequent recordings to the same street in the same neighborhood should be counted as one visit. This requires to collapse these rows into one, count the visits to different neighborhoods & streets and finally expand this to the original number of rows.
Note that column visit
containing the expected result is not overwritten but kept for comparison with the computed visit_new
column.
library(data.table)
setDT(walking_dat)[, visit_new := rleid(neighborhood, street)][
, unique(.SD, by = "visit_new")][
, visit_new := rowid(neighborhood, street)][
walking_dat, on = .(neighborhood, street, sequence), roll = TRUE, .SD]
neighborhood street sequence visit visit_new 1: Uptown Hennepin 1 1 1 2: Uptown Dupont 2 1 1 3: Uptown Dupont 3 1 1 4: Uptown Lyndale 4 1 1 5: Uptown Dupont 5 2 2 6: Downtown Hennepin 1 1 1 7: Downtown Marquette 2 1 1 8: Downtown Nicolette 3 1 1 9: Downtown Hennepin 4 2 2 10: Downtown Hennepin 5 2 2 11: Dinkytown 12thAve 1 1 1 12: Dinkytown 14thAve 2 1 1 13: Dinkytown 12thAve 3 2 2
Explanation step by step
DF
is coerced to data.table. The rleid()
function creates unique numbers for changes in neighborhood & street.
setDT(walking_dat)[, visit_new := rleid(neighborhood, street)][]
neighborhood street sequence visit visit_new 1: Uptown Hennepin 1 1 1 2: Uptown Dupont 2 1 2 3: Uptown Dupont 3 1 2 4: Uptown Lyndale 4 1 3 5: Uptown Dupont 5 2 4 6: Downtown Hennepin 1 1 5 7: Downtown Marquette 2 1 6 8: Downtown Nicolette 3 1 7 9: Downtown Hennepin 4 2 8 10: Downtown Hennepin 5 2 8 11: Dinkytown 12thAve 1 1 9 12: Dinkytown 14thAve 2 1 10 13: Dinkytown 12thAve 3 2 11
Note that rows 2 & 3 are repeated as well as rows 9 & 10. The duplicates are removed in the next step which creates a new, temporary data.table object:
setDT(walking_dat)[, visit_new := rleid(neighborhood, street)][
, unique(.SD, by = "visit_new")][]
neighborhood street sequence visit visit_new 1: Uptown Hennepin 1 1 1 2: Uptown Dupont 2 1 2 3: Uptown Lyndale 4 1 3 4: Uptown Dupont 5 2 4 5: Downtown Hennepin 1 1 5 6: Downtown Marquette 2 1 6 7: Downtown Nicolette 3 1 7 8: Downtown Hennepin 4 2 8 9: Dinkytown 12thAve 1 1 9 10: Dinkytown 14thAve 2 1 10 11: Dinkytown 12thAve 3 2 11
Now, we can number the visits to distinct neighborhoods and streets using the rowid()
function:
setDT(walking_dat)[, visit_new := rleid(neighborhood, street)][
, unique(.SD, by = "visit_new")][
, visit_new := rowid(neighborhood, street)][]
neighborhood street sequence visit visit_new 1: Uptown Hennepin 1 1 1 2: Uptown Dupont 2 1 1 3: Uptown Lyndale 4 1 1 4: Uptown Dupont 5 2 2 5: Downtown Hennepin 1 1 1 6: Downtown Marquette 2 1 1 7: Downtown Nicolette 3 1 1 8: Downtown Hennepin 4 2 2 9: Dinkytown 12thAve 1 1 1 10: Dinkytown 14thAve 2 1 1 11: Dinkytown 12thAve 3 2 2
Finally, we need to expand the result to the original number of rows again. This is accomplished by a rolling join of the temporary data.table with the original DF
(all rows included):
setDT(walking_dat)[, visit_new := rleid(neighborhood, street)][
, unique(.SD, by = "visit_new")][
, visit_new := rowid(neighborhood, street)][
walking_dat, on = .(neighborhood, street, sequence), roll = TRUE, .SD]
Perhaps, it might be worth to note that visit_new
is used and re-used to hold temporary data through various stages until the final update.
New data set
The fixed code works also with the second data set provided by the OP:
walking_dat_2 <-
structure(list(neighborhood = structure(c(3L, 3L, 3L, 3L, 3L,
2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L), .Label = c("Dinkytown", "Downtown",
"Uptown"), class = "factor"), street = structure(c(2L, 2L, 2L,
2L, 2L, 2L, 3L, 4L, 2L, 2L, 1L, 1L, 1L), .Label = c("12thAve",
"Hennepin", "Marquette", "Nicolette"), class = "factor"), sequence = c(1,
2, 3, 4, 5, 1, 2, 3, 4, 5, 1, 2, 3), visit = c(1, 1, 1, 1, 1,
1, 1, 1, 2, 2, 1, 1, 1), visit_new = c(1L, 1L, 1L, 1L, 1L, 2L,
3L, 4L, 5L, 5L, 6L, 6L, 6L)), .Names = c("neighborhood", "street",
"sequence", "visit", "visit_new"), row.names = c(NA, -13L), class = "data.frame")
setDT(walking_dat_2)[, visit_new := rleid(neighborhood, street)][
, unique(.SD, by = "visit_new")][
, visit_new := rowid(neighborhood, street)][
walking_dat_2, on = .(neighborhood, street, sequence),
roll = TRUE, .SD]
neighborhood street sequence visit visit_new 1: Uptown Hennepin 1 1 1 2: Uptown Hennepin 2 1 1 3: Uptown Hennepin 3 1 1 4: Uptown Hennepin 4 1 1 5: Uptown Hennepin 5 1 1 6: Downtown Hennepin 1 1 1 7: Downtown Marquette 2 1 1 8: Downtown Nicolette 3 1 1 9: Downtown Hennepin 4 2 2 10: Downtown Hennepin 5 2 2 11: Dinkytown 12thAve 1 1 1 12: Dinkytown 12thAve 2 1 1 13: Dinkytown 12thAve 3 1 1
回答2:
# Not required, but convenient:
walking_dat$combo <- paste(walking_dat$neighborhood, walking_dat$street)
# Place holder:
walking_dat$visit <- NA
# Create it:
for(i in 1:nrow(walking_dat)){
if(i %in% row.names(walking_dat[with(walking_dat, c(TRUE, diff(as.numeric(interaction(neighborhood, street))) != 0)), ])){
walking_dat$visit[i] <- sum(walking_dat$combo[with(walking_dat, c(TRUE, diff(as.numeric(interaction(neighborhood, street))) != 0))][1:i]==walking_dat$combo[i], na.rm=T)
} else{
walking_dat$visit[i] <- 1
}
}
walking_dat
neighborhood street sequence visit combo 1 Uptown Hennepin 1 1 Uptown Hennepin 2 Uptown Dupont 2 1 Uptown Dupont 3 Uptown Dupont 3 1 Uptown Dupont 4 Uptown Lyndale 4 1 Uptown Lyndale 5 Uptown Dupont 5 2 Uptown Dupont 6 Downtown Hennepin 1 1 Downtown Hennepin 7 Downtown Marquette 2 1 Downtown Marquette 8 Downtown Nicolette 3 1 Downtown Nicolette 9 Downtown Hennepin 4 2 Downtown Hennepin 10 Downtown Hennepin 5 1 Downtown Hennepin 11 Dinkytown 12thAve 1 2 Dinkytown 12thAve 12 Dinkytown 14thAve 2 1 Dinkytown 14thAve 13 Dinkytown 12thAve 3 2 Dinkytown 12thAve
来源:https://stackoverflow.com/questions/46507540/creating-unique-groups-in-sequential-data-that-repeats-through-time