问题
My question is continuation of this probem Link
I have a dataset such as this one:
ID | Date
A 01/01/2015
A 02/01/2015
A 02/01/2015
A 02/01/2015
A 05/01/2015
B 01/01/2015
I want to rank each date by a referential date - 31/01/2015. The closest date to the referential date being ranked 1, second 2, and so on. The result would look like:
ID | Date | Sequence
A 01/01/2015 3
A 02/01/2015 2
A 02/01/2015 2
A 02/01/2015 2
A 05/01/2015 1
B 01/01/2015 ...
While the rank function does think, I also want to keep all the ties. How do I do that?
Also, I am working with a huge dataset - approx. 300 million rows. So the solution would ideally be fast.
回答1:
We can use frank
from data.table
with dense
as ties.method
after grouping by 'ID' on the abs
olute difference between the 'Date' and the reference date ('2015-01-31')
library(data.table)
setDT(df)[, Sequence := frank(abs(as.IDate(Date, "%d/%m/%Y")-
as.IDate("2015-01-31")), ties.method = "dense"), by = ID]
df
# ID Date Sequence
#1: A 01/01/2015 3
#2: A 02/01/2015 2
#3: A 02/01/2015 2
#4: A 02/01/2015 2
#5: A 05/01/2015 1
#6: B 01/01/2015 1
data
df <- structure(list(ID = c("A", "A", "A", "A", "A", "B"), Date = c("01/01/2015",
"02/01/2015", "02/01/2015", "02/01/2015", "05/01/2015", "01/01/2015"
)), .Names = c("ID", "Date"), class = "data.frame", row.names = c(NA,
-6L))
回答2:
With dplyr
dense_rank
:
library(dplyr)
df$Sequence <- dense_rank(as.numeric(as.Date('31/01/2015', '%d/%m/%Y') - as.Date(df$Date, '%d/%m/%Y')))
head(df)
ID Date Sequence
1 A 01/01/2015 3
2 A 02/01/2015 2
3 A 02/01/2015 2
4 A 02/01/2015 2
5 A 05/01/2015 1
6 B 01/01/2015 3
回答3:
Here is a data.table
method that would work.
rleid
returns "IDs" for the same date by group ID. However, these IDs count up from 0. In the second chain, [
, (max(var) - var) + 1L
reverses these date IDs for each ID group.
df[, var:=rleid(Date), by=ID][, var := (max(var) - var) + 1L, by=ID]
df
ID Date var
1: A 01/01/2015 3
2: A 02/01/2015 2
3: A 02/01/2015 2
4: A 02/01/2015 2
5: A 05/01/2015 1
6: B 01/01/2015 1
回答4:
Base R
solution. First get your days and the target date by converting them to Date
objects and taking the absolute value of the difference.
timediff <- abs(as.Date(df[["Date"]], format = "%d/%m/%Y") - as.Date("2015-01-31"))
Next we can use rank
to get the order of them. We can use any ties.method
that produces a single value for the ties, but "min"
or "max"
would probably be best since they output integers.
diffrank <- rank(timediff, ties.method = "min")
Finally we can use this solution for reordering the rank to remove gaps between instances.
df[["Sequence"]] <- as.numeric(factor(diffrank))
If you would like, this can all be done in one line:
df[["Sequence"]] <- as.numeric(factor(rank(
abs(as.Date(df[["Date"]], format = "%d/%m/%Y") -
as.Date("2015-01-31")), ties.method = "min")))
来源:https://stackoverflow.com/questions/41448146/rank-by-two-columns-and-keep-ties