Rank by two columns and keep ties

问题

My question is continuation of this probem Link

I have a dataset such as this one:

 ID    |     Date 

  A        01/01/2015
  A        02/01/2015
  A        02/01/2015
  A        02/01/2015
  A        05/01/2015     
  B        01/01/2015

I want to rank each date by a referential date - 31/01/2015. The closest date to the referential date being ranked 1, second 2, and so on. The result would look like:

  ID    |     Date           |  Sequence

  A        01/01/2015           3
  A        02/01/2015           2
  A        02/01/2015           2
  A        02/01/2015           2
  A        05/01/2015           1  
  B        01/01/2015          ...

While the rank function does think, I also want to keep all the ties. How do I do that?

Also, I am working with a huge dataset - approx. 300 million rows. So the solution would ideally be fast.

回答1:

We can use frank from data.table with dense as ties.method after grouping by 'ID' on the absolute difference between the 'Date' and the reference date ('2015-01-31')

library(data.table)
setDT(df)[, Sequence := frank(abs(as.IDate(Date, "%d/%m/%Y")- 
              as.IDate("2015-01-31")), ties.method = "dense"), by = ID]
df
#    ID       Date Sequence
#1:  A 01/01/2015        3
#2:  A 02/01/2015        2
#3:  A 02/01/2015        2
#4:  A 02/01/2015        2
#5:  A 05/01/2015        1
#6:  B 01/01/2015        1

data

df <- structure(list(ID = c("A", "A", "A", "A", "A", "B"), Date = c("01/01/2015", 
 "02/01/2015", "02/01/2015", "02/01/2015", "05/01/2015", "01/01/2015"
)), .Names = c("ID", "Date"), class = "data.frame", row.names = c(NA, 
-6L))

回答2:

With dplyr dense_rank:

library(dplyr)
df$Sequence <- dense_rank(as.numeric(as.Date('31/01/2015', '%d/%m/%Y') - as.Date(df$Date, '%d/%m/%Y')))
head(df) 

  ID       Date Sequence
1  A 01/01/2015        3
2  A 02/01/2015        2
3  A 02/01/2015        2
4  A 02/01/2015        2
5  A 05/01/2015        1
6  B 01/01/2015        3

回答3:

Here is a data.table method that would work.

rleid returns "IDs" for the same date by group ID. However, these IDs count up from 0. In the second chain, [, (max(var) - var) + 1L reverses these date IDs for each ID group.

df[, var:=rleid(Date), by=ID][, var := (max(var) - var) + 1L, by=ID]
df
   ID       Date var
1:  A 01/01/2015   3
2:  A 02/01/2015   2
3:  A 02/01/2015   2
4:  A 02/01/2015   2
5:  A 05/01/2015   1
6:  B 01/01/2015   1

回答4:

Base R solution. First get your days and the target date by converting them to Date objects and taking the absolute value of the difference.

timediff <- abs(as.Date(df[["Date"]], format = "%d/%m/%Y") - as.Date("2015-01-31"))

Next we can use rank to get the order of them. We can use any ties.method that produces a single value for the ties, but "min" or "max" would probably be best since they output integers.

diffrank <- rank(timediff, ties.method = "min")

Finally we can use this solution for reordering the rank to remove gaps between instances.

df[["Sequence"]] <- as.numeric(factor(diffrank))

If you would like, this can all be done in one line:

df[["Sequence"]] <- as.numeric(factor(rank(
                        abs(as.Date(df[["Date"]], format = "%d/%m/%Y") - 
                               as.Date("2015-01-31")), ties.method = "min")))

来源：https://stackoverflow.com/questions/41448146/rank-by-two-columns-and-keep-ties

标签

data.table

dplyr

rank