Question
I need to implement a ranking algorithm for numeric vectors. I don't know if it's possible to do it using functions like rank(), order() or sort() in R, or if I should hard-code it. Either way, I could not do it.
The algorithm works as follows:
Let x = (x_1, x_2, ..., x_n) and y = (y_1, y_2, ..., y_n) be two vectors. We need to build the vector z composed of the ranked elements of x this way:
1. If x_i < x_j then z_i < z_j
2. If x_i = x_j then
   z_i < z_j if y_i < y_j
   z_i > z_j if y_i > y_j
   z_i = z_j if y_i = y_j
3. If x_i is NA (missing) then
   z_i > z_j if z_j is not NA
   z_i = z_j if z_j is NA
For example, if x = (30,15,27,49,15) and y = (12,11,10,9,8) then z = (4,2,3,5,1)
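As a reference point, the rules above can be written down directly as a pairwise count: z_i is one plus the number of elements that must rank strictly below element i. The following is only a minimal base R sketch of that reading, not a recommended implementation, since it compares every pair. The name rank_pairwise is hypothetical, y is assumed to contain no NA, and tied elements all receive the smallest rank of their group.

```r
# Hypothetical helper: a direct, quadratic-time reading of the ranking rules.
rank_pairwise <- function(x, y) {
  vapply(seq_along(x), function(i) {
    if (is.na(x[i])) {
      # A missing x_i ranks after every non-missing element
      # and ties with the other missing ones.
      return(1L + sum(!is.na(x)))
    }
    # Elements that must rank strictly below element i:
    # smaller x, or equal x with smaller y.
    before <- (x < x[i]) | (x == x[i] & y < y[i])
    # Comparisons against a missing x_j are NA and are dropped here.
    1L + sum(before, na.rm = TRUE)
  }, integer(1))
}

rank_pairwise(c(30, 15, 27, 49, 15), c(12, 11, 10, 9, 8))
# [1] 4 2 3 5 1
rank_pairwise(c(30, 15, 27, 49, 15), c(12, 8, 10, 9, 8))
# [1] 4 1 3 5 1
rank_pairwise(c(2, NA, 1, NA), c(1, 2, 3, 4))
# [1] 2 3 1 3
```

Because it follows the rules literally, a function like this can serve as a slow check for faster implementations on small inputs.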
I thought I could use order(order(x,y, na.last=T)), and in fact it works as long as the ties in x are not also tied in y. When they are, order() ranks them in order of appearance instead of leaving them tied.
For example, if x = (30,15,27,49,15) and y = (12,8,10,9,8) then order(order(x,y, na.last=T)) will output z = (4,1,3,5,2) instead of z = (4,1,3,5,1) or another z (such as (3,1,2,4,1)) that respects step 2.
I could not escape that. How can I proceed?
Answer 1:
tl;dr: I think version 1 is best. Versions 2 and 3 were early ideas that are not as good, but I leave them here in case they are useful to anyone.
Unfortunately, rank does not provide the ability to break ties using a second vector (a useful capability that order does provide).
Version 1
However, library(data.table) provides frank(), which does the job nicely.
library(data.table)
x = c(30,15,27,49,15)
y = c(12,11,10,9,8)
frank(list(x,y), ties.method = "min")
# [1] 4 2 3 5 1
x = c(30,15,27,49,15)
y = c(12,8,10,9,8)
frank(list(x,y), ties.method = "min")
# [1] 4 1 3 5 1
Note that frank also offers ties.method = "dense", which may be better for some uses because it does not skip ranks (i.e. when two values are given rank 1, the next largest gets rank 2 rather than 3):
frank(list(x,y), ties.method = "dense")
# [1] 3 1 2 4 1
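The relationship between the "min" and "dense" results can be illustrated in base R. This is a small sketch that assumes the "min" ranks are already available (here typed in by hand from the example) and renumbers them without gaps. The variable names z_min and z_dense are only illustrative.

```r
# "min" ranks from the tied example: two elements share rank 1, rank 2 is skipped.
z_min <- c(4, 1, 3, 5, 1)

# Renumber the distinct ranks consecutively: 1 -> 1, 3 -> 2, 4 -> 3, 5 -> 4.
z_dense <- match(z_min, sort(unique(z_min)))
z_dense
# [1] 3 1 2 4 1
```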
Version 2
If you want to stick to base R, one simple workaround is to rank x * K + y, where K is any number large enough that adding y to x*K cannot change the order determined by x:
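ranky = function(x,y) {
  # smallest gap between distinct x values (empty if all x are equal)
  gaps = diff(sort(unique(x)))
  # K must exceed (max(y) - min(y)) / min(gaps); using the spread of y
  # rather than max(y) also covers negative y
  K = if (length(gaps) > 0) 1 + diff(range(y)) / min(gaps) else 1
  rank(x*K + y, ties.method = 'min')
}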
ranky(c(30,15,27,49,15), c(12,11,10,9,8) )
# [1] 4 2 3 5 1
ranky(c(30,15,27,49,15), c(12,8,10,9,8))
# [1] 4 1 3 5 1
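To see why K has to be large enough, here is a small hedged illustration with made-up data (not from the question). For x_i < x_j the combined score keeps the right order only if K * (x_j - x_i) > y_i - y_j, so K must exceed the spread of y divided by the smallest gap between distinct x values. The variable names are only illustrative.

```r
x <- c(1, 2)
y <- c(10, 0)

gap    <- min(diff(sort(unique(x))))  # smallest gap between distinct x values: 1
spread <- diff(range(y))              # max(y) - min(y): 10

# K = 1 is below spread / gap, so y overrides the order of x.
too_small <- rank(x * 1 + y, ties.method = "min")
too_small
# [1] 2 1

# Any K above spread / gap restores the order given by x.
large_enough <- rank(x * (1 + spread / gap) + y, ties.method = "min")
large_enough
# [1] 1 2
```

With floating-point inputs, very large K values can also swallow small differences in y, so this trick is best suited to moderately sized numbers.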
Version 3
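Also in base R, you could paste together fixed-width string representations of each and then rank the combined character vector. This assumes non-negative whole numbers that fit within the chosen width, so that string order matches numeric order.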
rank(paste(
  formatC(x, width = 15, flag = "0"),
  formatC(y, width = 15, flag = "0")),
  ties.method = 'min')
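The fixed width matters because character vectors are compared character by character. A minimal sketch of the difference, assuming the usual collation in which digits sort in numeric order (raw and padded are illustrative names):

```r
# Without padding, "10" sorts before "9" because "1" comes before "9".
raw <- rank(c("9", "10"))
raw
# [1] 2 1

# Zero-padding to a common width makes string order agree with numeric order.
padded <- rank(formatC(c(9, 10), width = 2, flag = "0"))
padded
# [1] 1 2
```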
Answer 2:
An option using data.table:
library(data.table)
f <- function(x, y) {
  data.table(x, y)[order(x, y), r := .I][, r := min(r), .(x, y)]$r
}
f(c(30,15,27,49,15), c(12,11,10,9,8))
#[1] 4 2 3 5 1
f(c(30,15,27,49,15), c(12,8,10,9,8))
#[1] 4 1 3 5 1
Or what should be a faster version:
f <- function(x, y) {
  DT <- setindex(data.table(x, y), x, y)[order(x, y), r := .I]
  if (uniqueN(data.table(x, y))==DT[, .N])
    DT$r
  else
    DT[,r := min(r), .(x, y)]$r
}
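The same idea (number the rows in sorted order, then give every identical (x, y) pair the smallest number in its group) can also be sketched in base R. This is only an illustrative variant, not part of the answer's data.table code. The name rank_by_runs is hypothetical, and the sketch assumes non-empty vectors without NA.

```r
# Hypothetical base R variant: sort once, then collapse runs of equal (x, y) pairs.
rank_by_runs <- function(x, y) {
  n  <- length(x)
  o  <- order(x, y)   # positions sorted by x, then by y
  xs <- x[o]
  ys <- y[o]
  # TRUE at every sorted position where a new (x, y) combination starts
  starts <- c(TRUE, xs[-1] != xs[-n] | ys[-1] != ys[-n])
  # Sorted position of the first row of each run, repeated along the run
  first_pos <- which(starts)[cumsum(starts)]
  z <- integer(n)
  z[o] <- first_pos   # map the ranks back to the original order
  z
}

rank_by_runs(c(30, 15, 27, 49, 15), c(12, 11, 10, 9, 8))
# [1] 4 2 3 5 1
rank_by_runs(c(30, 15, 27, 49, 15), c(12, 8, 10, 9, 8))
# [1] 4 1 3 5 1
```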
Answer 3:
You could write a function to do this:
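my_order <- function(x,y){
  # ranks of x, with ties provisionally broken by position
  a <- rank(x,ties.method = "first")
  # the values of x that occur more than once, restored to the class of x
  b <- `class<-`(names(which(table(x)>1)),class(x))
  # within each repeated value, reassign the provisional ranks by rank of y
  c(apply(outer(x,b,'=='),2,function(m)a[m]<<-a[m][rank(y[m])]))
  a
}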
The apply call is needed because x can contain more than one repeated value:
x = c(30,15,27,49,15)
y = c(12,8,10,9,8)
my_order(x,y)
# [1] 4 1 3 5 1
my_order(c(2,1,1,2),c(6,4,2,6))
# [1] 3 2 1 3
Compare with:
order(order(c(2,1,1,2),c(6,4,2,6)))
# [1] 3 2 1 4
Source: https://stackoverflow.com/questions/60217096/how-to-rank-a-vector-using-a-second-vector-as-a-tie-breaker