How to rank a vector using a second vector as a tie breaker?

Question

I need to implement a ranking algorithm for numeric vectors. I don't know if it's possible to do it using functions like rank(), order() or sort() in R, or if I should hard-code it. Either way, I could not do it.

The algorithm works as follows:

Let x = (x_1, x_2, ..., x_n) and y = (y_1, y_2, ..., y_n) be two vectors. We need to build the vector z holding the ranks of the elements of x, as follows:

If x_i < x_j then z_i < z_j
If x_i = x_j then
- z_i < z_j if y_i < y_j
- z_i > z_j if y_i > y_j
- z_i = z_j if y_i = y_j
If x_i is NA (missing) then
- z_i > z_j if x_j is not NA
- z_i = z_j if x_j is NA

For example, if x = (30,15,27,49,15) and y = (12,11,10,9,8) then z = (4,2,3,5,1)

I thought I could use order(order(x,y, na.last=T)), and in fact it works as long as the ties in x do not tie in y as well. When they do, order() ranks them in order of appearance instead of leaving them tied.

For example, if x = (30,15,27,49,15) and y = (12,8,10,9,8) then order(order(x,y, na.last=T)) will output z = (4,1,3,5,2) instead of z = (4,1,3,5,1) or another z (such as (3,1,2,4,1)) that respects step 2.
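The behaviour described above can be reproduced with a few lines of base R. This is only a minimal sketch of the attempt, using the same x and y as in the example:

```r
x <- c(30, 15, 27, 49, 15)
y <- c(12, 8, 10, 9, 8)

# order() gives the sorting permutation; applying order() again
# turns that permutation into ranks, but ties are broken by position.
z <- order(order(x, y, na.last = TRUE))
print(z)
# [1] 4 1 3 5 2   (elements 2 and 5 are identical in x and y, yet get ranks 1 and 2)
```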

I could not escape that. How can I proceed?

Answer 1:

tl;dr: I think version 1 is best. Versions 2 and 3 were early ideas that are not as good, but I leave them here in case they are useful to anyone.

Unfortunately rank does not provide the ability to break ties using a second vector (a useful capability that order and sort do allow).

Version 1

However, the data.table package provides frank(), which does the job nicely:

library(data.table)
x = c(30,15,27,49,15)
y = c(12,11,10,9,8)
frank(list(x,y), ties.method = "min")
# [1] 4 2 3 5 1

x = c(30,15,27,49,15) 
y = c(12,8,10,9,8)
frank(list(x,y), ties.method = "min")
# [1] 4 1 3 5 1

Note that frank also offers ties.method = "dense", which may be better for some uses because it does not skip ranks (i.e. when two values share rank 1, the next largest value gets rank 2 rather than 3):

frank(list(x,y), ties.method = "dense")
# [1] 3 1 2 4 1

Version 2

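If you want to stick to base R, one simple workaround is to rank x * K + y, where K is any number large enough that differences in y can never outweigh the smallest gap between distinct values of x (this assumes there are no missing values):

ranky = function(x,y) {
  # K must exceed (largest possible difference in y) / (smallest gap between distinct x values);
  # using diff(range(y)) rather than max(y) keeps this valid when y contains negative values
  K = 1 + diff(range(y)) / min(diff(sort(unique(x))))
  rank(x*K + y, ties.method = 'min')
}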

ranky(c(30,15,27,49,15), c(12,11,10,9,8) )
# [1] 4 2 3 5 1    
ranky(c(30,15,27,49,15), c(12,8,10,9,8))
# [1] 4 1 3 5 1
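A quick way to see why the scaling works is to inspect the combined keys directly. The sketch below uses the second example and computes K from the spread of y; the names gap, spread and key are purely illustrative. Since (x_j - x_i) * K >= gap * K = gap + spread > spread >= y_i - y_j whenever x_i < x_j, a larger x always yields a larger key.

```r
x <- c(30, 15, 27, 49, 15)
y <- c(12, 8, 10, 9, 8)

gap    <- min(diff(sort(unique(x))))  # smallest gap between distinct x values (3 here)
spread <- diff(range(y))              # largest possible difference within y (4 here)
K      <- 1 + spread / gap

key <- x * K + y                      # one numeric key per (x, y) pair

# Sorting by the key agrees with sorting by x first and y second
same_order <- identical(order(key), order(x, y))
print(same_order)
# [1] TRUE

print(rank(key, ties.method = "min"))
# [1] 4 1 3 5 1
```

Because the key is a floating-point number, this trick is best kept to values of moderate size where rounding cannot merge or split keys.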

Version 3

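Also in base R, you could paste together fixed-width, zero-padded string representations of each vector and then rank the combined character vector. As written, this is only reliable for small non-negative whole numbers, because formatC() may drop digits or switch to scientific notation for other values, and the padded strings would then no longer sort in numeric order.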

rank(paste(
      formatC(x, width = 15, flag = "0"), 
      formatC(y, width = 15, flag = "0")), 
     ties.method = 'min')
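As a further base R sketch that avoids both the scaling constant and the string formatting, the tied groups can be detected after sorting with order() and duplicated(). The function name rank_tiebreak and its dense argument are hypothetical, introduced here only for illustration; the function also assumes that all elements with a missing x should share the last rank, as the question asks.

```r
# Hypothetical helper (not part of base R): rank x, breaking ties with y.
rank_tiebreak <- function(x, y, dense = FALSE) {
  stopifnot(length(x) == length(y))
  y[is.na(x)] <- NA                      # all rows with a missing x form one tied group
  o <- order(x, y, na.last = TRUE)       # sort by x, then by y; missing values go last
  sorted <- data.frame(x = x[o], y = y[o])
  first <- !duplicated(sorted)           # TRUE where a new (x, y) group starts
  r <- if (dense) {
    cumsum(first)                        # consecutive ranks, no gaps after ties
  } else {
    cummax(ifelse(first, seq_along(o), 0L))  # each group gets its smallest position
  }
  z <- integer(length(x))
  z[o] <- r                              # map ranks back to the original positions
  z
}

rank_tiebreak(c(30, 15, 27, 49, 15), c(12, 11, 10, 9, 8))
# [1] 4 2 3 5 1
rank_tiebreak(c(30, 15, 27, 49, 15), c(12, 8, 10, 9, 8))
# [1] 4 1 3 5 1
rank_tiebreak(c(30, 15, 27, 49, 15), c(12, 8, 10, 9, 8), dense = TRUE)
# [1] 3 1 2 4 1
rank_tiebreak(c(2, NA, 1, NA), c(1, 5, 1, 7))
# [1] 2 3 1 3
```

The default mirrors ties.method = "min", while dense = TRUE mirrors the "dense" behaviour described for frank().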

Answer 2:

An option using data.table:

library(data.table)
f <- function(x, y) {
    data.table(x, y)[order(x, y), r := .I][, r := min(r), .(x, y)]$r
}

f(c(30,15,27,49,15), c(12,11,10,9,8))
#[1] 4 2 3 5 1

f(c(30,15,27,49,15), c(12,8,10,9,8))
#[1] 4 1 3 5 1

Or what should be a faster version:

f <- function(x, y) {
    DT <- setindex(data.table(x, y), x, y)[order(x, y), r := .I]

    if (uniqueN(data.table(x, y))==DT[, .N]) 
        DT$r
    else 
        DT[,r := min(r), .(x, y)]$r
}

Answer 3:

You could write a function to do this:

my_order <- function(x,y){
  a <- rank(x,ties.method = "first")
  b <- `class<-`(names(which(table(x)>1)),class(x))
  c(apply(outer(x,b,'=='),2,function(m)a[m]<<-a[m][rank(y[m])]))
  a
}

The apply call is needed because more than one value of x can be repeated:

x = c(30,15,27,49,15)
y = c(12,8,10,9,8)
my_order(x,y)
# [1] 4 1 3 5 1

my_order(c(2,1,1,2),c(6,4,2,6))
# [1] 3 2 1 3

Compare with:

order(order(c(2,1,1,2),c(6,4,2,6)))
# [1] 3 2 1 4

Source: https://stackoverflow.com/questions/60217096/how-to-rank-a-vector-using-a-second-vector-as-a-tie-breaker

Tags

sorting

ranking