Specific group rankings in R

Question

I have a data frame with the columns "Category", "ID" and "Score(t)" (one score column per month), and I want to compute the matching "Rank(t)" columns:

Category    ID          Score.08.2007   Score.09.2007    Rank.08.2007    Rank.09.2007   ...
Orange      FSGBR070N3  0.16            ...              5               ...
Orange      FSGBR070N3  0.05            ...              7               ...
Orange      FSGBR070N3  0.11                             6
Orange      FS00008L4G  0.28                             1
Orange      FS00008VLD  0.27                             2
Orange      FS00008VLD  0.27                             2
Orange      FS00008VLD  0.27                             2
Orange      FS00009SQX  -2.03                            8
Orange      FS00009SQX  NA                          
Orange      FSUSA0A1KW  NA          
Orange      FSUSA0A1KW  NA  
Orange      FSUSA0A1KX  NA  
Orange      FSUSA0A1KY  NA  
Orange      FS0000B389  NA  
Banana      FS000092GP  96.25                            1
Banana      FS000092GP  96.25                            1
Banana      FS000092GP  96.25                            1
Banana      FS000092GP  52.33                            4
Banana      FS0000ATLN  31.73                            5
Banana      FSUSA0AVMF  1.38                             7
Banana      FSGBR058O8  1.37                             8
Banana      FSGBR05845  2.24                             6

Ranks are assigned within each "Category" by sorting "Score" in descending order. The additional requirement, which I am struggling to capture, concerns ties: when several rows share the same score AND the same ID, they all receive the same rank, and the next distinct score receives that rank plus the number of tied rows (the rank output column in the example should make this clear).

NAs should receive no rank:

na.last = NA
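
For reference, the two requirements above (tied scores share the lowest rank, NAs get no rank) can be tried on a single vector with base R's rank(). This is only a minimal sketch: the vector `scores` is a hypothetical example built from the Banana values plus one NA. Note that `na.last = NA` drops the NA entries from the result, whereas `na.last = "keep"` leaves them in place as NA, which is probably what a rank column aligned with the original rows needs.

```r
# Hypothetical scores: the Banana values from the example plus one NA.
scores <- c(96.25, 96.25, 96.25, 52.33, 31.73, 1.38, 1.37, 2.24, NA)

# Negate so the highest score gets rank 1; "min" gives tied scores the lowest
# rank of the tie, so the next distinct score skips ahead by the tie size.
ranks_keep <- rank(-scores, ties.method = "min", na.last = "keep")
ranks_keep
# 1 1 1 4 5 7 8 6 NA

# na.last = NA removes the NA entries, so the result is one element shorter.
ranks_drop <- rank(-scores, ties.method = "min", na.last = NA)
length(ranks_drop)
# 8
```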

I started by creating a matrix for the ranks and would probably need sort() next, but I am struggling to handle the time series together with the additional tie requirement, and I couldn't find an existing question this specific. Help appreciated!

time_series <- c("08.2007","09.2007","10.2007",...)
abs_ranks_mat <- as.data.frame(mat.or.vec(nrow(ID),length(time_series)))

Answer 1:

Here is a solution using dplyr. df is the example data posted by @trosendal, and df3 is the final output.

The key is to use the min_rank function to create the ranks: tied scores share the lowest rank of the tie, and NA scores stay NA. mutate_at lets us specify which columns should or should not be ranked. After that, we rename the ranked columns and join them back to the original data frame.

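library(dplyr)

# Add a row identifier so the ranks can be joined back to the original rows.
df <- df %>% mutate(RowID = 1:n())

df2 <- df %>%
  group_by(Category) %>%
  # Rank every score column in descending order within each category;
  # min_rank() gives tied scores the same (lowest) rank and leaves NA as NA.
  # The formula replaces funs(), which is deprecated in current dplyr.
  mutate_at(vars(-ID, -RowID), ~ min_rank(desc(.))) %>%
  ungroup() %>%
  select(-Category, -ID) %>%
  setNames(., gsub("Score", "Rank", colnames(.)))

# Attach the rank columns to the original data.
df3 <- df %>% 
  left_join(df2, by = "RowID") %>%
  select(-RowID)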
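
In dplyr 1.0.0 and later, mutate_at() is superseded by across(). The following is only a sketch of the same idea in that style, not the code from the answer above. The small data frame `toy`, the intermediate "Rank_" prefix and the renaming step are assumptions made for illustration.

```r
library(dplyr)

# Hypothetical toy data with the same column layout as the question.
toy <- data.frame(
  Category = c("Banana", "Banana", "Banana", "Banana", "Orange", "Orange"),
  ID = c("A", "A", "A", "B", "C", "D"),
  Score.08.2007 = c(96.25, 96.25, 96.25, 52.33, 0.28, NA),
  stringsAsFactors = FALSE
)

ranked <- toy %>%
  group_by(Category) %>%
  # Add one rank column per score column; NA scores keep an NA rank.
  mutate(across(starts_with("Score"), ~ min_rank(desc(.x)), .names = "Rank_{.col}")) %>%
  ungroup() %>%
  # Turn "Rank_Score.08.2007" into "Rank.08.2007".
  rename_with(~ sub("^Rank_Score", "Rank", .x), starts_with("Rank_"))

ranked$Rank.08.2007
```

Because across() adds the rank columns next to the score columns, this variant does not need the RowID helper column or the join.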

Answer 2:

Your data:

df <- structure(list(Category = c("Orange", "Orange", "Orange", "Orange", 
"Orange", "Orange", "Orange", "Orange", "Orange", "Orange", "Orange", 
"Orange", "Orange", "Orange", "Banana", "Banana", "Banana", "Banana", 
"Banana", "Banana", "Banana", "Banana"), ID = c("FSGBR070N3", 
"FSGBR070N3", "FSGBR070N3", "FS00008L4G", "FS00008VLD", "FS00008VLD", 
"FS00008VLD", "FS00009SQX", "FS00009SQX", "FSUSA0A1KW", "FSUSA0A1KW", 
"FSUSA0A1KX", "FSUSA0A1KY", "FS0000B389", "FS000092GP", "FS000092GP", 
"FS000092GP", "FS000092GP", "FS0000ATLN", "FSUSA0AVMF", "FSGBR058O8", 
"FSGBR05845"), Score.08.2007 = c(0.16, 0.05, 0.11, 0.28, 0.27, 
0.27, 0.27, -2.03, NA, NA, NA, NA, NA, NA, 96.25, 96.25, 96.25, 
52.33, 31.73, 1.38, 1.37, 2.24), Score.09.2007 = c(0.16, 0.05, 
0.14, 0.22, 0.23, 0.27, 0.27, -2.03, NA, NA, 0.14, NA, 0.56, 
NA, 96.25, 93.25, 96.25, 51.33, 31.73, 1.38, 1.37, 2.24)), .Names = c("Category", 
"ID", "Score.08.2007", "Score.09.2007"), row.names = c(NA, -22L
), class = "data.frame")

Loop over scores and generate the ranks within each category:

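for(i in names(df)[grep("Score", names(df))]) {
    # Assumes the rows of each category are contiguous and in order of first
    # appearance, so the concatenated per-category ranks line up with the rows.
    df[,paste0("rank", i)] <- do.call("c", lapply(unique(df$Category), function(x){
        # Negate the scores so the highest score gets rank 1. Tied scores share
        # the lowest rank of the tie, and NA scores keep an NA rank.
        a <- rank(-df[df$Category == x, i], ties.method = "min", na.last = "keep")
        return(a)
    }))
}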

df
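
If the rows are not guaranteed to be grouped by category, a variant based on ave() may be more robust, since it writes each group's ranks back to that group's own rows. This is a sketch under the assumption that ties should share the lowest rank, as in the expected output of the question. The data frame `toy`, the helper `rank_desc` and the "Rank" column naming are illustrative and not part of the answer above.

```r
# Hypothetical toy data; the categories are deliberately interleaved.
toy <- data.frame(
  Category = c("Orange", "Banana", "Orange", "Banana", "Orange", "Banana"),
  Score.08.2007 = c(0.27, 96.25, 0.27, 52.33, NA, 96.25),
  stringsAsFactors = FALSE
)

# Rank in descending order within a group; ties share the lowest rank, NA stays NA.
rank_desc <- function(s) rank(-s, ties.method = "min", na.last = "keep")

for (col in grep("^Score", names(toy), value = TRUE)) {
  # ave() applies rank_desc per Category and returns values in the original row order.
  toy[[sub("^Score", "Rank", col)]] <- ave(toy[[col]], toy$Category, FUN = rank_desc)
}

toy$Rank.08.2007
# 1 1 1 3 NA 1
```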

Source: https://stackoverflow.com/questions/45692427/specific-group-rankings-in-r

Tags

dataframe

ranking