Strange behavior when using apply with rank and order on a data.frame with ordered factors

孤城傲影 2021-01-23 22:23

I've found some weird behavior with apply.

Assume I have an arbitrary matrix of ordered variables

set.seed(4)
x <- ordered(sample(1:10,

2 Answers
  •  遥遥无期
    2021-01-23 22:33

    As requested by the OP, here is a detailed explanation that may help other R users avoid these traps.

    Trap 1

    As joran has pointed out, apply coerces the data frame to a matrix (via as.matrix), which replaces the ordered factors with character strings. So, the original data.frame

    data1
      x  y  z
    1 6  9 10
    2 1  3  1
    3 3  8  8
    4 3 10  3
    

    becomes

    as.matrix(data1)
         x   y    z   
    [1,] "6" "9"  "10"
    [2,] "1" "3"  "1" 
    [3,] "3" "8"  "8" 
    [4,] "3" "10" "3" 
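
    The coercion can be reproduced with a small sketch. The question's code is truncated, so data1 is rebuilt here by hand from the values printed above; the choice of levels = 1:10 is an assumption, not something the OP confirmed.

    ```r
    # Values copied from the printed data1; levels = 1:10 is an assumption
    data1 <- data.frame(
      x = ordered(c(6, 1, 3, 3),  levels = 1:10),
      y = ordered(c(9, 3, 8, 10), levels = 1:10),
      z = ordered(c(10, 1, 8, 3), levels = 1:10)
    )

    m <- as.matrix(data1)   # the same coercion apply() performs on a data frame
    is.character(m)         # TRUE: the ordered factors have become strings

    # rank() inside apply() therefore sees character vectors, not factors
    res <- apply(data1, 2, rank)
    res[, "y"]              # ranks of the strings "9", "3", "8", "10"
    ```

    Checking is.character(as.matrix(df)) before calling apply is a quick way to see whether a data frame will be silently turned into strings.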
    

    Trap 2

    Character strings are sorted lexically, that is, character by character rather than by numeric value. Thus, sorting the y column as character returns

    sort(c("9", "3", "8", "10"))
    [1] "10" "3"  "8"  "9" 
    

    instead of

    sort(c(9, 3, 8, 10))
    [1]  3  8  9 10
    

    This explains why apply returns a different result for the rank operation here.
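
    The effect on rank can be checked directly with the y values from above. This is only an illustration; the variable names y_chr and y_num are made up for the example.

    ```r
    y_chr <- c("9", "3", "8", "10")   # what rank() sees after coercion to character
    y_num <- c(9, 3, 8, 10)           # the numeric values actually meant

    rank(y_chr)   # 4 2 3 1  ("10" sorts before "3" because "1" < "3")
    rank(y_num)   # 3 1 2 4
    ```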

    Solution

    You can use lapply to compute the rank of each column of the data frame.

    as.data.frame(lapply(data1, rank))
        x y z
    1 4.0 3 4
    2 1.0 1 1
    3 2.5 2 3
    4 2.5 4 2
    

    lapply returns a list, and a data frame is a special kind of list.

    Avoid sapply because sapply takes the output of lapply and "simplifies" it to whatever it thinks is appropriate. Here,

    sapply(data1, rank)
           x y z
    [1,] 4.0 3 4
    [2,] 1.0 1 1
    [3,] 2.5 2 3
    [4,] 2.5 4 2
    

    returns a matrix (again!), which needs to be coerced to a data frame. (See chapter 8.3.20 of The R Inferno by Patrick Burns. The text is a good read, anyway.)
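
    The difference in return types can be verified with a short sketch. As before, data1 is rebuilt by hand from the printed values (levels = 1:10 is an assumption). The data1[] <- lapply(...) form is a common idiom for keeping the data frame structure, shown here as one possible variant rather than the OP's code.

    ```r
    # Values copied from the printed data1; levels = 1:10 is an assumption
    data1 <- data.frame(
      x = ordered(c(6, 1, 3, 3),  levels = 1:10),
      y = ordered(c(9, 3, 8, 10), levels = 1:10),
      z = ordered(c(10, 1, 8, 3), levels = 1:10)
    )

    ranks_list <- lapply(data1, rank)   # a plain list, one element per column
    ranks_mat  <- sapply(data1, rank)   # simplified to a numeric matrix

    is.list(ranks_list)    # TRUE
    is.matrix(ranks_mat)   # TRUE

    # Assigning into data1[] replaces the columns but keeps the data frame
    ranks_df <- data1
    ranks_df[] <- lapply(data1, rank)
    is.data.frame(ranks_df)   # TRUE
    ranks_df$y                # 3 1 2 4
    ```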

    Alternative Solution

    The OP has not given an indication why he needs to work with ordered factors. If factors, ordered or not, are not essential to the OP's underlying problem, then apply would have worked as expected.

    set.seed(4)
    x2 <- sample(1:10, size = 4, replace = TRUE)
    y2 <- sample(1:10, size = 4, replace = TRUE)
    z2 <- sample(1:10, size = 4, replace = TRUE)
    data2 <- data.frame(x2, y2, z2)
    data2
      x2 y2 z2
    1  6  9 10
    2  1  3  1
    3  3  8  8
    4  3 10  3
    apply(data2, 2, rank) 
      x2 y2 z2
    [1,] 4.0  3  4
    [2,] 1.0  1  1
    [3,] 2.5  2  3
    [4,] 2.5  4  2
    

    (Nevertheless, it is better to use lapply instead of apply with a data frame.)
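
    If the data already arrive as ordered factors, they can be converted back to numbers first. The sketch below assumes the factor labels are numeric strings; to_num is a hypothetical helper name, and data1 is again rebuilt by hand from the printed values.

    ```r
    # Values copied from the printed data1; default levels are used here on purpose
    data1 <- data.frame(
      x = ordered(c(6, 1, 3, 3)),
      y = ordered(c(9, 3, 8, 10)),
      z = ordered(c(10, 1, 8, 3))
    )

    # Hypothetical helper: go through the labels, not the internal codes
    to_num <- function(f) as.numeric(as.character(f))

    as.numeric(data1$y)   # 3 1 2 4   (level codes, not the original values)
    to_num(data1$y)       # 9 3 8 10  (the original values)

    data_num <- as.data.frame(lapply(data1, to_num))
    res <- apply(data_num, 2, rank)   # numeric matrix, so ranks are numeric
    res[, "y"]                        # 3 1 2 4
    ```

    Calling as.numeric directly on a factor returns the internal level codes, which is why the detour through as.character is needed.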

    Trap 3

    When I started to learn R, I was misled by the name of the function ordered(). It took me a while to understand that it creates a special kind of factor rather than sorting anything. Likewise, it took me some time to figure out the difference between sort() and order() and when to use which function appropriately.
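
    A compact comparison of the four functions, using the y values from above as an illustrative vector v:

    ```r
    v <- c(9, 3, 8, 10)

    sort(v)       # 3 8 9 10 : the values in ascending order
    order(v)      # 2 3 1 4  : the indices that would sort v
    v[order(v)]   # 3 8 9 10 : same result as sort(v)
    rank(v)       # 3 1 2 4  : the position of each value in the sorted vector

    f <- ordered(v)   # does not sort v; creates a factor whose levels are ordered
    is.ordered(f)     # TRUE
    levels(f)         # "3" "8" "9" "10"
    f[1] > f[2]       # TRUE: ordered factors can be compared by level
    ```

    order() is the one to reach for when sorting a data frame by a column, for example df[order(df$col), ], since it returns row indices rather than values.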
