Matrix multiplication in R: requires numeric/complex matrix/vector arguments

耶瑟儿~ 2021-01-18 21:38

I'm using the dataset BreastCancer in the mlbench package, and I am trying to do the following matrix multiplication as a part of logistic regress

1 answer
  • 2021-01-18 21:55

    Organizing our long-winded discussion in the comments into an answer.

    Matrix-multiplication operators and functions such as %*%, crossprod and tcrossprod expect matrices of "numeric", "complex" or "logical" mode. However, your matrix has "character" mode.

    library(mlbench)
    data(BreastCancer)
    X <- as.matrix(BreastCancer[, 1:10])
    mode(X)
    #[1] "character"
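
    To reproduce the error message from the title without loading the dataset, here is a minimal sketch on a toy character matrix. The objects M, b and msg are made up for illustration and are not part of the original question.

    ```r
    # Toy character matrix: prints like numbers, but its mode is "character"
    M <- matrix(c("1", "2", "3", "4"), nrow = 2)
    b <- c(0.5, 2)

    # %*% refuses non-numeric input; capture the error message instead of stopping
    msg <- tryCatch(M %*% b, error = function(e) conditionMessage(e))
    msg

    # Changing the mode (dimensions are kept) makes the same product work
    mode(M) <- "numeric"
    M %*% b
    ```

    Here the strings are plain numbers, so changing the mode is enough. Factor columns need more care, as discussed further down.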
    

    This may be surprising, since the dataset appears to hold numeric data:

    head(BreastCancer[, 1:10])
    #       Id Cl.thickness Cell.size Cell.shape Marg.adhesion Epith.c.size
    #1 1000025            5         1          1             1            2
    #2 1002945            5         4          4             5            7
    #3 1015425            3         1          1             1            2
    #4 1016277            6         8          8             1            3
    #5 1017023            4         1          1             3            2
    #6 1017122            8        10         10             8            7
    #  Bare.nuclei Bl.cromatin Normal.nucleoli Mitoses
    #1           1           3               1       1
    #2          10           3               2       1
    #3           2           3               1       1
    #4           4           3               7       1
    #5           1           3               1       1
    #6          10           9               7       1
    

    But the printed output is misleading. These columns are in fact characters or factors:

    lapply(BreastCancer[, 1:10], class)
    #$Id
    #[1] "character"
    #
    #$Cl.thickness
    #[1] "ordered" "factor" 
    #
    #$Cell.size
    #[1] "ordered" "factor" 
    #
    #$Cell.shape
    #[1] "ordered" "factor" 
    #
    #$Marg.adhesion
    #[1] "ordered" "factor" 
    #
    #$Epith.c.size
    #[1] "ordered" "factor" 
    #
    #$Bare.nuclei
    #[1] "factor"
    #
    #$Bl.cromatin
    #[1] "factor"
    #
    #$Normal.nucleoli
    #[1] "factor"
    #
    #$Mitoses
    #[1] "factor"
    

    When you call as.matrix, these columns are all coerced to "character" (see "R: Why am I not getting type or class "factor" after converting columns to factor?" for a thorough explanation).
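
    The coercion is easy to see on a small example. The sketch below uses a made-up data frame called toy (not the BreastCancer data) to compare as.matrix with data.matrix:

    ```r
    # Toy data frame (made up): one numeric column and one factor column
    toy <- data.frame(
      x = c(1.5, 2.5, 3.5),
      f = factor(c("10", "20", "10"))
    )

    # as.matrix() falls back to a character matrix once a column is non-numeric
    mode(as.matrix(toy))

    # data.matrix() always returns a numeric matrix, but a factor is replaced by
    # its integer codes (1, 2, 1 here), not by its labels (10, 20, 10)
    data.matrix(toy)
    ```

    This is why the factor columns should be converted by their labels first, before data.matrix is called.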

    So to do the matrix-multiplication, we need to correctly coerce these columns to "numeric".


    dat <- BreastCancer[, 1:10]
    
    ## character to numeric
    dat[[1]] <- as.numeric(dat[[1]])
    
    ## factor to numeric
    dat[2:10] <- lapply( dat[2:10], function (x) as.numeric(levels(x))[x] )
    
    ## get the matrix
    X <- data.matrix(dat)
    mode(X)
    #[1] "numeric"
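
    If many columns need this treatment, the two conversions can be wrapped in one function. This is only a sketch: to_numeric, toy and toy_X are hypothetical names, and it assumes every label or string really is a number (anything else becomes NA with a warning).

    ```r
    # Hypothetical helper: convert a factor or character vector to numeric
    # using the labels, not the internal integer codes
    to_numeric <- function(x) {
      if (is.factor(x)) {
        as.numeric(levels(x))[x]
      } else {
        as.numeric(x)
      }
    }

    # Toy data frame (made up) with one character and one factor column
    toy <- data.frame(
      id = c("101", "102", "103"),
      score = factor(c("7", "3", "7")),
      stringsAsFactors = FALSE
    )

    # toy[] <- ... replaces the columns but keeps the data frame structure
    toy[] <- lapply(toy, to_numeric)
    toy_X <- data.matrix(toy)
    mode(toy_X)
    ```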
    

    Now you can do, for example, a matrix-vector multiplication.

    ## some possible matrix-vector multiplications
    beta <- runif(10)
    yhat <- X %*% beta
    
    ## add prediction back to data frame
    dat$prediction <- drop(yhat)  ## drop() turns the one-column matrix into a plain vector
    

    However, I doubt this is the correct way to obtain predicted values for your logistic regression model: when you build the model with factors, the model matrix is not the X above but a matrix of dummy variables. I highly recommend using predict.
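
    To illustrate the difference, here is a minimal sketch on simulated data. The data frame toy, its columns and the seed are all made up; it only shows that the matrix glm actually uses contains dummy columns, and that predict reproduces the product of that matrix with the coefficients.

    ```r
    set.seed(1)  # arbitrary seed, only to make the toy data reproducible

    # Toy data (made up): a 3-level factor, a numeric covariate, a binary response
    n <- 60
    toy <- data.frame(
      f = factor(rep(c("a", "b", "c"), length.out = n)),
      x = rnorm(n)
    )
    toy$y <- rbinom(n, 1, plogis(0.5 * toy$x + (toy$f == "b")))

    fit <- glm(y ~ f + x, family = binomial, data = toy)

    # The model matrix has an intercept, two dummy columns for f, and x
    mm <- model.matrix(fit)
    colnames(mm)

    # Linear predictor computed by hand versus predict()
    eta_manual <- drop(mm %*% coef(fit))
    eta_predict <- predict(fit, type = "link")
    all.equal(unname(eta_manual), unname(eta_predict))

    # Fitted probabilities
    head(predict(fit, type = "response"))
    ```

    Note that ordered factors, like several columns of BreastCancer, get polynomial contrasts by default rather than the 0/1 dummies shown here, which is one more reason to let predict build the matrix for you.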


    You mentioned in the comments: "This line also worked for me: as.matrix(sapply(dat, as.numeric))"

    Looks like you were lucky: in this dataset the factor levels happen to coincide with the underlying integer codes. In general, converting a factor to numeric should use the method shown above. Compare:

    f <- gl(4, 2, labels = c(12.3, 0.5, 2.9, -11.1))
    f
    #[1] 12.3  12.3  0.5   0.5   2.9   2.9   -11.1 -11.1
    #Levels: 12.3 0.5 2.9 -11.1
    
    as.numeric(f)
    #[1] 1 1 2 2 3 3 4 4
    
    as.numeric(levels(f))[f]
    #[1] 12.3  12.3  0.5   0.5   2.9   2.9   -11.1 -11.1
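
    If you want to know in advance whether the shortcut as.numeric(f) is safe for a given factor, a small check like the following can help. It is a sketch only: codes_match_labels, lucky and unlucky are hypothetical names introduced here.

    ```r
    # Hypothetical helper: TRUE when a factor's numeric labels equal its integer
    # codes, i.e. when as.numeric(f) happens to give the right answer
    codes_match_labels <- function(f) {
      isTRUE(all.equal(as.numeric(levels(f)), seq_along(levels(f))))
    }

    # Levels "1", ..., "10": labels and codes coincide
    lucky <- factor(c(3, 1, 2, 10), levels = 1:10)

    # Levels "12.3", "0.5", "2.9", "-11.1": labels and codes differ
    unlucky <- gl(4, 2, labels = c(12.3, 0.5, 2.9, -11.1))

    codes_match_labels(lucky)
    codes_match_labels(unlucky)
    ```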
    

    This is covered in the documentation page ?factor.
