SparkR 1.6: How to predict probability when modeling with glm (binomial family)

忘掉有多难 2021-01-24 00:20

I have just installed SparkR 1.6.1 on CentOS and am not using Hadoop. My code to model data with discrete 'TARGET' values is as follows:

# 'tr' is a R data f


        
1 answer
  • 2021-01-24 00:36

    Background: when your R code requests the result of a computation from the Spark backend, Spark performs the computation and serializes the result. The result is then deserialized on the R side, and you get your R objects.

    On the Spark backend it works as follows: if the type of the object to be returned is one of Character, String, Long, Float, Double, Integer, Boolean, Date, TimeStamp, or an Array of these, etc., the backend serializes the object. If the type matches none of these, it simply assigns the object an id, stores it in memory against that id, and sends the id to the R client. (JVMObjectTracker in RBackendHandler is responsible for keeping track of JVM objects on the Spark backend.) The id is then deserialized into an object of class jobj on the R side. (You can look at the writeObject method of SerDe.scala to get the full picture of what is serialized upfront and what is not.)
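
    The serialize-or-track decision described above can be pictured with a toy model in plain R. This is only a sketch of the idea, not Spark's code: the names tracker, is_wire_type, send_to_client and resolve are invented for illustration, and the type check is a rough stand-in for what writeObject does.

    ```r
    # Toy model of the backend's decision (illustrative names, not Spark's API).
    tracker <- new.env()
    tracker$next_id <- 0L

    # Rough stand-in for "types that can be sent over the wire directly".
    is_wire_type <- function(x) {
      is.character(x) || is.numeric(x) || is.logical(x) ||
        inherits(x, "Date") || inherits(x, "POSIXct")
    }

    # Either hand back the value itself or store it and hand back only an id.
    send_to_client <- function(x) {
      if (is_wire_type(x)) {
        return(list(kind = "value", payload = x))
      }
      tracker$next_id <- tracker$next_id + 1L
      id <- paste0("obj", tracker$next_id)
      assign(id, x, envir = tracker)
      list(kind = "jobj", payload = id)
    }

    # Later operations look the real object up by its id.
    resolve <- function(id) get(id, envir = tracker)

    plain <- send_to_client(c(0.1, 0.9))
    proxy <- send_to_client(structure(list(values = c(0.1, 0.9)), class = "DenseVector"))
    ```

    Here plain carries the numbers themselves, while proxy carries only an id; the fake DenseVector stays in tracker until something calls resolve(proxy$payload).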

    On the R side, if you look at the objects in the probability column of your predictions data frame, you will see that their class is jobj. As mentioned, objects of this class act as proxies for the actual Java objects held on the Spark cluster. In this particular case the backing Java class is org.apache.spark.mllib.linalg.DenseVector. It is a vector because it contains the probability for each class. Because this vector is not one of the types the SerDe class serializes, the Spark backend just returns a jobj proxy and keeps the DenseVector objects in memory so that future operations on them are possible.

    With that background, you should be able to get the probability values on your R frontend by invoking methods on these DenseVector objects. As of now, I think this is the only way. The following code works for the iris data set:

    # Build a Spark DataFrame from iris and add a binary target column
    irisDf <- createDataFrame(sqlContext, iris)
    irisDf$target <- irisDf$Species == 'setosa'
    model <- glm(target ~ . , data = irisDf, family = "binomial")
    summary(model)
    predictions <- predict(model, newData = irisDf)
    modelPrediction <- select(predictions, "probability")
    # Collect to a local R data frame; the probability column holds jobj proxies
    localPredictions <- SparkR:::as.data.frame(predictions)

    getValFrmDenseVector <- function(x) {
        # Binary classification, so the vector has just two elements
        a <- SparkR:::callJMethod(x$probability, "apply", as.integer(0))
        b <- SparkR:::callJMethod(x$probability, "apply", as.integer(1))
        c(a, b)
    }

    # One row of class probabilities per prediction
    t(apply(localPredictions, 1, FUN=getValFrmDenseVector))
    

    With this I get the following probability output for the two classes:

            [,1]         [,2]
    1   3.036612e-15 1.000000e+00
    2   5.919287e-12 1.000000e+00
    3   7.831827e-14 1.000000e+00
    4   7.712003e-13 1.000000e+00
    5   4.427117e-16 1.000000e+00
    6   3.816329e-16 1.000000e+00
    [...]
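
    The matrix has no column names, so a small plain-R post-processing step can make it easier to read. The sketch below types in the first three rows of the output by hand; the column names p_false and p_true are made up, and it assumes that vector index 1 (the second column) is the probability that target is TRUE. The output suggests this, since the first iris rows are setosa, but it is worth checking on your own data.

    ```r
    # First three rows of the output above, entered by hand for illustration.
    probs <- rbind(
      c(3.036612e-15, 1.000000e+00),
      c(5.919287e-12, 1.000000e+00),
      c(7.831827e-14, 1.000000e+00)
    )

    # Hypothetical column names; assumes column 2 is P(target == TRUE).
    colnames(probs) <- c("p_false", "p_true")

    result <- as.data.frame(probs)
    # Turn the probability into a class label with a 0.5 cutoff.
    result$predicted <- result$p_true > 0.5

    # Sanity check: the two class probabilities in each row should sum to 1.
    rows_sum_to_one <- all(abs(rowSums(probs) - 1) < 1e-9)
    ```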
    

    Note: functions prefixed with SparkR::: are not exported from the SparkR package namespace, so keep in mind that you are coding against a package-private implementation. (But I don't really see how this can be achieved otherwise, unless Spark provides public API support for it.)
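
    One way to limit exposure to the private API is to confine the SparkR::: call to a single small function and pass it into otherwise plain R code. The sketch below does that; probabilities_as_matrix and dense_to_array are hypothetical names. The commented-out lines rest on two assumptions not verified here: that calling toArray on the DenseVector proxy comes back as an R numeric vector, and that localPredictions$probability is a list of jobj proxies.

    ```r
    # Turn a list of vector proxies into a numeric matrix, one row per prediction.
    # `to_array` converts a single proxy into a plain numeric vector.
    probabilities_as_matrix <- function(vectors, to_array) {
      rows <- lapply(vectors, to_array)
      do.call(rbind, rows)
    }

    # With a live SparkR 1.6 session the extractor might look like this
    # (assumption: DenseVector's toArray result arrives as an R numeric vector):
    # dense_to_array <- function(v) SparkR:::callJMethod(v, "toArray")
    # probs <- probabilities_as_matrix(localPredictions$probability, dense_to_array)

    # Stand-in demonstration without Spark: plain numeric vectors play the proxies.
    fake_vectors <- list(c(0.2, 0.8), c(0.9, 0.1))
    demo <- probabilities_as_matrix(fake_vectors, identity)
    ```

    Because the matrix is built from whatever length the vectors have, the same function would also cover more than two classes, and only the one-line extractor needs to change if the private call stops working.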
