I have just installed SparkR 1.6.1 on CentOS and am not using Hadoop. My code to model data with discrete 'TARGET' values is as follows:
# 'tr' is an R data frame
Background: when your R code requests the result of some computation from the Spark backend, Spark performs the computation and serializes the result. The result is then deserialized on the R side, and you get your R objects.
Now, the way it works on the Spark backend is: if it figures that the type of the object to be returned is one of Character, String, Long, Float, Double, Integer, Boolean, Date, TimeStamp, or their Array types etc., then it serializes the object. But if it finds that the type does not match any of these, it simply assigns the object an id, stores it in memory against that id, and sends this id to the R client. (JVMObjectTracker in RBackendHandler is responsible for keeping track of JVM objects on the Spark backend.) This id is then deserialized into a jobj on the R side. (You can look at the writeObject method of SerDe.scala to get the full picture of what is serialized upfront and what is not.)
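For instance, here is a minimal sketch of that boundary using SparkR's internal backend helpers (java.util.ArrayList is just an arbitrary non-supported type picked for illustration):
lst <- SparkR:::newJObject("java.util.ArrayList")  # not a SerDe-supported type
class(lst)                                         # "jobj" -- only a proxy id came back
s <- SparkR:::callJMethod(lst, "toString")         # String IS a supported type
class(s)                                           # "character" -- fully deserialized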
Now, on the R side, if you look at the objects in the probability column of your predictions data frame, you will observe that their class is jobj. As mentioned, objects of this class act as proxies to the actual Java objects held on the Spark cluster. In this particular case the backing Java class is org.apache.spark.mllib.linalg.DenseVector. This is a vector, as it contains the probability for each class. And because this vector is not one of the serialized types supported by the SerDe class, the Spark backend just returns a jobj proxy and stores these DenseVector objects in memory so as to allow future operations on them.
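You can verify this from the R shell. A small sketch (assuming predictions is the DataFrame returned by predict() as in the code below, and that collecting it materializes the probability column as a list of jobj proxies):
localPreds <- SparkR:::as.data.frame(predictions)
p <- localPreds$probability[[1]]   # first row's probability cell
class(p)                           # "jobj"
# Reflect on the backing Java object; getName returns a String, which SerDe serializes:
SparkR:::callJMethod(SparkR:::callJMethod(p, "getClass"), "getName")
# expected: "org.apache.spark.mllib.linalg.DenseVector"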
With that background, you should be able to get the probability values on your R frontend by invoking methods on these DenseVector objects. As of now, I think this is the only way. Following is the code that works for the iris data set --
irisDf <- createDataFrame(sqlContext, iris)
irisDf$target <- irisDf$Species == 'setosa'
model <- glm(target ~ ., data = irisDf, family = "binomial")
summary(model)
predictions <- predict(model, newData = irisDf)
modelPrediction <- select(predictions, "probability")
# Collect locally; the probability cells come back as jobj proxies
localPredictions <- SparkR:::as.data.frame(modelPrediction)
getValFrmDenseVector <- function(x) {
  # Given it's binary classification, there are just two elements in the vector;
  # DenseVector.apply(i) returns the i-th element as a Double, which SerDe serializes
  a <- SparkR:::callJMethod(x$probability, "apply", as.integer(0))
  b <- SparkR:::callJMethod(x$probability, "apply", as.integer(1))
  c(a, b)
}
t(apply(localPredictions, 1, FUN = getValFrmDenseVector))
With this I get the following probability output for the two classes --
[,1] [,2]
1 3.036612e-15 1.000000e+00
2 5.919287e-12 1.000000e+00
3 7.831827e-14 1.000000e+00
4 7.712003e-13 1.000000e+00
5 4.427117e-16 1.000000e+00
6 3.816329e-16 1.000000e+00
[...]
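If you'd rather not hard-code the number of classes, DenseVector also exposes a toArray method, and double arrays are among the types SerDe serializes, so the whole vector can be pulled back in one call. A sketch under that assumption (getProbVector is my own helper name):
getProbVector <- function(x) {
  # toArray returns the full probability vector; unlist() covers the case
  # where it arrives as an R list rather than an atomic numeric vector
  unlist(SparkR:::callJMethod(x$probability, "toArray"))
}
t(apply(localPredictions, 1, FUN = getProbVector))
This produces the same matrix as above, with one column per class.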
Note: SparkR:::-prefixed functions are not exported from the SparkR package namespace, so keep in mind that you're coding against a package-private implementation. (But I don't really see how this can be achieved otherwise, unless Spark provides public API support for it.)