How to eval spark.ml model without DataFrames/SparkContext?

时光说笑 2021-01-22 22:45

With Spark MLlib, I'd build a model (like RandomForest), and then it was possible to eval it outside of Spark by loading the model and calling predict on it with a vector of features. Is the same possible with spark.ml models, which seem to require a DataFrame and hence a SparkContext?

3 Answers
  •  暖寄归人
    2021-01-22 23:07

    Spent days on this problem too. It's not straightforward. My third suggestion involves code I have written specifically for this purpose.

    Option 1

    As other commenters have said, predict(Vector) is now available. However, you need to know how to construct a vector. If you don't, see Option 3.
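    For illustration, a minimal sketch (assuming Spark 3.0+, where predict became public on classification models; the model path and feature values here are hypothetical, and note that loading a saved model still goes through Spark's ML persistence layer):

    import org.apache.spark.ml.classification.RandomForestClassificationModel
    import org.apache.spark.ml.linalg.Vectors

    // hypothetical path to a previously saved model
    val model = RandomForestClassificationModel.load("/models/my-random-forest")

    // predict(Vector) scores a single example locally, no DataFrame involved
    val features = Vectors.dense(0.1, 2.0, 3.5)
    val prediction: Double = model.predict(features)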

    Option 2

    If the goal is to avoid setting up a Spark server (standalone or cluster mode), then it's possible to start Spark in local mode. The whole thing will run inside a single JVM.

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().master("local[*]").getOrCreate()
    // create a DataFrame from a file, or build one from data in memory
    // then call model.transform() on it to get predictions
    
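    To make those comments concrete, here is a sketch that continues from the spark session above (the model path and feature values are hypothetical, and it assumes a saved PipelineModel whose stages expect a vector column named "features"):

    import org.apache.spark.ml.PipelineModel
    import org.apache.spark.ml.linalg.Vectors
    import spark.implicits._

    // hypothetical path to a previously saved pipeline model
    val model = PipelineModel.load("/models/my-random-forest")

    // a single-row DataFrame standing in for one incoming request
    val df = Seq(Tuple1(Vectors.dense(0.1, 2.0, 3.5))).toDF("features")

    model.transform(df).select("prediction").show()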

    But this brings unnecessary dependencies into your prediction module, and it consumes resources in your JVM at runtime. Also, if prediction latency is critical, for example making a prediction within a millisecond of a request arriving, then this option is too slow.

    Option 3

    MLlib's FeatureHasher produces output that can feed directly into your learner. The class handles one-hot encoding and also fixes the size of your feature dimension, and you can use it even when all your features are numerical. If you use it during training, then all you need at prediction time is the hashing logic. It's implemented as a Spark transformer, though, so it's not easy to reuse outside a Spark environment. So I have done the work of pulling the hashing function out into a library. You apply FeatureHasher and your learner during training as normal; a sketch of the training side follows.
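    Here is a hedged sketch of the training side (the column names, myHashSize, and trainingDf are hypothetical; FeatureHasher itself is the stock transformer from org.apache.spark.ml.feature):

    import org.apache.spark.ml.Pipeline
    import org.apache.spark.ml.classification.RandomForestClassifier
    import org.apache.spark.ml.feature.FeatureHasher

    // hash the raw columns into a fixed-size feature vector;
    // numFeatures must match the hash size used at prediction time
    val hasher = new FeatureHasher()
      .setInputCols("feature1", "feature2", "feature3", "feature4")
      .setOutputCol("features")
      .setNumFeatures(myHashSize)

    val rf = new RandomForestClassifier()
      .setLabelCol("label")
      .setFeaturesCol("features")

    // trainingDf: a hypothetical DataFrame with the raw columns plus a label
    val model = new Pipeline().setStages(Array(hasher, rf)).fit(trainingDf)

    Then here's how you use the slimmed-down hasher at prediction time: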

    // Schema and hash size must stay consistent across training and prediction
    val hasher = new FeatureHasherLite(mySchema, myHashSize)
    
    // create sample data-point and hash it
    val feature = Map("feature1" -> "value1", "feature2" -> 2.0, "feature3" -> 3, "feature4" -> false)
    val featureVector = hasher.hash(feature)
    
    // Make prediction
    val prediction = model.predict(featureVector)
    

    You can see details in my GitHub repo at tilayealemu/sparkmllite. If you'd rather copy my code, take a look at FeatureHasherLite.scala. There is sample code and there are unit tests too. Feel free to create an issue if you need help.
