With Spark MLlib, I'd build a model (like RandomForest), and then it was possible to evaluate it outside of Spark by loading the model and using predict on it.
Spent days on this problem too. It's not straightforward. My third suggestion involves code I have written specifically for this purpose.
Option 1
As other commenters have said, predict(Vector) is now available. However, you need to know how to construct a vector. If you don't, see Option 3.
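Here's a minimal sketch of what that looks like, assuming Spark 3.x and a RandomForestRegressionModel saved at a made-up path (loading the model still needs an active SparkSession, even though predict itself never touches a DataFrame):
import org.apache.spark.ml.regression.RandomForestRegressionModel
import org.apache.spark.ml.linalg.Vectors
// load the saved model (this still goes through Spark's persistence layer)
val model = RandomForestRegressionModel.load("/path/to/saved/model")
// predict directly on a Vector; the feature order must match training
val prediction = model.predict(Vectors.dense(0.5, 1.2, 3.0))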
Option 2
If the goal is to avoid setting up a Spark server (standalone or cluster mode), then it's possible to start Spark in local mode. The whole thing runs inside a single JVM.
val spark = SparkSession.builder().config("spark.master", "local[*]").getOrCreate()
// create dataframe from file, or make it up from some data in memory
// use model.transform() to get predictions
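A slightly fuller sketch of that flow, with a made-up saved-pipeline path and column names:
import org.apache.spark.ml.PipelineModel
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder().config("spark.master", "local[*]").getOrCreate()
import spark.implicits._
// hypothetical saved pipeline and input columns; substitute your own
val model = PipelineModel.load("/path/to/saved/pipeline")
val df = Seq(("value1", 2.0, 3, false)).toDF("feature1", "feature2", "feature3", "feature4")
val prediction = model.transform(df).select("prediction").head.getDouble(0)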
But this brings unnecessary dependencies into your prediction module, and it consumes resources in your JVM at runtime. Also, if prediction latency is critical, for example serving a prediction within a millisecond of a request arriving, then this option is too slow.
Option 3
MLlib's FeatureHasher output can be used as input to your learner. The class handles one-hot encoding and also fixes the size of your feature dimension, and you can use it even when all your features are numerical. If you use it during training, then all you need at prediction time is the hashing logic. It's implemented as a Spark transformer, so it's not easy to reuse outside a Spark environment. So I have done the work of pulling the hashing function out into a library. You apply FeatureHasher and your learner during training as normal. Then here's how you use the slimmed-down hasher at prediction time:
// Schema and hash size must stay consistent across training and prediction
val hasher = new FeatureHasherLite(mySchema, myHashSize)
// create sample data-point and hash it
val feature = Map("feature1" -> "value1", "feature2" -> 2.0, "feature3" -> 3, "feature4" -> false)
val featureVector = hasher.hash(feature)
// Make prediction
val prediction = model.predict(featureVector)
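For completeness, the training side just uses the standard FeatureHasher transformer. A rough sketch, assuming trainingDf also has a label column and with logistic regression standing in for whatever learner you use:
import org.apache.spark.ml.feature.FeatureHasher
import org.apache.spark.ml.classification.LogisticRegression
// hash the raw columns into a fixed-size vector; numFeatures must match
// the hash size passed to FeatureHasherLite at prediction time
val hasher = new FeatureHasher()
  .setInputCols("feature1", "feature2", "feature3", "feature4")
  .setOutputCol("features")
  .setNumFeatures(myHashSize)
val hashed = hasher.transform(trainingDf)
val model = new LogisticRegression().fit(hashed)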
You can see the details on my GitHub at tilayealemu/sparkmllite. If you'd rather copy my code, take a look at FeatureHasherLite.scala. There is sample code and unit tests too. Feel free to create an issue if you need help.
Re: Is there any way to build a DataFrame outside of Spark?
It is not possible. A DataFrame lives inside an SQLContext, which in turn lives inside a SparkContext. You could perhaps work around it somehow, but the bottom line is that the coupling between DataFrames and the SparkContext is by design.
Here is my solution for using Spark models outside of a Spark context (using PMML):
SparkConf sparkConf = new SparkConf();
SparkSession session = SparkSession.builder().enableHiveSupport().config(sparkConf).getOrCreate();
String tableName = "schema.table";
Properties dbProperties = new Properties();
dbProperties.setProperty("user",vKey);
dbProperties.setProperty("password",password);
dbProperties.setProperty("AuthMech","3");
dbProperties.setProperty("source","jdbc");
dbProperties.setProperty("driver","com.cloudera.impala.jdbc41.Driver");
String tableName = "schema.table";
String simpleUrl = "jdbc:impala://host:21050/schema"
Dataset<Row> data = session.read().jdbc(simpleUrl ,tableName,dbProperties);
String[] inputCols = {"column1"};
StringIndexer indexer = new StringIndexer().setInputCol("column1").setOutputCol("indexed_column1");
// no need to fit/transform the indexer here; the Pipeline below takes care of it
VectorAssembler assembler = new VectorAssembler().setInputCols(inputCols).setOutputCol("features");
GBTRegressor gbt = new GBTRegressor()
        .setMaxIter(20)
        .setMaxDepth(2)
        .setMaxBins(204)
        .setLabelCol("faktor");
PipelineStage[] stages = {indexer, assembler, gbt};
Pipeline pipeline = new Pipeline();
pipeline.setStages(stages);
PipelineModel pmodel = pipeline.fit(data);
// ConverterUtil comes from the JPMML-SparkML library
PMML pmml = ConverterUtil.toPMML(data.schema(), pmodel);
FileOutputStream fos = new FileOutputStream("model.pmml");
JAXBUtil.marshalPMML(pmml,new StreamResult(fos));
Using PMML for predictions (locally, without a Spark context; the evaluator takes a Map of arguments rather than a DataFrame):
PMML pmml = org.jpmml.model.PMMLUtil.unmarshal(new FileInputStream(pmmlFile));
ModelEvaluatorFactory modelEvaluatorFactory = ModelEvaluatorFactory.newInstance();
MiningModelEvaluator evaluator = (MiningModelEvaluator) modelEvaluatorFactory.newModelEvaluator(pmml);
Map<FieldName, String> args = new HashMap<FieldName, String>();
// supply one raw value per input field; here only the first input field, as an example
InputField curField = evaluator.getInputFields().get(0);
args.put(curField.getName(), "1.0");
// the returned map holds the predicted value(s), keyed by target field name
Map<FieldName, ?> result = evaluator.evaluate(args);