Question
PMML, MLeap, and PFA currently support only row-based transformations. None of them supports frame-based transformations such as aggregates, groupby, or join. What is the recommended way to export a Spark pipeline that includes these operations?
Answer 1:
I see two options with respect to MLeap:
1) Implement DataFrame-based transformers and an MLeap equivalent of SQLTransformer. This seems to be conceptually the best solution (since you can always encapsulate such transformations in a pipeline element), but it is also a lot of work, to be honest. See https://github.com/combust/mleap/issues/126
2) Extend the DefaultMleapFrame with the operations you want to perform, and then apply the required actions to the data handed to the REST server within a modified MleapServing subproject.
I actually went with option 2 and added implode, explode, and join as methods on the DefaultMleapFrame, and also added a HashIndexedMleapFrame that allows for fast joins. I did not implement groupby and agg, but in Scala this is relatively easy to accomplish.
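To illustrate the idea behind a hash-indexed join and a groupby/agg on a row-oriented frame, here is a minimal sketch. It is written in plain Python rather than Scala, and it does not use the MLeap API: the frame is just a list of dictionaries, and the names build_hash_index, hash_join, and group_by_agg are hypothetical helpers invented for this example, not methods of DefaultMleapFrame or HashIndexedMleapFrame.

```python
from collections import defaultdict
from typing import Any, Callable, Dict, Hashable, List

# Assumption: a "frame" is modelled as a list of dict rows, not an MLeap frame.
Row = Dict[str, Any]


def build_hash_index(rows: List[Row], key: str) -> Dict[Hashable, List[Row]]:
    """Index rows by the value of `key` so join lookups are dictionary hits."""
    index: Dict[Hashable, List[Row]] = defaultdict(list)
    for row in rows:
        index[row[key]].append(row)
    return index


def hash_join(left: List[Row], right_index: Dict[Hashable, List[Row]], key: str) -> List[Row]:
    """Inner join: merge each left row with every indexed right row sharing its key."""
    joined: List[Row] = []
    for l_row in left:
        # .get avoids inserting empty entries into the defaultdict for missing keys
        for r_row in right_index.get(l_row[key], []):
            joined.append({**r_row, **l_row})  # left values win on column-name clashes
    return joined


def group_by_agg(rows: List[Row], key: str, column: str,
                 agg: Callable[[List[Any]], Any]) -> Dict[Hashable, Any]:
    """Group rows by `key`, then reduce the values of `column` with `agg`."""
    groups: Dict[Hashable, List[Any]] = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row[column])
    return {k: agg(values) for k, values in groups.items()}


# Hypothetical data: incoming request rows joined against a small lookup table.
requests = [
    {"user_id": 1, "amount": 10.0},
    {"user_id": 2, "amount": 5.0},
    {"user_id": 1, "amount": 2.5},
]
users = [{"user_id": 1, "country": "DE"}, {"user_id": 2, "country": "US"}]

user_index = build_hash_index(users, "user_id")  # built once, reused per request
joined = hash_join(requests, user_index, "user_id")
totals = group_by_agg(joined, "country", "amount", sum)
```

The index is built once and reused for every incoming batch, which is presumably what makes a hash-indexed frame attractive in a serving context where the lookup side is static.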
Answer 2:
PMML and PFA are standards for representing machine learning models, not data processing pipelines. A machine learning model takes in a data record, performs some computation on it, and emits an output data record. So by definition, you are working with a single isolated data record, not a collection/frame/matrix of data records.
If you need to represent complete data processing pipelines (where the ML model is just one part of the workflow), then you need to look at other standards, or a combination of them. Perhaps SQL paired with PMML would be a good choice. The idea is to perform data aggregation outside of the ML model, not inside it (e.g. a SQL database will be much better at it than any PMML or PFA runtime).
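A minimal sketch of that split, assuming an in-memory SQLite database as the SQL engine and a hypothetical score_record function standing in for a row-based PMML or PFA evaluator (no real PMML library is used here). The table name, column names, and threshold are invented for illustration.

```python
import sqlite3
from typing import Any, Dict, List


def aggregate_features(conn: sqlite3.Connection) -> List[Dict[str, Any]]:
    """Frame-level step: the database produces one feature row per customer."""
    cur = conn.execute(
        "SELECT customer_id, COUNT(*) AS n_orders, SUM(amount) AS total_amount "
        "FROM orders GROUP BY customer_id ORDER BY customer_id"
    )
    columns = [d[0] for d in cur.description]
    return [dict(zip(columns, row)) for row in cur.fetchall()]


def score_record(record: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical stand-in for a row-based model: one record in, one record out."""
    return {
        "customer_id": record["customer_id"],
        "high_value": record["total_amount"] > 100.0,  # invented threshold
    }


# Hypothetical data set up in an in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer_id INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [(1, 80.0), (1, 40.0), (2, 15.0)],
)

features = aggregate_features(conn)                       # aggregation outside the model
predictions = [score_record(row) for row in features]     # model sees isolated records
```

The model function never sees more than one record at a time; everything that needs the whole frame happens in the SQL query.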
Source: https://stackoverflow.com/questions/53380005/export-spark-feature-transformation-pipeline-to-a-file