How can you efficiently build one ML model per partition in Spark with foreachPartition?

Submitted by 不问归期 on 2020-06-17 05:34:29

Question


I am trying to fit one ML model for each partition of my dataset, and I do not know how to do it in Spark.

My dataset basically looks like this, partitioned by Company:

Company | Features | Target

A       | xxx      | 0.9
A       | xxx      | 0.8
A       | xxx      | 1.0
B       | xxx      | 1.2
B       | xxx      | 1.0
B       | xxx      | 0.9
C       | xxx      | 0.7
C       | xxx      | 0.9
C       | xxx      | 0.9

My goal is to train one regressor per company, in a parallelised way (I have a few hundred million records and about 100k companies). My intuition is that I need to use foreachPartition so that the partitions (i.e. my companies) are processed in parallel, with each company's model trained and saved. My main problem is how to deal with the iterator type that is passed to the function called by foreachPartition.

Here is what it would look like:

dd.foreachPartition(
    iterator => {
        // this is the step I am stuck on: turning the iterator into a DataFrame
        val company_df = iterator.toDF()
        val rg = new RandomForestRegressor()
            .setLabelCol("target")
            .setFeaturesCol("features")
            .setNumTrees(10)
        val model = rg.fit(company_df)
        model.write.save(company_path)
    }
)

As I understand it, converting the iterator into a DataFrame is not an option, since an RDD or DataFrame cannot be created inside a foreachPartition closure (no SparkSession is available on the executors).

I know the question is quite open but I am really stuck.
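
Edit: for what it's worth, one direction that does seem feasible (just a rough PySpark sketch on my side, assuming hypothetical plain numeric feature columns x1 and x2 and an arbitrary model path) is to bypass Spark ML inside the partition entirely: the iterator yields plain Rows, so a local library such as scikit-learn can consume them directly.

import pandas as pd
import joblib
from sklearn.ensemble import RandomForestRegressor

def train_partition(rows):
    # rows is an ordinary Python iterator of Row objects; no Spark API is needed here
    pdf = pd.DataFrame([row.asDict() for row in rows])
    if pdf.empty:
        return
    # a hash partition may hold several companies, so group locally
    for company, group in pdf.groupby("Company"):
        model = RandomForestRegressor(n_estimators=10)
        model.fit(group[["x1", "x2"]], group["Target"])  # hypothetical feature columns
        joblib.dump(model, f"/tmp/models/{company}.joblib")  # hypothetical output path

dd.repartition("Company").foreachPartition(train_partition)

The downside is that each partition then has to fit in executor memory, and I am not sure this scales to my data.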


Answer 1:


In PySpark you can do something like the following:

import pandas as pd
import statsmodels.api as sm
from pyspark.sql.functions import pandas_udf, PandasUDFType

# df has four columns: id, y, x1, x2
group_column = 'id'
y_column = 'y'
x_columns = ['x1', 'x2']

# one output row per group: the group key plus one coefficient per regressor
schema = df.select(group_column, *x_columns).schema

@pandas_udf(schema, PandasUDFType.GROUPED_MAP)
def ols(pdf):
    # input and output are both a pandas.DataFrame
    group_key = pdf[group_column].iloc[0]
    y = pdf[y_column]
    X = sm.add_constant(pdf[x_columns])
    model = sm.OLS(y, X).fit()
    return pd.DataFrame([[group_key] + [model.params[i] for i in x_columns]],
                        columns=[group_column] + x_columns)

beta = df.groupby(group_column).apply(ols)
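
Each row of beta then contains the group key and one fitted coefficient per regressor. On Spark 3.x, where the GROUPED_MAP pandas_udf form is deprecated, the same grouped-map pattern reads as follows (assuming Spark >= 3.0, with ols defined as a plain function without the decorator):

beta = df.groupby(group_column).applyInPandas(ols, schema=schema)

To match the original question, group by Company instead of id, fit whatever regressor you need inside the function, persist the model to shared storage, and return a small summary DataFrame per company.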


Source: https://stackoverflow.com/questions/58592081/how-can-you-efficiently-build-one-ml-model-per-partition-in-spark-with-foreachpa
