Parallelization of sklearn Pipeline

Submitted by 匆匆过客 on 2019-12-14 02:06:26

Question


I have a set of Pipelines and want a multi-threaded architecture. My typical Pipeline is shown below:

huber_pipe = Pipeline([
        ("DATA_CLEANER", DataCleaner()),
        ("DATA_ENCODING", Encoder(encoder_name='code')),
        ("SCALE", Normalizer()),
        ("FEATURE_SELECTION", huber_feature_selector),
        ("MODELLING", huber_model)
    ])

Is it possible to run the steps of the pipeline in different threads or cores?


Answer 1:


In general, no.

If you look at the interface for sklearn stages, the methods are of the form:

fit(X, y, other_stuff)

predict(X)

That is, they operate on the entire dataset at once, and can't do incremental learning on streams (or chunked streams) of data.

Moreover, fundamentally, some of the algorithms are not amenable to this. Consider for example your stage

("SCALE", Normalizer()),

Presumably, this normalizes using mean and/or variance. Without seeing the entire dataset, how can it know these things? It must therefore wait for the entire input before operating, and hence can't be run in parallel with the stages after it. Most (if not nearly all) stages are like that.
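To make this concrete, here is a small sketch using scikit-learn's StandardScaler (which does compute a mean and variance in fit; it stands in here for whatever the question's custom Normalizer does). The fitted statistic depends on every row, so the stage cannot produce correct output until it has seen the whole input:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0], [2.0], [3.0], [4.0]])

# fit() must see every row to compute the column mean.
scaler = StandardScaler().fit(X)
print(scaler.mean_)  # [2.5]

# Fitting on only the first half yields a different statistic,
# so the stage cannot be run on a stream of chunks.
half = StandardScaler().fit(X[:2])
print(half.mean_)  # [1.5]
```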


However, in some cases, you still can use multicores with sklearn.

  1. Some stages have an n_jobs parameter. Such stages still run sequentially relative to the other stages, but can parallelize their own work internally.

  2. In some cases you can roll your own (approximate) parallel versions of other stages. E.g., given any regressor stage, you can wrap it in a stage that randomly chunks your data into n parts, fits a copy of the regressor on each part in parallel, and outputs a regressor whose prediction is the average of the fitted regressors' predictions. YMMV.
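As an illustration of point 1, here is a hedged sketch of a pipeline whose final stage parallelizes internally via n_jobs (RandomForestRegressor stands in for the question's huber_model, since the question's custom classes aren't shown):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=200, n_features=10, random_state=0)

pipe = Pipeline([
    ("SCALE", StandardScaler()),
    # n_jobs=-1 trains the forest's trees on all available cores,
    # even though the pipeline's stages still run one after another.
    ("MODELLING", RandomForestRegressor(n_estimators=100, n_jobs=-1,
                                        random_state=0)),
])
pipe.fit(X, y)
print(pipe.score(X, y))
```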
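Point 2 can be sketched roughly as follows. The class name and its parameters are my own invention for illustration, built from scikit-learn's public BaseEstimator/clone API and joblib:

```python
import numpy as np
from joblib import Parallel, delayed
from sklearn.base import BaseEstimator, RegressorMixin, clone
from sklearn.linear_model import HuberRegressor

class ChunkedAverageRegressor(BaseEstimator, RegressorMixin):
    """Fit clones of a base regressor on random chunks of the data in
    parallel; predict with the average of the fitted regressors."""

    def __init__(self, base_estimator, n_chunks=4, n_jobs=-1, random_state=0):
        self.base_estimator = base_estimator
        self.n_chunks = n_chunks
        self.n_jobs = n_jobs
        self.random_state = random_state

    def fit(self, X, y):
        rng = np.random.RandomState(self.random_state)
        idx = rng.permutation(len(X))
        chunks = np.array_split(idx, self.n_chunks)
        # Each chunk is learned on its own worker; fit() returns the
        # fitted clone, so we collect the trained estimators directly.
        self.estimators_ = Parallel(n_jobs=self.n_jobs)(
            delayed(clone(self.base_estimator).fit)(X[c], y[c])
            for c in chunks
        )
        return self

    def predict(self, X):
        return np.mean([est.predict(X) for est in self.estimators_], axis=0)

# Synthetic example: near-linear data fitted by four parallel Huber models.
X = np.random.RandomState(0).randn(200, 3)
y = X @ np.array([1.0, 2.0, 3.0]) + 0.01 * np.random.RandomState(1).randn(200)
model = ChunkedAverageRegressor(HuberRegressor()).fit(X, y)
print(model.predict(X).shape)
```

Because it subclasses BaseEstimator/RegressorMixin, such a wrapper can drop into a Pipeline like any other "MODELLING" step; the approximation quality relative to fitting on the full data is, as the answer says, your mileage may vary.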



Source: https://stackoverflow.com/questions/43785067/pararelization-of-sklearn-pipeline
