Question
I have a set of Pipelines and want to have multi-threaded architecture. My typical Pipeline is shown below:
from sklearn.pipeline import Pipeline

huber_pipe = Pipeline([
    ("DATA_CLEANER", DataCleaner()),
    ("DATA_ENCODING", Encoder(encoder_name='code')),
    ("SCALE", Normalizer()),
    ("FEATURE_SELECTION", huber_feature_selector),
    ("MODELLING", huber_model)
])
Is it possible to run the steps of the pipeline in different threads or cores?
Answer 1:
In general, no.
If you look at the interface of sklearn stages, the methods have the form:
fit(X, y, other_stuff)
predict(X)
That is, they work on the entire dataset, and can't do incremental learning on streams (or chunked streams) of data.
Moreover, fundamentally, some of the algorithms are not amenable to this. Consider for example your stage
("SCALE", Normalizer()),
Presumably, this normalizes using statistics such as the mean and/or variance. (Strictly speaking, sklearn's Normalizer rescales each sample independently; it is a stateful scaler such as StandardScaler that needs dataset-wide statistics.) Without seeing the entire dataset, how can such a stage know these things? It must therefore wait for the entire input before operating, and hence cannot run in parallel with the stages after it. Most (if not nearly all) stages are like that.
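A minimal illustration of this point, assuming StandardScaler (which, unlike Normalizer, is stateful): its learned statistics depend on the whole input, so fit() cannot begin until all the data has arrived, and fitting on a chunk would disagree with fitting on the full dataset.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0], [2.0], [3.0], [4.0]])

# The fitted mean is computed from every row of X.
scaler = StandardScaler().fit(X)
print(scaler.mean_)   # [2.5]

# Fitting on only the first half yields different statistics,
# so chunk-wise results would not match the full fit.
half = StandardScaler().fit(X[:2])
print(half.mean_)     # [1.5]
```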
However, in some cases you can still use multiple cores with sklearn.
Some stages have an n_jobs parameter. Such a stage still runs sequentially relative to the other stages, but can parallelize its own work internally.
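For example (using RandomForestRegressor and StandardScaler here as stand-ins, since the original DataCleaner, Encoder, etc. are the asker's own classes), n_jobs=-1 lets the forest build its trees on all available cores, even though the pipeline's stages still execute one after another:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=200, n_features=10, random_state=0)

pipe = Pipeline([
    ("SCALE", StandardScaler()),
    # n_jobs=-1: fit the individual trees in parallel on all cores.
    ("MODELLING", RandomForestRegressor(n_estimators=50, n_jobs=-1,
                                        random_state=0)),
])
pipe.fit(X, y)
print(pipe.score(X, y))
```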
In some cases you can roll your own (approximate) parallel versions of other stages. E.g., given any regressor stage, you can wrap it in a stage that randomly chunks your data into n parts, fits the parts in parallel, and outputs a regressor whose prediction is the average of all the fitted regressors. YMMV.
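The chunk-and-average idea above can be sketched as follows. All names here (ChunkedAverageRegressor, _fit_chunk, n_chunks) are illustrative, not part of any library:

```python
import numpy as np
from joblib import Parallel, delayed
from sklearn.base import BaseEstimator, RegressorMixin, clone


def _fit_chunk(estimator, X, y, idx):
    # Fit a fresh copy of the base estimator on one chunk of the data.
    return clone(estimator).fit(X[idx], y[idx])


class ChunkedAverageRegressor(BaseEstimator, RegressorMixin):
    def __init__(self, base_estimator, n_chunks=4, n_jobs=-1, random_state=0):
        self.base_estimator = base_estimator
        self.n_chunks = n_chunks
        self.n_jobs = n_jobs
        self.random_state = random_state

    def fit(self, X, y):
        # Randomly split the row indices into n_chunks disjoint parts
        # and fit one copy of the base estimator per part, in parallel.
        rng = np.random.RandomState(self.random_state)
        idx = rng.permutation(len(X))
        chunks = np.array_split(idx, self.n_chunks)
        self.estimators_ = Parallel(n_jobs=self.n_jobs)(
            delayed(_fit_chunk)(self.base_estimator, X, y, part)
            for part in chunks
        )
        return self

    def predict(self, X):
        # Average the predictions of the per-chunk regressors.
        return np.mean([est.predict(X) for est in self.estimators_], axis=0)


# Usage sketch:
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression

X, y = make_regression(n_samples=200, n_features=5, noise=0.1, random_state=0)
reg = ChunkedAverageRegressor(LinearRegression(), n_chunks=4).fit(X, y)
print(reg.predict(X).shape)
```

This is only an approximation of fitting the base estimator on the full dataset, and how well the averaged model performs depends heavily on the estimator, hence the "YMMV".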
Source: https://stackoverflow.com/questions/43785067/pararelization-of-sklearn-pipeline