Question
I have Python code that performs filtering on a matrix. I have created a C++ interface using pybind11 that runs successfully in a serialized fashion (please see the code below).
I am trying to parallelize it to hopefully reduce the computation time compared to the serialized version. To do this, I have split my array of size M×N into three sub-matrices of size M×(N/3) and process them in parallel using the same interface. I used the ppl.h library to create a parallel for-loop, and in each iteration I call the Python function on a sub-matrix of size M×(N/3).
#include <iostream>
#include <ppl.h>
#include "pybind11/embed.h"
#include <pybind11/iostream.h>
#include <pybind11/stl_bind.h>
#include "pybind11/eigen.h"
#include "pybind11/stl.h"
#include "pybind11/numpy.h"
#include "pybind11/functional.h"
#include <Eigen/Dense>

namespace py = pybind11;

class myClass
{
public:
    myClass()
    {
        m_module = py::module::import("myFilterScript");
        m_handle = m_module.attr("medianFilter");
    };

    void medianFilterSerialized(Eigen::Ref<Eigen::MatrixXf> input, int windowSize)
    {
        Eigen::MatrixXf output;
        output.resizeLike(input);
        output = m_handle(input, windowSize).cast<Eigen::MatrixXf>();
    };

    void medianFilterParallelizedUsingPPL(Eigen::Ref<Eigen::MatrixXf> input, int windowSize)
    {
        Eigen::MatrixXf output;
        output.resizeLike(input);
        /* Acquire GIL before calling Python code */
        //py::gil_scoped_acquire acquire;
        Concurrency::parallel_for(size_t(0), size_t(3), [&](size_t i)
        {
            output.block(0, i * input.cols() / 3, input.rows(), input.cols() / 3) =
                m_handle(input.block(0, i * input.cols() / 3, input.rows(), input.cols() / 3).array(),
                         windowSize).cast<Eigen::MatrixXf>();
        });
        //py::gil_scoped_release release;
    };

private:
    py::scoped_interpreter m_guard;
    py::module m_module;
    py::handle m_handle;
    py::object m_object;
};

int main()
{
    myClass c;
    Eigen::MatrixXf input = Eigen::MatrixXf::Random(240, 120);
    c.medianFilterSerialized(input, 3);
    c.medianFilterParallelizedUsingPPL(input, 3);
    return 0;
}
myFilterScript.py:
import threading
import numpy as np
import bottleneck as bn  # can be installed from https://pypi.org/project/Bottleneck/

def medianFilter(input, windowSize):
    return bn.move_median(input, window=windowSize, axis=0)
Regardless of using py::gil_scoped_acquire, my code crashes when it reaches the for-loop:
Access violation reading location // or:
Unhandled exception at 0x00007FF98BB8DB8E (ucrtbase.dll) in Pybind11_Parallelizing.exe: Fatal program exit requested.
Could someone kindly help me understand whether a loaded function of a Python module can be called in parallel, either in a multiprocessing or multithreading fashion? What am I missing in my code? Please let me know. Thanks in advance.
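For what it's worth, the column split itself checks out in pure NumPy: since the moving median runs along axis 0, filtering three column blocks independently matches filtering the whole matrix. A minimal sketch (the move_median below is a stand-in I wrote to mimic bn.move_median, including its NaN padding of the first window-1 rows; it is not Bottleneck itself):

```python
import numpy as np

def move_median(a, window):
    # Stand-in for bn.move_median(a, window=window, axis=0): the first
    # window-1 output rows are NaN, the rest are sliding-window medians.
    wins = np.lib.stride_tricks.sliding_window_view(a, window, axis=0)
    out = np.full(a.shape, np.nan, dtype=a.dtype)
    out[window - 1:] = np.median(wins, axis=-1)
    return out

rng = np.random.default_rng(0)
data = rng.random((240, 120)).astype(np.float32)

# Filter the whole M x N matrix at once...
whole = move_median(data, 3)

# ...and filter three M x (N/3) column blocks independently, then reassemble.
parts = np.hstack([move_median(b, 3) for b in np.hsplit(data, 3)])

# The filter runs down the columns, so splitting by columns is lossless.
assert np.allclose(whole, parts, equal_nan=True)
```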
Answer 1:
py::gil_scoped_acquire is a RAII object that acquires the GIL within a scope; similarly, py::gil_scoped_release is an "inverse" RAII object that releases the GIL within a scope. Thus, within the relevant scope, you only need the former.
The scope in which to acquire the GIL is the function that calls Python, thus inside the lambda that you pass to parallel_for: each executing thread needs to hold the GIL for accessing any Python objects or APIs, in this case m_handle. Doing so in the lambda, however, fully serializes the code, making the use of threads moot, so it would fix your problem for the wrong reasons.
This would be a case for using sub-interpreters for which there is no direct support in pybind11 (https://pybind11.readthedocs.io/en/stable/advanced/embedding.html#sub-interpreter-support), so the C API would be the ticket (https://docs.python.org/3/c-api/init.html#c.Py_NewInterpreter). Point being that the data operated on is non-Python and all operations are in principle independent.
However, you would need to know whether Bottleneck is thread-safe. From a cursory look, it appears that it is, as it has no global/static data AFAICT. In theory, there is then some room for parallelization: you need to hold the GIL when calling move_median while it enters the Cython code used to bind Bottleneck (it unboxes the variables, thus calling Python APIs); then Cython can release the GIL when entering Bottleneck's C code and re-acquire it on exit, followed by a release in the lambda when the RAII scope ends. The C code then runs in parallel.
But then the question becomes: why are you calling a C library from C++ through its Python bindings in the first place? The trivial solution here is to skip Python and call the move_median C function directly.
Source: https://stackoverflow.com/questions/60609340/pybind11-parallel-processing-issue-in-concurrencyparallel-for