Question
I wrote my own parameter search implementation, mostly because I don't need the cross-validation built into scikit-learn's GridSearchCV and RandomizedSearchCV.
I use Dask to get good distributed performance.
Here is what I have:
from scipy.stats import uniform

class Params(object):
    def __init__(self, fixed, loc=0.0, scale=1.0):
        self.fixed = fixed                          # dict of fixed parameters, e.g. {'niter': 50}
        self.sched = uniform(loc=loc, scale=scale)  # distribution the schedule is drawn from

    def _getsched(self, i, size):
        # reproducible draw: the try number doubles as the random seed
        return self.sched.rvs(size=size, random_state=i)

    def param(self, i, size=None):
        # return the fixed parameters plus a freshly drawn schedule 'schd'
        tmp = self.fixed.copy()
        if size is None:
            size = tmp['niter']
        tmp.update({'schd': self._getsched(i, size)})
        return tmp
class Mymodel(object):
    def __init__(self, func, params_object, score, ntries, client):
        self.params = params_object
        self.func = func
        self.score = score
        self.ntries = ntries
        self.client = client

    def _run(self, params, train, test):
        return self.func(params, train, test, self.score)

    def build(self, train, test):
        # submit one task per parameter draw and keep (params, future) pairs
        res = []
        for i in range(self.ntries):
            cparam = self.params.param(i)
            res.append((cparam, self.client.submit(self._run, cparam, train, test)))
        self._results = res
        return res

    def compute_optimal(self, res=None):
        # sort the computed (params, score) pairs by score and return the best one
        from operator import itemgetter
        if res is None:
            res = self._results
        self._sorted = sorted(self.client.compute(res), key=itemgetter(1))
        return self._sorted[0]
import numpy as np

def score(test, correct):
    return np.linalg.norm(test - correct)

def myfunc(params, ldata, data, score):
    schd = params['schd']
    niter = len(schd)
    # here I do some magic after which ldata is changing
    return score(test=ldata, correct=data)
After I start dask.distributed:
from distributed import Client

scheduler_host = 'myhostname:8786'
cli = Client(scheduler_host)
I run it like this:
%%time
params = Params({'niter': 50}, loc=1.0e-06, scale=1.0)
model = Mymodel(myfunc, params, score, 100, cli)
ptdata = bad_data_example.copy()
graph = model.build(ptdata, good_data)
And get this:
distributed.protocol.pickle - INFO - Failed to serialize
<bound method Mymodel._run of <__main__.Mymodel object at 0x2b8961903050>>.
Exception: can't pickle thread.lock objects
Could you please help me to understand what is going on and how to fix this?
I'm also curious about the way I find the minimum across all the parameter results. Is there a better way to do it with Dask?
I wrote this code fairly quickly and never tried it in serial. I'm learning Dask together with many other topics (machine learning, GPU programming, Numba, Python OOP, etc.), so this code is not optimal by any means...
P.S. To actually execute it I call model.compute_optimal(). I haven't got that far yet, due to the error above.
Answer 1:
It looks like the main issue was that I tried to submit a bound method of a class instance: the Mymodel object holds a reference to the Client, which contains thread locks that can't be pickled. I had similar issues with joblib as well. So I re-coded the problem and removed all the classes, as sketched below.
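A minimal sketch of that restructuring, assuming a module-level run_trial function (its body is a placeholder, like myfunc above) and reusing bad_data_example / good_data from the question; only plain functions and data get submitted, never an object that holds the Client, and the minimum is found by gathering the scores back:

import numpy as np
from scipy.stats import uniform
from distributed import Client

def make_params(i, niter=50, loc=1.0e-06, scale=1.0):
    # reproducible schedule: the try number doubles as the random seed
    schd = uniform(loc=loc, scale=scale).rvs(size=niter, random_state=i)
    return {'niter': niter, 'schd': schd}

def score(test, correct):
    return np.linalg.norm(test - correct)

def run_trial(params, ldata, data):
    # ... the actual work that modifies ldata goes here ...
    return score(test=ldata, correct=data)

cli = Client('myhostname:8786')

params_list = [make_params(i) for i in range(100)]
futures = [cli.submit(run_trial, p, bad_data_example, good_data) for p in params_list]

# pull all scores back to the client and pick the best parameter set
scores = cli.gather(futures)
best_params, best_score = min(zip(params_list, scores), key=lambda t: t[1])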
The follow-up questions about optimizing this search are posted here: Parameter search using dask
I'll definitely use dask-searchcv in my work when I need cross-validation, but for now it's really only a simple search for an optimal solution, so I had to create my own implementation...
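For completeness, a minimal sketch of what the dask-searchcv route would look like once cross-validation is needed, assuming its scikit-learn-compatible RandomizedSearchCV and a placeholder estimator and dataset:

import dask_searchcv as dcv
from scipy.stats import uniform
from sklearn.linear_model import Ridge
from sklearn.datasets import make_regression

# placeholder data; in practice this would be the real training set
X, y = make_regression(n_samples=200, n_features=5, random_state=0)

# cross-validated random search over alpha, with the search graph executed by dask
search = dcv.RandomizedSearchCV(Ridge(), {'alpha': uniform(1.0e-06, 1.0)}, n_iter=20)
search.fit(X, y)
print(search.best_params_, search.best_score_)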
Source: https://stackoverflow.com/questions/44559237/thread-lock-during-custom-parameter-search-class-using-dask-distributed