Increasing n_jobs has no effect on GridSearchCV

清歌不尽 2021-01-05 01:24

I have set up a simple experiment to check the importance of a multi-core CPU when running sklearn GridSearchCV with KNeighborsClassifier. The results

1 answer
  • 2021-01-05 02:04

    Here are some possible causes of this behaviour:

    • As the number of threads increases, there is a visible overhead for initializing and releasing each thread. I ran your code on my i7 7700HQ and saw the following behaviour with each increase in n_jobs:
      • with n_jobs=1 and n_jobs=2, the time per thread (the time GridSearchCV takes per model evaluation to fully train the model and test it) was 2.9s (overall time ~2 mins)
      • with n_jobs=3, the time was 3.4s (overall time 1.4 mins)
      • with n_jobs=4, the time was 3.8s (overall time 58 secs)
      • with n_jobs=5, the time was 4.2s (overall time 51 secs)
      • with n_jobs=6, the time was 4.2s (overall time ~49 secs)
      • with n_jobs=7, the time was 4.2s (overall time ~49 secs)
      • with n_jobs=8, the time was 4.2s (overall time ~49 secs)
    • As you can see, the time per thread increased while the overall time decreased (although beyond n_jobs=4 the difference was not exactly linear) and remained constant for n_jobs>=6. This is due to the cost incurred in initializing and releasing threads. See this GitHub issue and this issue.

    • Also, there might be other bottlenecks, such as the data being too large to be broadcast to all threads at the same time, thread pre-emption over RAM (or other resources), how the data is pushed into each thread, etc.

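    • I suggest you read about Amdahl's Law, which states that there is a theoretical bound on the speedup that can be achieved through parallelization. It is given by the formula S(n) = 1 / ((1 - p) + p / n), where p is the fraction of the work that can be parallelized and n is the number of parallel workers. (Image source: Amdahl's Law, Wikipedia)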

    • Finally, it might be due to the data size and the complexity of the model you use for training as well.
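To make the timings listed above easier to compare, here is a minimal sketch that turns the reported overall times into speedup and parallel efficiency relative to the n_jobs=1 run. The conversions of "~2 mins" to 120 s and "1.4 mins" to 84 s are approximations, and the names `overall_seconds` and `speedup_table` are made up for this illustration. Note that the answer reports the same "~2 mins" for both n_jobs=1 and n_jobs=2.

```python
# Overall wall-clock times reported in the answer (i7 7700HQ), in seconds.
# Assumption: "~2 mins" is taken as 120 s and "1.4 mins" as 84 s.
overall_seconds = {1: 120, 2: 120, 3: 84, 4: 58, 5: 51, 6: 49, 7: 49, 8: 49}


def speedup_table(times, baseline_jobs=1):
    """Return {n_jobs: (speedup, efficiency)} relative to the baseline run."""
    base = times[baseline_jobs]
    table = {}
    for n_jobs, seconds in sorted(times.items()):
        speedup = base / seconds          # how many times faster than the baseline
        efficiency = speedup / n_jobs     # speedup obtained per requested job
        table[n_jobs] = (speedup, efficiency)
    return table


table = speedup_table(overall_seconds)
for n_jobs, (speedup, efficiency) in table.items():
    print(f"n_jobs={n_jobs}: speedup {speedup:.2f}x, efficiency {efficiency:.0%}")
```

With these approximate numbers the speedup levels off at roughly 2.4x from n_jobs=6 onwards, while the efficiency per job keeps dropping, which matches the plateau described in the answer.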

    Here is a blog post explaining the same issue regarding multithreading.
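The bound from Amdahl's Law mentioned in the answer can also be sketched in a few lines. The function below evaluates S(n) = 1 / ((1 - p) + p / n), where p is the parallelizable fraction of the work and n is the number of workers. The function name `amdahl_speedup` and the value p = 0.75 are assumptions chosen only for illustration; the answer does not measure p.

```python
def amdahl_speedup(p, n):
    """Theoretical speedup with n workers when a fraction p of the work is parallelizable."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be between 0 and 1")
    if n < 1:
        raise ValueError("n must be at least 1")
    return 1.0 / ((1.0 - p) + p / n)


# Assumption: p = 0.75 is an illustrative value, not one measured in the answer.
p = 0.75
for n in (1, 2, 4, 8):
    print(f"n={n}: at most {amdahl_speedup(p, n):.2f}x")

# As n grows, the speedup approaches 1 / (1 - p), which is 4x for p = 0.75.
```

However many workers are added, the serial fraction (1 - p) caps the achievable speedup, so a curve that flattens as n_jobs grows is the expected shape.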
