Increasing n_jobs has no effect on GridSearchCV


I have set up a simple experiment to check the importance of a multi-core CPU when running sklearn GridSearchCV with KNeighborsClassifier. The results


1 answer

天命终不由人

2021-01-05 02:04
Here are some reasons that might cause this behaviour:
- As the number of threads increases, there is an apparent overhead for initializing and releasing each thread. I ran your code on my i7 7700HQ and saw the following behaviour with each increase in n_jobs:
  - with n_jobs=1 and n_jobs=2, the time per thread (the time GridSearchCV takes per model evaluation to fully train the model and test it) was 2.9s (overall time ~2 mins)
  - with n_jobs=3, the time was 3.4s (overall time 1.4 mins)
  - with n_jobs=4, the time was 3.8s (overall time 58 secs)
  - with n_jobs=5, the time was 4.2s (overall time 51 secs)
  - with n_jobs=6, the time was 4.2s (overall time ~49 secs)
  - with n_jobs=7, the time was 4.2s (overall time ~49 secs)
  - with n_jobs=8, the time was 4.2s (overall time ~49 secs)
- As you can see, the time per thread increased while the overall time decreased (although beyond n_jobs=4 the difference was not exactly linear) and then remained constant for n_jobs>=6. This is because there is a cost to initializing and releasing threads. See this GitHub issue and this issue.
- There might also be other bottlenecks, such as the data being too large to be broadcast to all threads at the same time, thread pre-emption over RAM (or other resources), how data is pushed into each thread, etc.
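- I suggest you read about Amdahl's Law, which states that there is a theoretical bound on the speedup that can be achieved through parallelization, given by the formula S = 1 / ((1 - p) + p / n), where p is the fraction of the work that can be parallelized and n is the number of parallel workers (source: Amdahl's Law, Wikipedia).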
- Finally, it might be due to the data size and the complexity of the model you use for training as well.
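To make the bound from Amdahl's Law concrete, here is a minimal sketch that evaluates it for a few worker counts. The function name `amdahl_speedup` and the parallel fraction p = 0.9 are assumptions chosen for illustration; they are not values measured in this thread.

```python
def amdahl_speedup(p: float, n: int) -> float:
    """Upper bound on speedup when a fraction p of the work runs on n workers."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be between 0 and 1")
    if n < 1:
        raise ValueError("n must be at least 1")
    # The serial part (1 - p) is unaffected; only the parallel part shrinks.
    return 1.0 / ((1.0 - p) + p / n)


# p = 0.9 is an illustrative assumption, not a measured value.
for n in (1, 2, 4, 8):
    print(f"n_jobs={n}: at most {amdahl_speedup(0.9, n):.2f}x faster")
```

Even with 90% of the work parallelizable, the bound flattens quickly as workers are added, which fits the plateau described above.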
Here is a blog post explaining the same issue regarding multithreading.
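
If you want to repeat this kind of measurement on your own machine, something like the following should work. It is only a rough sketch: the synthetic dataset, the parameter grid and the helper name `time_grid_search` are my own assumptions, not the code from the question. Note that scikit-learn hands `n_jobs` to joblib, whose default backend in recent versions starts worker processes rather than threads, so the startup cost can be noticeable on small problems.

```python
import time

from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier


def time_grid_search(n_jobs, X, y):
    """Return wall-clock seconds for one GridSearchCV fit with the given n_jobs."""
    search = GridSearchCV(
        KNeighborsClassifier(),
        param_grid={"n_neighbors": [3, 5, 7]},  # hypothetical grid for illustration
        cv=3,
        n_jobs=n_jobs,
    )
    start = time.perf_counter()
    search.fit(X, y)
    return time.perf_counter() - start


if __name__ == "__main__":
    # Synthetic data stands in for the dataset used in the question.
    X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
    for n_jobs in (1, 2, 4):
        print(f"n_jobs={n_jobs}: {time_grid_search(n_jobs, X, y):.2f}s")
```

On a problem this small, the parallel runs may well be no faster than n_jobs=1, because the fixed cost of starting workers and shipping data to them dominates.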