I have a model I'm trying to build using LogisticRegression in sklearn that has a couple thousand features and approximately 60,000 samples. I'm try
The default solver for LogisticRegression in sklearn was liblinear when this was written (since version 0.22 the default is lbfgs). liblinear is a suitable solver for small to medium datasets. For large datasets, try the stochastic average gradient solvers such as sag:
from sklearn.linear_model import LogisticRegression
model = LogisticRegression(solver='sag')
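A slightly fuller sketch of the same solver switch, run on synthetic data standing in for the real 60,000-sample set. The dataset shape, the `max_iter` value, and the scaling step are assumptions added for illustration and are not part of the original answer. The scaling is there because the scikit-learn documentation notes that sag only converges quickly when features are on roughly the same scale.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real data (sizes chosen only to keep the demo fast).
X, y = make_classification(
    n_samples=5000, n_features=200, n_informative=20, random_state=0
)

# Scale first: 'sag' converges quickly only on similarly scaled features.
model = make_pipeline(
    StandardScaler(),
    LogisticRegression(solver="sag", max_iter=200, random_state=0),
)
model.fit(X, y)

train_accuracy = model.score(X, y)
print(f"training accuracy: {train_accuracy:.3f}")
```

If a convergence warning appears, raising `max_iter` is usually the first thing to try.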
Try reducing the data set size and loosening the tolerance parameter. For example, you can try classifier = LogisticRegression(tol=0.1)
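Both suggestions can be sketched on synthetic data as follows. The data, the subsample size of 1,000, and the choice of the sag solver are illustrative assumptions. The comparison shows how a looser `tol` affects the number of iterations the solver runs before stopping.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real data.
X, y = make_classification(n_samples=5000, n_features=100, random_state=0)
X = StandardScaler().fit_transform(X)

# Suggestion 1: fit on a random subsample instead of the full data set.
rng = np.random.default_rng(0)
subset = rng.choice(len(X), size=1000, replace=False)
small = LogisticRegression(solver="sag", max_iter=500, random_state=0)
small.fit(X[subset], y[subset])

# Suggestion 2: loosen the stopping tolerance (the default is tol=1e-4).
tight = LogisticRegression(solver="sag", tol=1e-4, max_iter=500, random_state=0)
loose = LogisticRegression(solver="sag", tol=0.1, max_iter=500, random_state=0)
tight.fit(X, y)
loose.fit(X, y)

print("iterations with tol=1e-4:", tight.n_iter_[0])
print("iterations with tol=0.1: ", loose.n_iter_[0])
```

A looser tolerance trades some precision in the coefficients for fewer passes over the data, so it is worth checking that the held-out accuracy is still acceptable.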
Worth noting that LogisticRegression() now accepts n_jobs as an input, which defaults to a single job.
Would have commented on the accepted answer, but not enough points.
In the current version of scikit-learn, LogisticRegression() now has an n_jobs parameter to utilize multiple cores.
However, the actual text of the user guide suggests that multiple cores are still only being utilized during the second half of the computation. As of this update, the revised user guide for LogisticRegression now says that n_jobs chooses the "Number of CPU cores used during the cross-validation loop", whereas the other two items cited in the original response, RandomForestClassifier() and RandomForestRegressor(), both state that n_jobs specifies "The number of jobs to run in parallel for both fit and predict". In other words, the deliberate contrast in phrasing seems to be pointing out that the n_jobs parameter in LogisticRegression(), while now implemented, is not implemented as completely, or in the same way, as in the other two cases.
Thus, while it may now be possible to speed up LogisticRegression() somewhat by using multiple cores, my guess is that the speedup won't scale linearly with the number of cores used, as it sounds like the initial "fit" step (the first half of the algorithm) may not lend itself well to parallelization.
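To make the "cross-validation loop" wording concrete, here is a minimal sketch that assumes the loop in question is the one run by the cross-validated variant, LogisticRegressionCV, whose n_jobs parameter is documented with that same phrase. The synthetic data and the values of `Cs`, `cv`, and `n_jobs` are illustrative choices, not taken from the answer.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegressionCV

# Synthetic stand-in for the real data.
X, y = make_classification(n_samples=2000, n_features=50, random_state=0)

# n_jobs sets how many workers the cross-validation loop may use.
# Here 5 candidate values of C are evaluated over 4 folds.
clf = LogisticRegressionCV(Cs=5, cv=4, n_jobs=2, max_iter=500)
clf.fit(X, y)

cv_train_accuracy = clf.score(X, y)
print("selected C:", clf.C_)
print(f"training accuracy: {cv_train_accuracy:.3f}")
```

Timing this with `n_jobs=1` versus a larger value on the real data is the simplest way to see how much of the work actually parallelizes.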
To my eye, it looks like the major issue here isn't memory, it's that you are only using one core. According to top, you are loading the system at 4.34%. If your logistic regression process is monopolizing 1 core out of 24, then that comes out to 100/24 = 4.167%. Presumably the remaining 0.17% accounts for whatever other processes you are also running on the machine, and they are allowed to take up an extra 0.17% because they are being scheduled by the system to run in parallel on a 2nd, different core.
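The arithmetic above can be checked in a few lines. The two input numbers are the ones quoted in this answer, and the variable names are just for illustration.

```python
n_cores = 24          # cores on the machine, as quoted above
reported_load = 4.34  # total CPU load in percent, as read from top

# Share of the whole machine taken by one fully busy core.
one_core_share = 100 / n_cores

# Whatever is left over is attributed to other processes.
leftover = reported_load - one_core_share

print(f"one core: {one_core_share:.3f}%, leftover: {leftover:.2f}%")
```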
If you follow the links below and look at the scikit-learn API, you'll see that some of the ensemble methods such as RandomForestClassifier() or RandomForestRegressor() have an input parameter called n_jobs which directly controls the number of cores on which the package will attempt to run in parallel. The class that you are using, LogisticRegression(), doesn't define this input. The designers of scikit-learn seem to have created an interface which is generally pretty consistent between classes, so if a particular input parameter is not defined for a given class, it probably means that the developers simply could not figure out a way to implement the option in a meaningful way for that class. It may be the case that the logistic regression algorithm simply doesn't lend itself well to parallelization; i.e., the potential speedup that could have been achieved just wasn't good enough to have justified implementing it with a parallel architecture.
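For reference, this is roughly what the n_jobs input looks like on one of the ensemble classes mentioned above. It is a minimal sketch on synthetic data; the dataset and the `n_estimators` value are assumptions made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the real data.
X, y = make_classification(n_samples=2000, n_features=50, random_state=0)

# n_jobs=-1 asks scikit-learn to use all available cores.
# The individual trees are independent, so they can be built in parallel.
forest = RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=0)
forest.fit(X, y)

predictions = forest.predict(X)
print("trees fitted:", len(forest.estimators_))
```

Watching top while this runs on a larger data set should show the load spread over several cores rather than pinned to one.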
Assuming that this is the case, then no, there's not much you can do to make your code go faster. 24 cores don't help you if the underlying library functions simply weren't designed to take advantage of them.