Question
I am using logistic regression with the L1 norm (LASSO). I have opted to use the glmnet package in R and LogisticRegression() from sklearn.linear_model in Python. From my understanding these should give the same results; however, they do not.
Note that I did not scale my data.
For Python I used the link below as a reference:
https://chrisalbon.com/machine_learning/logistic_regression/logistic_regression_with_l1_regularization/
and for R I used this one:
http://www.sthda.com/english/articles/36-classification-methods-essentials/149-penalized-logistic-regression-essentials-in-r-ridge-lasso-and-elastic-net/
Here is the code I used in R:
###################################
#### LASSO LOGISTIC REGRESSION ####
###################################
library(glmnet)

# Design matrix (drop the intercept column) and response
x <- model.matrix(Y ~ ., Train.Data.SubPop)[, -1]
y <- Train.Data.SubPop$Y

# Candidate penalty values for cross-validation
lambda_seq <- c(0.0001, 0.01, 0.05, 0.0025)
cv_output <- cv.glmnet(x, y, alpha = 1, family = "binomial", lambda = lambda_seq)
cv_output$lambda.min

# Refit at the cross-validated optimum
lasso_best <- glmnet(x, y, alpha = 1, family = "binomial", lambda = cv_output$lambda.min)
Below is my Python code:
from sklearn.linear_model import LogisticRegression

# Candidate values of the inverse regularization strength
C = [0.001, 0.01, 0.05, 0.0025]
for c in C:
    clf = LogisticRegression(penalty='l1', C=c, solver='liblinear')
    clf.fit(X_train, y_train)
    print('C:', c)
    print('Coefficient of each feature:', clf.coef_)
    print('Training accuracy:', clf.score(X_train, y_train))
    print('Test accuracy:', clf.score(X_test, y_test))
    print('')
When I extracted the optimal value from the cv.glmnet() function in R, it reported an optimal lambda of 0.0001; however, in the Python analysis the best accuracy, precision, and recall came from 0.05. When I fit the model with 0.05 in R, it gave me only 1 non-zero coefficient, but in Python I had 7.
Can someone help me understand these discrepancies and differences, please? Also, if someone could guide me on how to replicate the Python code in R, that would be very helpful!
Answer 1:
At a glance I see several issues:
1. Typo: Looking at your code, in R your first lambda is 0.0001. In Python, your first C is 0.001.
2. Different parameterization: Looking at the documentation, I think there's a clue in the names lambda in R and C in Python being different. In glmnet, a higher lambda means more shrinkage. In the sklearn docs, however, C is described as "the inverse of regularization strength... smaller values specify stronger regularization". (See the first sketch after this list.)
3. Scaling: you say, "Note that I did not scale my data." This is incorrect. In R, you did: glmnet has a standardize argument for scaling the data, and its default is TRUE. In Python, you didn't.
4. Use of cross-validation: in R, you use cv.glmnet to do k-fold cross-validation on your training set. In Python, you use LogisticRegression, not LogisticRegressionCV, so there is no cross-validation. Note that cross-validation relies on random sampling, so even if you use CV in both, you should expect the results to be close, but not exact matches. (See the second sketch below.)
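To make points 2 and 3 concrete, here is a minimal Python sketch, not code from the question. It assumes the objectives as documented by the two libraries: glmnet minimizes (1/N)*loss + lambda*||w||_1 while sklearn minimizes C*loss + ||w||_1, so the two grids line up when C = 1/(N*lambda). X_train and y_train are the arrays from the question. Even with this translation, expect close rather than identical coefficients: the solvers differ, and glmnet reports coefficients back on the original scale.

# Sketch: translate glmnet's lambda grid into sklearn's C
# (assumes C = 1/(N*lambda), per the objectives described above)
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

lambda_seq = [0.0001, 0.01, 0.05, 0.0025]   # the lambda grid from the R code
n = X_train.shape[0]                        # X_train, y_train as in the question

# glmnet standardizes by default (standardize = TRUE), so standardize here too;
# alternatively, pass standardize = FALSE on the R side.
scaler = StandardScaler().fit(X_train)
X_train_std = scaler.transform(X_train)

for lam in lambda_seq:
    c = 1.0 / (n * lam)                     # glmnet lambda -> sklearn C
    clf = LogisticRegression(penalty='l1', C=c, solver='liblinear')
    clf.fit(X_train_std, y_train)
    print('lambda = %g  ->  C = %g,  non-zero coefficients: %d'
          % (lam, c, np.count_nonzero(clf.coef_)))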
There are possibly other issues too.
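On the cross-validation point specifically, here is a sketch that roughly mirrors cv.glmnet on the Python side. It reuses n, lambda_seq, and X_train_std from the sketch above, and scoring='neg_log_loss' is my stand-in for glmnet's default binomial deviance measure.

# Sketch: cross-validated selection over the same (translated) grid,
# roughly mirroring cv.glmnet's 10-fold default.
from sklearn.linear_model import LogisticRegressionCV

Cs = [1.0 / (n * lam) for lam in lambda_seq]   # same grid as R, translated
cv_clf = LogisticRegressionCV(Cs=Cs, cv=10, penalty='l1',
                              solver='liblinear', scoring='neg_log_loss')
cv_clf.fit(X_train_std, y_train)
best_C = cv_clf.C_[0]
print('best C:', best_C, '-> lambda:', 1.0 / (n * best_C))

Because the folds are sampled randomly on both sides, the selected lambda can still differ from cv.glmnet's lambda.min from run to run.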
Source: https://stackoverflow.com/questions/57855392/comparing-the-glmnet-output-of-r-with-python-using-logisticregression