Comparison of R, statsmodels, sklearn for a classification task with logistic regression

感情败类 2020-12-14 21:38

I have run some experiments with logistic regression in R, Python statsmodels, and sklearn. While the results given by R and statsmodels agree, there is some discrepancy with

2 answers
  • 2020-12-14 22:08

    I ran into a similar issue and ended up posting about it on /r/MachineLearning. It turns out the difference can be attributed to data standardization: whatever approach scikit-learn uses to find the model parameters yields better results if the data is standardized. The scikit-learn documentation has a section on preprocessing data, including standardization.

    Results

    Number of 'default' values : 333
    Intercept: [-6.12556565]
    Coefficients: [[ 2.73145133  0.27750788]]
    
    Confusion matrix
    [[9629   38]
     [ 225  108]]
    
    Score          0.9737
    Precision      0.7397
    Recall         0.3243
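
    As a quick sanity check, the three summary numbers can be recomputed from the confusion matrix alone. This is a minimal sketch that hard-codes the matrix printed above and assumes scikit-learn's layout (rows are true classes, columns are predicted classes); the names tn, fp, fn, tp are only illustrative.

```python
import numpy as np

# Confusion matrix copied from the results above.
# Assumed layout (scikit-learn convention): rows = true class, columns = predicted class.
confusion = np.array([[9629, 38],
                      [225, 108]])

tn, fp, fn, tp = confusion.ravel()

accuracy = (tp + tn) / confusion.sum()   # share of all cases classified correctly
precision = tp / (tp + fp)               # share of predicted defaults that are real defaults
recall = tp / (tp + fn)                  # share of real defaults that were predicted

print("Score     {0:.4f}".format(accuracy))
print("Precision {0:.4f}".format(precision))
print("Recall    {0:.4f}".format(recall))
```

    The second row sums to 333, the number of 'default' cases, so the recall says that roughly a third of the actual defaults are caught at the 0.5 threshold.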
    

    Code

    # scikit-learn vs. R
    # http://stackoverflow.com/questions/28747019/comparison-of-r-statmodels-sklearn-for-a-classification-task-with-logistic-reg
    
    import pandas as pd
    import sklearn
    
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import confusion_matrix
    from sklearn import preprocessing
    
    # Data is available here.
    Default = pd.read_csv('https://d1pqsl2386xqi9.cloudfront.net/notebooks/Default.csv', index_col = 0)
    
    Default['default'] = Default['default'].map({'No':0, 'Yes':1})
    Default['student'] = Default['student'].map({'No':0, 'Yes':1})
    
    I = Default['default'] == 0
    print("Number of 'default' values : {0}".format(Default[~I]['balance'].count()))
    
    feats = ['balance', 'income']
    
    Default[feats] = preprocessing.scale(Default[feats])
    
    # C = 1e6 ~ no regularization.
    classifier = LogisticRegression(C = 1e6, random_state = 42) 
    
    classifier.fit(Default[feats], Default['default'])  #fit classifier on whole base
    print("Intercept: {0}".format(classifier.intercept_))
    print("Coefficients: {0}".format(classifier.coef_))
    
    y_true = Default['default']
    y_pred_cls = classifier.predict_proba(Default[feats])[:,1] > 0.5
    
    confusion = confusion_matrix(y_true, y_pred_cls)
    score = float((confusion[0, 0] + confusion[1, 1])) / float((confusion[0, 0] + confusion[1, 1] + confusion[0, 1] + confusion[1, 0]))
    precision = float((confusion[1, 1])) / float((confusion[1, 1] + confusion[0, 1]))
    recall = float((confusion[1, 1])) / float((confusion[1, 1] + confusion[1, 0]))
    print("\nConfusion matrix")
    print(confusion)
    print('\n{s:{c}<{n}}{num:2.4}'.format(s = 'Score', n = 15, c = '', num = score))
    print('{s:{c}<{n}}{num:2.4}'.format(s = 'Precision', n = 15, c = '', num = precision))
    print('{s:{c}<{n}}{num:2.4}'.format(s = 'Recall', n = 15, c = '', num = recall))
    
  • 2020-12-14 22:25

    Although this post is old, I wanted to give you a solution. In your post you are comparing apples with oranges. In your R code, you regress "default" on "balance", "income", and "student". In your Python code, you regress "default" on "balance" and "income" only. Of course, you cannot get the same estimates. Also, the differences cannot be attributed to feature scaling, because logistic regression, unlike k-means, usually does not need it.

    You are right to set a high C, so that there is effectively no regularization. If you want the same output as in R, you have to change the solver to "newton-cg". Different solvers can give different results, but they still yield the same objective value. As long as your solver converges, everything will be okay.

    Here's the code that gives you the same estimates as R and statsmodels:

    import pandas as pd
    from sklearn.linear_model import LogisticRegression
    from patsy import dmatrices  # builds design matrices from an R-style formula
    import numpy as np
    
    # data is available here
    Default = pd.read_csv('https://d1pqsl2386xqi9.cloudfront.net/notebooks/Default.csv', index_col=0)
    # encode the Yes/No columns as 1/0
    Default['default']=Default['default'].map({'No':0, 'Yes':1})
    Default['student']=Default['student'].map({'No':0, 'Yes':1})
    
    # use dmatrices to get data frame for logistic regression
    y, X = dmatrices('default ~ balance+income+C(student)',
                      Default,return_type="dataframe")
    
    y = np.ravel(y)
    
    # fit logistic regression
    model = LogisticRegression(C = 1e6, fit_intercept=False, solver = "newton-cg", max_iter=10000000)
    model = model.fit(X, y)
    
    # examine the coefficients
    pd.DataFrame(list(zip(X.columns, model.coef_.ravel())))
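
    The points about solvers and feature scaling can also be checked without downloading anything. The following is only a sketch on synthetic data (not the Default data set): the two features, the true coefficients, the tolerance, and the helper name fit are all made up for illustration. It fits an almost unregularized model on raw and on standardized features and compares the results.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic two-feature data; scales and coefficients are arbitrary assumptions.
rng = np.random.default_rng(0)
n = 2000
X = np.column_stack([
    rng.normal(5.0, 3.0, n),
    rng.normal(20.0, 10.0, n),
])
logit = -1.0 + 0.4 * (X[:, 0] - 5.0) - 0.1 * (X[:, 1] - 20.0)
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-logit))).astype(int)


def fit(features, solver):
    # C = 1e6 ~ no regularization; tight tolerance so the solvers are comparable.
    model = LogisticRegression(C=1e6, solver=solver, tol=1e-8, max_iter=10000)
    return model.fit(features, y)


scaler = StandardScaler().fit(X)
X_std = scaler.transform(X)

raw = fit(X, "newton-cg")              # raw features
std_newton = fit(X_std, "newton-cg")   # standardized features
std_lbfgs = fit(X_std, "lbfgs")        # standardized features, different solver

# Map the standardized coefficients back to the original units.
coef_back = std_newton.coef_ / scaler.scale_

p_raw = raw.predict_proba(X)[:, 1]
p_std = std_newton.predict_proba(X_std)[:, 1]

print("raw coefficients:         ", raw.coef_)
print("standardized, rescaled:   ", coef_back)
print("newton-cg vs lbfgs (std): ", std_newton.coef_, std_lbfgs.coef_)
print("max probability gap:      ", np.abs(p_raw - p_std).max())
```

    If both fits converge, standardizing only rescales the coefficients and leaves the predicted probabilities essentially unchanged, so remaining differences between packages point to the model formula, the regularization, or a solver that stopped early.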
    