logistic-regression

Large fixed effects binomial regression in R

Submitted by 最后都变了 on 2019-12-20 09:38:50
Question: I need to run a logistic regression on a relatively large data frame with 480,000 entries and 3 fixed-effect variables. Fixed-effect var A has 3233 levels, var B has 2326 levels, var C has 811 levels, so all in all I have 6370 fixed effects. The data is cross-sectional. I can't run this regression using the normal glm function because the model matrix seems too large for my memory (I get the message "Error: cannot allocate vector of size 22.9 Gb"). I am looking for alternative ways
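The usual workaround for this memory blow-up is to never materialize a dense model matrix for the factor dummies. A minimal Python sketch of the idea using scipy sparse indicator matrices (the row and level counts here are small stand-ins for the 480,000 rows and thousands of levels in the question; a sparse-aware solver such as scikit-learn's LogisticRegression can then fit on `X` directly):

```python
import numpy as np
from scipy import sparse

rng = np.random.default_rng(0)
n = 10_000                      # stand-in for the 480,000 rows in the question
levels_a, levels_b = 300, 200   # stand-ins for the 3233/2326/811 factor levels

a = rng.integers(0, levels_a, n)   # level index of factor A for each row
b = rng.integers(0, levels_b, n)   # level index of factor B for each row

# One-hot encode each factor as a sparse indicator matrix instead of the
# dense model matrix that glm builds (which is what exhausts memory).
rows = np.arange(n)
ones = np.ones(n)
A = sparse.coo_matrix((ones, (rows, a)), shape=(n, levels_a)).tocsr()
B = sparse.coo_matrix((ones, (rows, b)), shape=(n, levels_b)).tocsr()
X = sparse.hstack([A, B]).tocsr()

# X stores only the non-zero entries: 2 per row instead of 500 columns.
print(X.shape, X.nnz)
```

In R itself, packages that exploit the same sparsity (e.g. speedglm, or fixed-effects estimators like fixest::feglm) avoid the 22.9 Gb allocation in the same way.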

What is the inverse of regularization strength in Logistic Regression? How should it affect my code?

Submitted by 北慕城南 on 2019-12-20 08:37:31
Question: I am using sklearn.linear_model.LogisticRegression in scikit-learn to run a logistic regression. C : float, optional (default=1.0) — inverse of regularization strength; must be a positive float. Like in support vector machines, smaller values specify stronger regularization. What does C mean here in simple terms, please? What is regularization strength? Answer 1: Regularization is applying a penalty to increasing the magnitude of parameter values in order to reduce overfitting. When you train a model
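The answer can be made concrete with a small experiment on synthetic data: a smaller C means a stronger L2 penalty, so the fitted coefficients are pulled harder toward zero.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=200) > 0).astype(int)

# Smaller C = stronger L2 penalty = coefficients shrunk toward zero.
strong = LogisticRegression(C=0.01).fit(X, y)
weak = LogisticRegression(C=100.0).fit(X, y)

print(np.abs(strong.coef_).sum(), np.abs(weak.coef_).sum())
```

The total coefficient magnitude under C=0.01 comes out well below the one under C=100, which is the "inverse strength" behaviour the docstring describes.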

statsmodels Python package: how exactly are duplicated features handled?

Submitted by 纵然是瞬间 on 2019-12-20 07:42:29
Question: I am a heavy R user and am recently learning Python. I have a question about how statsmodels.api handles duplicated features. In my understanding, this function is a Python version of glm in R. So I am expecting the function to return the maximum likelihood estimates (MLE). My question is: which algorithm does statsmodels employ to obtain the MLE? In particular, how does the algorithm handle the situation with duplicated features? To clarify my question, I generate a sample of size 50 from
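For intuition, the linear-regression analogue of the duplicated-feature situation can be reproduced with NumPy alone. A pseudo-inverse-based solver (which, as far as I understand, is also what statsmodels' default fitting machinery falls back to for rank-deficient designs) returns the minimum-norm solution, splitting the coefficient evenly between the two identical columns; the data here is synthetic.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(50, 1))
X = np.hstack([x, x])               # feature duplicated: design matrix is rank 1
y = 3.0 * x[:, 0] + rng.normal(scale=0.1, size=50)

# lstsq with rcond=None solves via SVD / pseudo-inverse, returning the
# minimum-norm solution: the coefficient mass is split between the copies.
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta)          # roughly [1.5, 1.5] rather than [3, 0]
```

So rather than failing on the singular design, a pinv-style fit quietly distributes the weight, which is worth knowing when comparing against R's glm (which instead drops the aliased column and reports NA).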

How does multinom() treat NA values by default?

Submitted by 旧巷老猫 on 2019-12-20 07:35:42
Question: When I am running multinom(), say Y ~ X1 + X2 + X3, if for one particular row X1 is NA (i.e. missing) but Y, X2 and X3 all have a value, would this entire row be thrown out (like it is in SAS)? How are missing values treated in multinom()? Answer 1: Here is a simple example (from ?multinom in the nnet package) to explore the different na.action options: > library(nnet) > library(MASS) > example(birthwt) > (bwt.mu <- multinom(low ~ ., bwt)) Intentionally create a NA value: > bwt[1,"age"] <- NA #
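For comparison, the same default behaviour (listwise deletion, R's na.omit) is easy to emulate in Python with pandas; the tiny data frame here is made up:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Y":  ["a", "b", "a", "b"],
    "X1": [1.0, np.nan, 3.0, 4.0],   # one missing predictor value
    "X2": [0.1, 0.2, 0.3, 0.4],
})

# Listwise deletion, which is what R's default na.action = na.omit does:
# the whole row with the missing X1 is dropped before fitting.
complete = df.dropna()
print(len(complete))   # 3
```

So yes: under the default na.action, the row with the NA in X1 is removed in its entirety, even though Y, X2 and X3 are observed.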

glmnet: How do I know which factor level of my response is coded as 1 in logistic regression

Submitted by 倾然丶 夕夏残阳落幕 on 2019-12-19 18:29:27
Question: I have a logistic regression model that I made using the glmnet package. My response variable was coded as a factor, the levels of which I will refer to as "a" and "b". The mathematics of logistic regression label one of the two classes as "0" and the other as "1". The feature coefficients of a logistic regression model are either positive, negative, or zero. If a feature "f"'s coefficient is positive, then increasing the value of "f" for a test observation x increases the probability that
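In glmnet, the second (last) level of the response factor is the one coded as "1", so with levels "a" and "b", positive coefficients push predictions toward "b". The scikit-learn analogue makes the same mapping explicit through the classes_ attribute; the toy data below is made up:

```python
from sklearn.linear_model import LogisticRegression

X = [[0.0], [1.0], [2.0], [3.0]]
y = ["a", "a", "b", "b"]

model = LogisticRegression().fit(X, y)

# classes_ is sorted; the second entry is the class treated as "1",
# i.e. the one whose probability predict_proba's second column reports.
print(model.classes_)        # -> 'b' is coded as 1
```

Checking the library's own record of the level order like this is safer than inferring it from coefficient signs.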

Python scikit-learn to JSON

Submitted by 拈花ヽ惹草 on 2019-12-19 10:25:19
Question: I have a model built with Python scikit-learn. I understand that models can be saved in Pickle or Joblib formats. Are there any existing methods out there to save the models in JSON format? Please see the model-building code below for reference: import pandas from sklearn import model_selection from sklearn.linear_model import LogisticRegression import pickle url = "https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data" names = ['preg', 'plas
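One lightweight approach (a sketch, not a full replacement for pickle, since it captures only the learned arrays and not the hyperparameters) is to dump the fitted model's coef_, intercept_, and classes_ to JSON via .tolist(). The arrays below are hypothetical stand-ins for those attributes:

```python
import json

import numpy as np

# Stand-ins for a fitted LogisticRegression's learned attributes
# (hypothetical values; in practice read them off the fitted model):
coef = np.array([[0.4, -1.2, 0.03]])
intercept = np.array([-0.7])

# numpy arrays are not JSON-serializable directly; convert with .tolist()
payload = json.dumps({"coef": coef.tolist(), "intercept": intercept.tolist()})

# Round-trip: rebuild the arrays from the JSON string.
restored = json.loads(payload)
assert np.allclose(np.array(restored["coef"]), coef)
```

To reconstruct a working model you would create a fresh estimator and assign these arrays back onto it, so this works best for simple linear models where the learned state is just a few arrays.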

How can I use multi-core processing to run the glm function faster?

Submitted by 纵饮孤独 on 2019-12-18 21:26:44
Question: I'm a bit new to R and I would like to use a package that allows multi-core processing in order to run the glm function faster. I wonder if there is a syntax that I can use for this. Here is an example glm model that I wrote; can I add a parameter that will use multiple cores? g <- glm(IsChurn ~ ., data=dat, family='binomial') Thanks. Answer 1: Other useful packages are: http://cran.r-project.org/web/packages/gputools/gputools.pdf with gpuGlm and http://cran.r-project.org/web/packages/mgcv/mgcv.pdf
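glm itself has no cores parameter, so the packages above work around it rather than through it. As a generic illustration of the multi-core pattern (not glm-specific; fit_chunk here is a hypothetical stand-in for one expensive fit, such as one model in a batch), the Python analogue looks like this:

```python
from multiprocessing import get_context

import numpy as np

def fit_chunk(seed):
    """Hypothetical stand-in for one expensive model fit."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=10_000)
    return float(x.mean())

# "fork" keeps this sketch self-contained on Unix-like systems; each
# chunk runs in its own process, so the fits use multiple cores.
with get_context("fork").Pool(processes=2) as pool:
    results = pool.map(fit_chunk, range(4))

print(len(results))
```

The pattern pays off when you have many independent fits (cross-validation folds, bootstrap replicates); a single glm fit is harder to parallelize, which is why the R answers point to specialized packages instead.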

Controlling the threshold in Logistic Regression in Scikit Learn

Submitted by ∥☆過路亽.° on 2019-12-18 02:20:12
Question: I am using the LogisticRegression() method in scikit-learn on a highly unbalanced data set, and I have even set the class_weight parameter to auto. I know that in logistic regression it should be possible to know the threshold value for a particular pair of classes. Is it possible to know what the threshold value is in each of the one-vs-all classifiers that the LogisticRegression() method builds? I did not find anything in the documentation page. Does it by default apply the 0.5 value as
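There is no threshold attribute to read off: predict() takes the argmax over predict_proba(), which in the binary case is equivalent to a fixed 0.5 cut-off. For an unbalanced data set, the practical answer is to threshold the probabilities yourself; the probability values below are made up:

```python
import numpy as np

# Hypothetical positive-class probabilities, as from predict_proba(X)[:, 1]
proba = np.array([0.10, 0.45, 0.55, 0.92])

# predict() is equivalent to thresholding at 0.5 in the binary case...
default_labels = (proba >= 0.5).astype(int)

# ...but on an unbalanced data set you can pick your own cut-off:
custom_labels = (proba >= 0.3).astype(int)

print(default_labels, custom_labels)
```

A cut-off chosen on a validation set (e.g. to maximize F1 or recall) often beats the implicit 0.5 when the classes are skewed.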

Comparison of R, statsmodels, and sklearn for a classification task with logistic regression

Submitted by 亡梦爱人 on 2019-12-18 01:14:11
Question: I have run some experiments with logistic regression in R, Python statsmodels, and sklearn. While the results given by R and statsmodels agree, there is some discrepancy with what is returned by sklearn. I would like to understand why these results are different. I understand that it is probably not the same optimization algorithms being used under the hood. Specifically, I use the standard Default dataset (used in the ISL book). The following Python code reads the data into a dataframe
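A likely culprit, worth checking before digging into optimizers: sklearn's LogisticRegression applies an L2 penalty by default (C=1.0), while R's glm and statsmodels fit a plain, unpenalized MLE. A sketch on synthetic data; pushing C very high makes the penalty negligible, which should bring sklearn's coefficients close to the R/statsmodels answers:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))
y = (X[:, 0] - X[:, 1] + rng.normal(size=300) > 0).astype(int)

# sklearn penalizes by default (C=1.0); R's glm / statsmodels do not.
regularized = LogisticRegression(C=1.0).fit(X, y)

# A very large C makes the penalty negligible, approximating the plain
# MLE that glm and statsmodels compute.
near_mle = LogisticRegression(C=1e6).fit(X, y)

print(np.abs(regularized.coef_).sum(), np.abs(near_mle.coef_).sum())
```

The default fit's coefficients are visibly shrunk relative to the near-MLE fit, which is exactly the kind of systematic discrepancy the question describes.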

ValueError: Unknown label type: 'unknown'

Submitted by China☆狼群 on 2019-12-17 22:23:00
Question: I am trying to run the following code. By the way, I am new to both Python and sklearn. import pandas as pd import numpy as np from sklearn.linear_model import LogisticRegression # data import and preparation trainData = pd.read_csv('train.csv') train = trainData.values testData = pd.read_csv('test.csv') test = testData.values X = np.c_[train[:, 0], train[:, 2], train[:, 6:7], train[:, 9]] X = np.nan_to_num(X) y = train[:, 1] Xtest = np.c_[test[:, 0:1], test[:, 5:6], test[:, 8]] Xtest = np.nan_to_num(Xtest) #
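The error most likely comes from y = train[:, 1]: slicing a mixed-type .values array leaves y with dtype=object, which sklearn reports as "Unknown label type: 'unknown'". Casting the labels to int fixes it; the four-row data below is a made-up stand-in for the CSV:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Slicing a mixed-type array (as train = trainData.values does) leaves
# y with dtype=object, which triggers "Unknown label type: 'unknown'".
y = np.array([0, 1, 1, 0], dtype=object)
X = [[0.0], [1.0], [2.0], [3.0]]

model = LogisticRegression().fit(X, y.astype(int))   # the cast fixes the error
print(model.classes_)
```

In the original code the one-line fix would be y = train[:, 1].astype(int) (or .astype('float') for non-integer labels).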