Question
Originally, I read the data from a .csv file, but here I build the dataframe from lists so the problem can be reproduced. The aim is to train a logistic regression model with cross-validation using LogisticRegressionCV.
import pandas as pd

indeps = ['M', 'F', 'M', 'F', 'M', 'M', 'F', 'M', 'M', 'F', 'F', 'F', 'F', 'F', 'M', 'F', 'F', 'F', 'F', 'F', 'M', 'F', 'F', 'M', 'M', 'F', 'F', 'F', 'M', 'F', 'F', 'F', 'M', 'F', 'M', 'F', 'F', 'F', 'M', 'M', 'M', 'F', 'M', 'M', 'M', 'F', 'M', 'M', 'F', 'F']
dep = [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
data = [indeps, dep]
cols = ['state', 'cat_bins']
data_dict = dict((x[0], x[1]) for x in zip(cols, data))
df = pd.DataFrame.from_dict(data_dict)
df.tail()
cat_bins state
45 0.0 F
46 0.0 M
47 0.0 M
48 0.0 F
49 0.0 F
'''Use pandas' get_dummies to encode the independent variables. Notice that
we are returning a sparse dataframe.'''
def heat_it2(dataframe, lst_of_columns):
    dataframe_hot = pd.get_dummies(dataframe,
                                   prefix=lst_of_columns,
                                   columns=lst_of_columns, sparse=True)
    return dataframe_hot
train_set_hot = heat_it2(df, ['state'])
train_set_hot.head(2)
cat_bins state_F state_M
0 1.0 0 1
1 1.0 1 0
'''Use the dataframe to set up the prospective inputs to the model as numpy arrays'''
indeps_hot = ['state_F', 'state_M']
X = train_set_hot[indeps_hot].values
y = train_set_hot['cat_bins'].values
print 'X-type:', X.shape, type(X)
print 'y-type:', y.shape, type(y)
print 'X has shape, is an array and has length:\n', hasattr(X, 'shape'), hasattr(X, '__array__'), hasattr(X, '__len__')
print 'y has shape, is an array and has length:\n', hasattr(y, 'shape'), hasattr(y, '__array__'), hasattr(y, '__len__')
print 'X has attribute fit:\n', hasattr(X, 'fit')
print 'y has attribute fit:\n', hasattr(y, 'fit')
X-type: (50, 2) <type 'numpy.ndarray'>
y-type: (50,) <type 'numpy.ndarray'>
X has shape, is an array and has length:
True True True
y has shape, is an array and has length:
True True True
X has attribute fit:
False
y has attribute fit:
False
So, the inputs to the regressor seem to have the necessary properties for the .fit method. They are numpy arrays with the right shape: X is an array with dimensions [n_samples, n_features], and y is a vector with shape [n_samples,].
Here is the documentation:
fit(X, y, sample_weight=None)
    Fit the model according to the given training data.

    Parameters:
        X : {array-like, sparse matrix}, shape (n_samples, n_features)
            Training vector, where n_samples is the number of samples and
            n_features is the number of features.
        y : array-like, shape (n_samples,)
            Target vector relative to X.
....
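As an extra sanity check (this step is not part of the original post), a plain LogisticRegression, which involves no custom scoring argument, accepts exactly these arrays, which suggests X and y themselves satisfy the documented contract:

from sklearn.linear_model import LogisticRegression

sanity_model = LogisticRegression()   # defaults only; no scoring argument involved
sanity_model.fit(X, y)                # completes without raising
print 'training accuracy:', sanity_model.score(X, y)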
Now we try to fit the regressor:
from sklearn.linear_model import LogisticRegressionCV
from sklearn.metrics import accuracy_score

logmodel = LogisticRegressionCV(Cs=1, dual=False, scoring=accuracy_score, penalty='l2')
logmodel.fit(X, y)
...
TypeError: Expected sequence or array-like, got estimator LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
verbose=0, warm_start=False)
The source of the error message seems to be in scikit-learn's validation.py module. The only section of that code that raises this error message is the following function snippet:
def _num_samples(x):
    """Return number of samples in array-like x."""
    if hasattr(x, 'fit'):
        # Don't get num_samples from an ensembles length!
        raise TypeError('Expected sequence or array-like, got '
                        'estimator %s' % x)
    etc.
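In other words, the guard rejects estimators that were passed where data was expected. A small illustration (not from the original question) of what that hasattr test sees for an array versus an estimator:

import numpy as np
from sklearn.linear_model import LogisticRegression

print hasattr(np.zeros((50, 2)), 'fit')     # False -> treated as array-like data
print hasattr(LogisticRegression(), 'fit')  # True  -> _num_samples raises the TypeError above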
Question: Since the parameters with which we are fitting the model (X and y) do not have the attribute 'fit', why is this error message raised?
Using Python 2.7 on Canopy 1.7.4.3348 (64-bit) with scikit-learn 18.01-3 and pandas 0.19.2-2.
Thank you for your help :)
Answer 1:
The problem seems to be in the scoring argument. You have passed accuracy_score. The signature of accuracy_score is accuracy_score(y_true, y_pred[, ...]). But in the module logistic.py:
if isinstance(scoring, six.string_types):
    scoring = SCORERS[scoring]
for w in coefs:
    # Other code
    if scoring is None:
        scores.append(log_reg.score(X_test, y_test))
    else:
        scores.append(scoring(log_reg, X_test, y_test))
Since you have passed accuracy_score directly (a callable, not a string), the first line above does not convert it, and scores.append(scoring(log_reg, X_test, y_test)) is what scores the estimator. But, as I said above, those arguments don't match the required arguments of accuracy_score: the fitted estimator log_reg lands in the y_true position, and when accuracy_score validates its inputs, _num_samples(log_reg) trips the hasattr(x, 'fit') check shown earlier. Hence the error.
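The difference between the two call conventions can be shown directly. The snippet below is illustrative and not part of the original answer (it reuses X and y from the question): a metric follows accuracy_score(y_true, y_pred), while the object LogisticRegressionCV actually calls follows scorer(estimator, X, y), which is what make_scorer produces.

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, make_scorer

clf = LogisticRegression().fit(X, y)

# Metric convention: accuracy_score(y_true, y_pred)
print accuracy_score(y, clf.predict(X))

# Scorer convention: scorer(estimator, X, y) -- this is what LogisticRegressionCV calls
scorer = make_scorer(accuracy_score)
print scorer(clf, X, y)

# Passing the bare metric where a scorer is expected puts the estimator into y_true;
# uncommenting the next line reproduces the TypeError from the question:
# accuracy_score(clf, X, y)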
Workaround: use make_scorer(accuracy_score) as the scoring argument of LogisticRegressionCV, or simply pass the string 'accuracy':
logmodel = LogisticRegressionCV(Cs=1, dual=False,
                                scoring=make_scorer(accuracy_score),
                                penalty='l2')
OR
logmodel = LogisticRegressionCV(Cs=1, dual=False,
                                scoring='accuracy',
                                penalty='l2')
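For completeness, here is what a full corrected run on the data from the question could look like (imports spelled out; X and y as built above):

from sklearn.linear_model import LogisticRegressionCV
from sklearn.metrics import accuracy_score, make_scorer

logmodel = LogisticRegressionCV(Cs=1, dual=False,
                                scoring=make_scorer(accuracy_score),
                                penalty='l2')
logmodel.fit(X, y)
print logmodel.scores_   # dict keyed by class label: per-fold cross-validated accuracies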
Note:
This may be a bug in the logistic.py module, or in the documentation of LogisticRegressionCV; they should have clarified the expected signature of the scoring function. You may submit an issue on GitHub and see how it goes. (Done.)
Source: https://stackoverflow.com/questions/42151921/array-like-input-for-sklearn-logisticregressioncv