Question
This is a question about scikit-learn (version 0.17.0) in Python 2.7 along with Pandas 0.17.1. After splitting raw data (with no missing entries) using the approach detailed here, I have found that passing the split data to .fit() raises an error.
Here is the code, taken largely unchanged from the other Stack Overflow question, with the variables renamed. I have then instantiated a grid and tried to fit the split data, with the aim of determining the optimal classifier parameters. The error occurs at the last line of the code below:
import pandas as pd
import numpy as np
# UCI's wine dataset
wine = pd.read_csv("https://s3.amazonaws.com/demo-datasets/wine.csv")
# separate target variable from dataset
y = wine['quality']
X = wine.drop(['quality','color'],axis = 1)
# Stratified Split of train and test data
from sklearn.cross_validation import StratifiedShuffleSplit
sss = StratifiedShuffleSplit(y, n_iter=3, test_size=0.2)
# Split dataset to obtain indices for train and test set
for train_index, test_index in sss:
    xtrain, xtest = X.iloc[train_index], X.iloc[test_index]
    ytrain, ytest = y[train_index], y[test_index]
# Pick some classifier here
from sklearn.tree import DecisionTreeClassifier
decision_tree = DecisionTreeClassifier()
from sklearn.grid_search import GridSearchCV
# Instantiate grid
grid = GridSearchCV(decision_tree, param_grid={'max_depth':np.arange(1,3)}, cv=sss, scoring='accuracy')
# this line causes the error message
grid.fit(xtrain,ytrain)
Here is the error message produced by the above code:
Traceback (most recent call last):
File "C:\Python27\test.py", line 23, in <module>
grid.fit(xtrain,ytrain)
File "C:\Python27\lib\site-packages\sklearn\grid_search.py", line 804, in fit
return self._fit(X, y, ParameterGrid(self.param_grid))
File "C:\Python27\lib\site-packages\sklearn\grid_search.py", line 553, in _fit
for parameters in parameter_iterable
File "C:\Python27\lib\site-packages\sklearn\externals\joblib\parallel.py", line 800, in __call__
while self.dispatch_one_batch(iterator):
File "C:\Python27\lib\site-packages\sklearn\externals\joblib\parallel.py", line 658, in dispatch_one_batch
self._dispatch(tasks)
File "C:\Python27\lib\site-packages\sklearn\externals\joblib\parallel.py", line 566, in _dispatch
job = ImmediateComputeBatch(batch)
File "C:\Python27\lib\site-packages\sklearn\externals\joblib\parallel.py", line 180, in __init__
self.results = batch()
File "C:\Python27\lib\site-packages\sklearn\externals\joblib\parallel.py", line 72, in __call__
return [func(*args, **kwargs) for func, args, kwargs in self.items]
File "C:\Python27\lib\site-packages\sklearn\cross_validation.py", line 1524, in _fit_and_score
X_train, y_train = _safe_split(estimator, X, y, train)
File "C:\Python27\lib\site-packages\sklearn\cross_validation.py", line 1591, in _safe_split
X_subset = safe_indexing(X, indices)
File "C:\Python27\lib\site-packages\sklearn\utils\__init__.py", line 152, in safe_indexing
return X.iloc[indices]
File "C:\Python27\lib\site-packages\pandas\core\indexing.py", line 1227, in __getitem__
return self._getitem_axis(key, axis=0)
File "C:\Python27\lib\site-packages\pandas\core\indexing.py", line 1504, in _getitem_axis
self._is_valid_list_like(key, axis)
File "C:\Python27\lib\site-packages\pandas\core\indexing.py", line 1443, in _is_valid_list_like
raise IndexError("positional indexers are out-of-bounds")
IndexError: positional indexers are out-of-bounds
NOTE:
It is important to me to keep X and y as Pandas data structures, similar to the second approach presented in the other Stack Overflow question above, i.e. I do not want to use X.values and y.values.
Question:
Using the raw data as Pandas data structures (a DataFrame for X and a Series for y), is there a way to run grid.fit() without getting this error message?
Answer 1:
You should pass X and y directly to fit(), like
grid.fit(X, y)
and GridSearchCV will take care of this step for you:
xtrain, xtest = X.iloc[train_index], X.iloc[test_index]
ytrain, ytest = y[train_index], y[test_index]
The StratifiedShuffleSplit instance, when iterated over, yields pairs of train/test split indices:
>>> list(sss)
[(array([2531, 4996, 4998, ..., 3205, 2717, 4983]), array([5942, 893, 1702, ..., 6340, 4806, 2537])),
(array([1888, 2332, 6276, ..., 1674, 775, 3705]), array([3404, 3304, 4741, ..., 4397, 3646, 1410])),
(array([1517, 3759, 4402, ..., 5098, 4619, 4521]), array([1110, 4076, 1280, ..., 6384, 1294, 1132]))]
GridSearchCV will use these indices to split the training samples. There is no need for you to do it manually.
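As a rough sketch of the same idea on newer scikit-learn releases: the sklearn.cross_validation and sklearn.grid_search modules used in the question were later replaced by sklearn.model_selection, where the splitter takes n_splits instead of receiving y in its constructor. The synthetic data, column names and random_state values below are assumptions for illustration only; they stand in for the wine dataset and are not part of the original question.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedShuffleSplit
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the wine data (assumption, not the original dataset).
features, labels = make_classification(n_samples=200, n_features=5, random_state=0)
X = pd.DataFrame(features, columns=["f%d" % i for i in range(5)])
y = pd.Series(labels, name="quality")

# The splitter only describes how to split; it is not iterated by hand.
sss = StratifiedShuffleSplit(n_splits=3, test_size=0.2, random_state=0)

grid = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"max_depth": [1, 2]},
    cv=sss,
    scoring="accuracy",
)

# Pass the full DataFrame and Series; GridSearchCV applies the splits itself.
grid.fit(X, y)
best_depth = grid.best_params_["max_depth"]
```

X and y stay as Pandas objects throughout, which is what the question asked for.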
The error occurs because you are feeding xtrain and ytrain (one of the train/test splits) into the cross-validator. The cross-validator tries to access items which exist in the full dataset but not in the train/test split, which raises an IndexError.
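The mechanism can be reproduced in isolation with a few lines of Pandas. This is a minimal sketch with made-up sizes (10 rows in the full frame, 5 in the subset), assuming only that positional indices computed for the full data are applied to a smaller subset, as happens when xtrain is passed to grid.fit() while cv=sss was built from the full y.

```python
import numpy as np
import pandas as pd

# "full" plays the role of X; "subset" plays the role of xtrain.
full = pd.DataFrame({"a": np.arange(10)})
subset = full.iloc[[0, 2, 4, 6, 8]]  # only 5 rows remain

# Positions generated for the full data: 9 is valid there, but not in subset.
indices_for_full = np.array([1, 3, 9])

try:
    subset.iloc[indices_for_full]
    raised = False
except IndexError:
    # Same exception type as in the traceback from the question.
    raised = True

# The same positions are fine on the full frame.
picked = full.iloc[indices_for_full]
```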
Source: https://stackoverflow.com/questions/35998112/sklearn-grid-fitx-y-error-positional-indexers-are-out-of-bounds-for-x-tra