Question
I have started to use scikit-learn for text feature extraction. When I use the standard CountVectorizer and TfidfTransformer in a pipeline, and try to combine them with new features (a concatenation of matrices), I get a row-dimension problem.
This is my pipeline:
pipeline = Pipeline([
    ('feats', FeatureUnion([
        ('ngram_tfidf', Pipeline([
            ('vect', CountVectorizer()),
            ('tfidf', TfidfTransformer())])),
        ('addned', AddNed()),
    ])),
    ('clf', SGDClassifier()),
])
This is my class AddNed, which adds 30 new features to each document (sample).
class AddNed(BaseEstimator, TransformerMixin):
    def __init__(self):
        pass

    def transform(self, X, **transform_params):
        do_something
        x_new_feat = np.array(list_feat)
        print(type(X))
        X_np = np.array(X)
        print(X_np.shape, x_new_feat.shape)
        return np.concatenate((X_np, x_new_feat), axis=1)

    def fit(self, X, y=None):
        return self
And the first part of my main program:
data = load_files('HO_without_tag')
grid_search = GridSearchCV(pipeline, parameters, n_jobs = 1, verbose = 20)
print(len(data.data), len(data.target))
grid_search.fit(X, Y).transform(X)
But I get this result:
486 486
Fitting 3 folds for each of 3456 candidates, totalling 10368 fits
[CV]feats__ngram_tfidf__vect__max_features=3000....
323
<class 'list'>
(323,) (486, 30)
And of course an IndexError exception:
return np.concatenate((X_np, x_new_feat), axis = 1)
IndexError: axis 1 out of bounds [0, 1)
When I receive the parameter X in the transform function (class AddNed), why don't I get a numpy array of shape (486, 3000) for X? I only get shape (323,). I don't understand, because if I remove the FeatureUnion and the AddNed() pipeline, CountVectorizer and TfidfTransformer work properly, with the right features and the right shape. Does anyone have an idea? Thanks a lot.
Answer 1:
OK, I will try to give more explanation. When I said do_something, I meant do_nothing with X. In the class AddNed, if I rewrite:
def transform(self, X, **transform_params):
    print(X.shape)    # print X's shape first, before doing anything
    print(type(X))    # for information
    do_nothing_withX  # construct a new matrix with shape (n_samples, 30 new features)
    x_new_feat = np.array(list_feat)  # get my new matrix as a numpy array
    print(x_new_feat.shape)
    return x_new_feat
In the transform case above, I do not concatenate the X matrix and the new matrix; I presume FeatureUnion does that. And my result:
486 486 # here it is a print of len(data.data), len(data.target)
Fitting 3 folds for each of 3456 candidates, totalling 10368 fits
[CV] clf__alpha=1e-05, vect__max_df=0.1, clf__penalty=l2, feats__tfidf__use_idf=True, feats__tfidf__norm=l1, clf__loss=hinge, vect__ngram_range=(1, 1), clf__n_iter=10, vect__max_features=3000
(323, 3000) # X shape Matrix
<class 'scipy.sparse.csr.csr_matrix'>
(486, 30) # My new matrix shape
Traceback (most recent call last):
File "pipe_line_learning_union.py", line 134, in <module>
grid_search.fit(X, Y).transform(X)
.....
File "/data/maclearnVE/lib/python3.4/site-packages/scipy/sparse/construct.py", line 581, in bmat
raise ValueError('blocks[%d,:] has incompatible row dimensions' % i)
ValueError: blocks[0,:] has incompatible row dimensions
To go further, just to see what happens, if I put a different cross-validation on GridSearchCV, just to modify the sample size:
grid_search = GridSearchCV(pipeline, parameters, cv=2, n_jobs = 1, verbose = 20)
I have this result:
486 486
Fitting 2 folds for each of 3456 candidates, totalling 6912 fits
[CV] ......
(242, 3000) # this is the new sample size due to cross-validation
<class 'scipy.sparse.csr.csr_matrix'>
(486, 30)
..........
ValueError: blocks[0,:] has incompatible row dimensions
Of course, if necessary, I can give all the code of do_nothing_withX. But what I don't understand is why the sample size in the CountVectorizer+TfidfTransformer pipeline is not equal to the number of files loaded with the sklearn.datasets.load_files() function.
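A note on that last question (my addition, assuming a modern scikit-learn): GridSearchCV fits each candidate pipeline on a cross-validation train fold, not on the full dataset, so every transformer's fit/transform only sees the fold's samples. With 486 documents and 3 folds, each train fold holds roughly two thirds of the data:

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(486).reshape(-1, 1)  # stand-in for the 486 documents
for train_idx, test_idx in KFold(n_splits=3).split(X):
    print(len(train_idx), len(test_idx))  # 324 162 on every fold
```

With stratified folds (the default for classifiers) the class proportions make the splits slightly uneven, which is where 323 comes from. Any transformer in the pipeline must therefore build its features from the X it receives, not from the full dataset.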
Answer 2:
You've probably solved it by now, but someone else may have the same problem:
(323, 3000) # X shape Matrix
<class 'scipy.sparse.csr.csr_matrix'>
AddNed tries to concatenate a dense matrix with a sparse matrix; the sparse matrix should be converted to a dense matrix first. I ran into the same error when trying to use the result of CountVectorizer.
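A short sketch of both fixes (my addition, not from the original answer): either densify the sparse block before np.concatenate, or keep everything sparse with scipy.sparse.hstack, which is usually the memory-friendly choice for tf-idf matrices. Either way, the row counts must match, which means the extra features have to be computed from the X the transformer receives:

```python
import numpy as np
import scipy.sparse as sp

X = sp.random(5, 10, density=0.3, format='csr')  # stand-in for a tf-idf matrix
new_feats = np.ones((X.shape[0], 3))             # extra features, one row per sample

# Option 1: densify the sparse block, then concatenate
dense = np.concatenate((X.toarray(), new_feats), axis=1)

# Option 2: stay sparse throughout
combined = sp.hstack([X, sp.csr_matrix(new_feats)], format='csr')

print(dense.shape, combined.shape)  # (5, 13) (5, 13)
```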
Source: https://stackoverflow.com/questions/38456377/featureunion-in-scikit-klearn-and-incompatible-row-dimension