Sklearn - Cannot use encoded data in Random forest classifier

Anonymous (unverified), submitted 2019-12-03 02:38:01

Question:

I'm new to scikit-learn. I'm trying to use preprocessing.OneHotEncoder to encode my training and test data. After encoding, I tried to train a random forest classifier on that data, but I get the following error when fitting (error trace below):

         99         model.fit(X_train, y_train)
        100         preds = model.predict_proba(X_cv)[:, 1]
        101

    C:\Python27\lib\site-packages\sklearn\ensemble\forest.pyc in fit(self, X, y, sample_weight)
        288
        289         # Precompute some data
    --> 290         X, y = check_arrays(X, y, sparse_format="dense")
        291         if (getattr(X, "dtype", None) != DTYPE or
        292                 X.ndim != 2 or

    C:\Python27\lib\site-packages\sklearn\utils\validation.pyc in check_arrays(*arrays, **options)
        200                     array = array.tocsc()
        201                 elif sparse_format == 'dense':
    --> 202                     raise TypeError('A sparse matrix was passed, but dense '
        203                                     'data is required. Use X.toarray() to '
        204                                     'convert to a dense numpy array.')

    TypeError: A sparse matrix was passed, but dense data is required. Use X.toarray() to convert to a dense numpy array.

I tried to convert the sparse matrix to a dense one using X.toarray() and X.todense(), but when I do that I get the following error trace:

         99         model.fit(X_train.toarray(), y_train)
        100         preds = model.predict_proba(X_cv)[:, 1]
        101

    C:\Python27\lib\site-packages\scipy\sparse\compressed.pyc in toarray(self)
        548
        549     def toarray(self):
    --> 550         return self.tocoo(copy=False).toarray()
        551
        552     ##############################################################

    C:\Python27\lib\site-packages\scipy\sparse\coo.pyc in toarray(self)
        236
        237     def toarray(self):
    --> 238         B = np.zeros(self.shape, dtype=self.dtype)
        239         M,N = self.shape
        240         coo_todense(M, N, self.nnz, self.row, self.col, self.data, B.ravel())

    ValueError: array is too big.

Can anyone help me fix this?

Thank you

Answer 1:

The sklearn random forests in the version shown in your traceback do not work on sparse input, and your dataset shape is too large and too sparse for a dense version to fit in memory.
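To see why the dense conversion fails, it can help to estimate the size of the dense one-hot matrix before calling X.toarray(). The sketch below is plain Python and only illustrates the arithmetic; the row count and cardinalities are made-up numbers, the helper name dense_size_bytes is hypothetical, and the 8-byte item size assumes a float64 matrix.

```python
def dense_size_bytes(n_rows, cardinalities, itemsize=8):
    """Estimate the bytes needed for a dense one-hot matrix.

    One-hot encoding creates one output column per distinct value,
    so the matrix width is the sum of the per-feature cardinalities.
    """
    n_columns = sum(cardinalities)
    return n_rows * n_columns * itemsize


# Hypothetical dataset: 30,000 rows and three categorical features,
# one of which is close to unique per row (e.g. an id column).
size = dense_size_bytes(30000, [5, 120, 25000])
print("%.1f GiB" % (size / 2.0 ** 30))
```

A single near-unique feature dominates the total, which is why the sparse matrix is small but its dense form is not.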

You probably have some categorical features with much too large a cardinality (for instance a free-text field or unique entry ids). Try dropping those features and starting over.
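One way to find such features is to count the distinct values in each column before encoding and keep only the columns below some limit. This is a minimal sketch in plain Python, not part of scikit-learn; the helper names, the sample rows, and the threshold of 3 are all assumptions chosen for illustration.

```python
def column_cardinalities(rows):
    """Return the number of distinct values in each column of `rows`."""
    columns = zip(*rows)  # transpose: rows -> columns
    return [len(set(column)) for column in columns]


def low_cardinality_columns(rows, max_cardinality):
    """Indices of columns with at most `max_cardinality` distinct values."""
    return [
        index
        for index, count in enumerate(column_cardinalities(rows))
        if count <= max_cardinality
    ]


# Hypothetical rows: (entry id, colour, size). The id is unique per row.
rows = [
    (101, "red", "S"),
    (102, "blue", "M"),
    (103, "red", "L"),
    (104, "green", "S"),
]

# Keep only the columns whose cardinality is small enough to encode.
keep = low_cardinality_columns(rows, max_cardinality=3)
reduced = [tuple(row[i] for i in keep) for row in rows]
```

Here the id column has one distinct value per row and is dropped, while the two genuinely categorical columns are kept. On real data the threshold would need to be chosen relative to the number of rows and the available memory.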


