SMOTE is giving array size / ValueError for all-categorical dataset

Question

I am using SMOTE-NC for oversampling my categorical data. I have only 1 feature and 10500 samples.

When I run the code below, I get this error:

   ---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-151-a261c423a6d8> in <module>()
     16 print(X_new.shape) # (10500, 1)
     17 print(X_new)
---> 18 sm.fit_sample(X_new, Y_new)

~\AppData\Local\Continuum\Miniconda3\envs\data-science\lib\site-packages\imblearn\base.py in fit_resample(self, X, y)
     81         )
     82 
---> 83         output = self._fit_resample(X, y)
     84 
     85         y_ = (label_binarize(output[1], np.unique(y))

~\AppData\Local\Continuum\Miniconda3\envs\data-science\lib\site-packages\imblearn\over_sampling\_smote.py in _fit_resample(self, X, y)
    926 
    927         X_continuous = X[:, self.continuous_features_]
--> 928         X_continuous = check_array(X_continuous, accept_sparse=["csr", "csc"])
    929         X_minority = _safe_indexing(
    930             X_continuous, np.flatnonzero(y == class_minority)

~\AppData\Local\Continuum\Miniconda3\envs\data-science\lib\site-packages\sklearn\utils\validation.py in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, warn_on_dtype, estimator)
    592                              " a minimum of %d is required%s."
    593                              % (n_features, array.shape, ensure_min_features,
--> 594                                 context))
    595 
    596     if warn_on_dtype and dtype_orig is not None and array.dtype != dtype_orig:

ValueError: Found array with 0 feature(s) (shape=(10500, 0)) while a minimum of 1 is required.

Code:

import numpy as np
from imblearn.over_sampling import SMOTENC

sm = SMOTENC(random_state=27,categorical_features=[0,])

X_new = np.array(X_train.values.tolist())
Y_new = np.array(y_train.values.tolist())

print(X_new.shape) # (10500,)
print(Y_new.shape) # (10500,)

X_new = np.reshape(X_new, (-1, 1)) # SMOTE requires a 2-D array, hence reshaping X_new

print(X_new.shape) # (10500, 1)
print(X_new)
sm.fit_sample(X_new, Y_new)

If I understand correctly, the shape of X_new should be (n_samples, n_features), which is 10500 x 1. I am not sure why the ValueError reports it as shape=(10500, 0).

Can someone please help me here?

Answer 1:

I have reproduced your issue by adapting the example in the docs to use a single categorical feature:

from collections import Counter
from numpy.random import RandomState
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTENC

X, y = make_classification(n_classes=2, class_sep=2,
 weights=[0.1, 0.9], n_informative=1, n_redundant=0, flip_y=0,
 n_features=1, n_clusters_per_class=1, n_samples=1000, random_state=10)

# simulate the only column to be a categorical feature
X[:, 0] = RandomState(10).randint(0, 4, size=(1000))
X.shape
# (1000, 1)

sm = SMOTENC(random_state=42, categorical_features=[0,]) # same behavior with categorical_features=[0]

X_res, y_res = sm.fit_resample(X, y)

which gives the same error:

ValueError: Found array with 0 feature(s) (shape=(1000, 0)) while a minimum of 1 is required.

The reason is actually quite simple, but you have to dig a little into the original SMOTE paper; quoting from the relevant section:

While our SMOTE approach currently does not handle data sets with all nominal features, it was generalized to handle mixed datasets of continuous and nominal features. We call this approach Synthetic Minority Over-sampling TEchnique-Nominal Continuous [SMOTE-NC]. We tested this approach on the Adult dataset from the UCI repository. The SMOTE-NC algorithm is described below.

Median computation: Compute the median of standard deviations of all continuous features for the minority class. If the nominal features differ between a sample and its potential nearest neighbors, then this median is included in the Euclidean distance computation. We use median to penalize the difference of nominal features by an amount that is related to the typical difference in continuous feature values.

Nearest neighbor computation: Compute the Euclidean distance between the feature vector for which k-nearest neighbors are being identified (minority class sample) and the other feature vectors (minority class samples) using the continuous feature space. For every differing nominal feature between the considered feature vector and its potential nearest-neighbor, include the median of the standard deviations previously computed, in the Euclidean distance computation.

In other words, and although not stated explicitly, it is apparent that, in order for the algorithm to work, it needs at least one continuous feature. This is not the case here, so the algorithm rather unsurprisingly fails.
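
To make the dependence on continuous features concrete, here is a minimal NumPy sketch of the two quoted steps as I read them. It is not imblearn's implementation; the function names median_of_stds and smote_nc_distance are made up for illustration, and the choice of population standard deviation (ddof=0) is an assumption, since the quoted passage does not specify it.

```python
import numpy as np

def median_of_stds(X_minority_continuous):
    """Step 1 (sketch): median of the per-feature standard deviations.

    Expects a 2-D array holding only the continuous columns of the
    minority class. ddof=0 is an assumption of this sketch.
    """
    stds = np.std(X_minority_continuous, axis=0)
    if stds.size == 0:
        # With no continuous columns there is nothing to take a median of.
        raise ValueError("SMOTE-NC needs at least one continuous feature")
    return float(np.median(stds))

def smote_nc_distance(a_cont, b_cont, a_nom, b_nom, median_std):
    """Step 2 (sketch): Euclidean distance over the continuous features,
    with median_std**2 added under the root for each differing nominal feature."""
    continuous_part = np.sum((np.asarray(a_cont) - np.asarray(b_cont)) ** 2)
    n_differing = np.sum(np.asarray(a_nom) != np.asarray(b_nom))
    return float(np.sqrt(continuous_part + n_differing * median_std ** 2))

# Two samples with two continuous and three nominal features;
# two of the nominal features differ.
d = smote_nc_distance([4.0, 3.0], [6.0, 5.0],
                      ["A", "B", "C"], ["A", "D", "E"],
                      median_std=2.0)
print(d)  # 4.0
```

Both steps read from the continuous columns only. If every column is categorical, step 1 has no standard deviations to work with, which is the situation in the question.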

I guess that, internally, during step 1 (median computation), the algorithm temporarily removes all categorical features from the data; in doing so here, it is indeed left with an array of shape (1000, 0) (or (10500, 0) in your case), i.e. no data, hence the specific shape in the error message. The traceback in the question is consistent with this: the failing call is check_array on X[:, self.continuous_features_].
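
The zero-column shape is easy to reproduce with plain NumPy. The sketch below only mimics the column selection visible in the traceback (X[:, self.continuous_features_]); deriving the continuous indices with np.setdiff1d is my assumption about how such a selection could be built, not a quote of the library's code.

```python
import numpy as np

# A single-column dataset whose only column is declared categorical.
X = np.array([[0], [3], [1], [2]])
categorical_features = [0]

# Every column index that is not listed as categorical counts as continuous.
continuous_features = np.setdiff1d(np.arange(X.shape[1]), categorical_features)

# Selecting the continuous columns leaves the rows but no columns.
X_continuous = X[:, continuous_features]
print(X_continuous.shape)  # (4, 0)
```

An array with that shape is what check_array rejects with "Found array with 0 feature(s)".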

So there is no actual programming issue to fix here; what you are trying to do is simply impossible with the SMOTE-NC algorithm (notice that the initials NC in the algorithm name stand for Nominal-Continuous).

Source: https://stackoverflow.com/questions/61824892/smote-is-giving-array-size-valueerror-for-all-categorical-dataset

Tags

python

machine-learning

imbalanced-data

imblearn

smote