Resampling data: using SMOTE from imblearn with 3D NumPy arrays

别跟我提以往 2021-01-25 06:02

I want to resample my dataset. It consists of categorical data that has been transformed, with labels for 3 classes. The number of samples per class is:

  • counts of class A: 6945
3 Answers
  • 2021-01-25 06:28

    Here is an example with a dummy 3D array; the 2D shape it is flattened to is my own assumption:

    import numpy as np

    arr = np.random.rand(160, 10, 25)
    orig_shape = arr.shape
    print(orig_shape)
    

    Output: (160, 10, 25)

    # Flatten to 2D: 160 * 25 = 4000 rows, 10 columns
    arr = np.reshape(arr, (arr.shape[0] * arr.shape[2], arr.shape[1]))
    print(arr.shape)
    

    Output: (4000, 10)

    # Restore the original 3D shape
    arr = np.reshape(arr, orig_shape)
    print(arr.shape)
    

    Output: (160, 10, 25)
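
A possible variation, not taken from the answer above: if the 2D array is meant for a resampler that expects one row per sample, it may be safer to keep the sample axis first and fold the other two axes into the columns. The sketch below assumes the same dummy shape; the names `n_samples`, `n_steps`, `n_features`, `flat` and `restored` are illustrative only. Using `-1` for the first dimension when reshaping back allows for a resampler having changed the number of rows.

```python
import numpy as np

# Dummy data with the same shape as above: 160 samples, 10 steps, 25 features.
arr = np.random.rand(160, 10, 25)
n_samples, n_steps, n_features = arr.shape

# One row per sample: every (step, feature) value becomes a column.
flat = arr.reshape(n_samples, n_steps * n_features)
print(flat.shape)  # (160, 250)

# Reshape back with the saved dimensions; -1 infers the number of samples.
restored = flat.reshape(-1, n_steps, n_features)
print(restored.shape)                 # (160, 10, 25)
print(np.array_equal(arr, restored))  # True
```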

  • 2021-01-25 06:40

    I flatten the data into a 2D array with one row per point, apply SMOTE, and then reshape the result back into a 3D array. My script is below; if anything is unclear, please leave a comment.

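    from collections import Counter

    import numpy as np
    from imblearn.over_sampling import SMOTE

    # train_dataset and test_dataset are iterables of (x, y) pairs
    x_train, y_train = zip(*train_dataset)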
    x_test, y_test = zip(*test_dataset)
    
    dim_1 = np.array(x_train).shape[0]
    dim_2 = np.array(x_train).shape[1]
    dim_3 = np.array(x_train).shape[2]
    
    new_dim = dim_1 * dim_2
    
    new_x_train = np.array(x_train).reshape(new_dim, dim_3)
    
    
    new_y_train = []
    for i in range(len(y_train)):
        # print(y_train[i])
        new_y_train.extend([y_train[i]]*dim_2)
    
    new_y_train = np.array(new_y_train)
    
    # transform the dataset
    oversample = SMOTE()
    # fit_resample replaces fit_sample, which newer imbalanced-learn releases no longer provide
    X_Train, Y_Train = oversample.fit_resample(new_x_train, new_y_train)
    # summarize the new class distribution
    counter = Counter(Y_Train)
    print('The number of samples in TRAIN: ', counter)
    
    
    
    x_train_SMOTE = X_Train.reshape(int(X_Train.shape[0]/dim_2), dim_2, dim_3)
    
    y_train_SMOTE = []
    for i in range(int(X_Train.shape[0]/dim_2)):
        # print(i)
        value_list = list(Y_Train.reshape(int(X_Train.shape[0]/dim_2), dim_2)[i])
        # print(list(set(value_list)))
        y_train_SMOTE.extend(list(set(value_list)))
        ## Check: if there is any different value in a list 
        if len(set(value_list)) != 1:
            print('\n\n********* STOP: THERE IS SOMETHING WRONG IN TRAIN ******\n\n')
        
    
    
    dim_1 = np.array(x_test).shape[0]
    dim_2 = np.array(x_test).shape[1]
    dim_3 = np.array(x_test).shape[2]
    
    new_dim = dim_1 * dim_2
    
    new_x_test = np.array(x_test).reshape(new_dim, dim_3)
    
    
    new_y_test = []
    for i in range(len(y_test)):
        # print(y_test[i])
        new_y_test.extend([y_test[i]]*dim_2)
    
    new_y_test = np.array(new_y_test)
    
    # transform the dataset
    oversample = SMOTE()
    # fit_resample replaces fit_sample, which newer imbalanced-learn releases no longer provide
    X_Test, Y_Test = oversample.fit_resample(new_x_test, new_y_test)
    # summarize the new class distribution
    counter = Counter(Y_Test)
    print('The number of samples in TEST: ', counter)
    
    
    
    x_test_SMOTE = X_Test.reshape(int(X_Test.shape[0]/dim_2), dim_2, dim_3)
    
    y_test_SMOTE = []
    for i in range(int(X_Test.shape[0]/dim_2)):
        # print(i)
        value_list = list(Y_Test.reshape(int(X_Test.shape[0]/dim_2), dim_2)[i])
        # print(list(set(value_list)))
        y_test_SMOTE.extend(list(set(value_list)))
        ## Check: if there is any different value in a list 
        if len(set(value_list)) != 1:
            print('\n\n********* STOP: THERE IS SOMETHING WRONG IN TEST ******\n\n')
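
The label handling in the script above (repeat each label once per time step, then collapse each group back to one label and check that the group is uniform) can also be sketched with vectorized NumPy calls. This is only an illustration on hypothetical toy labels; `y`, `n_steps`, `y_flat`, `y_rows` and `y_back` are made-up names, and the resampling step itself is left out.

```python
import numpy as np

# Hypothetical toy labels: 4 sequences, 3 time steps each.
y = np.array([0, 1, 1, 2])
n_steps = 3

# Same result as the extend() loop: each label repeated once per time step.
y_flat = np.repeat(y, n_steps)
print(y_flat)  # [0 0 0 1 1 1 1 1 1 2 2 2]

# Collapse back: each row of the (n_sequences, n_steps) view must hold one label.
y_rows = y_flat.reshape(-1, n_steps)
assert (y_rows == y_rows[:, :1]).all(), "mixed labels inside one sequence"
y_back = y_rows[:, 0]
print(y_back)  # [0 1 1 2]
```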
    
  • 2021-01-25 06:41

    from imblearn.over_sampling import RandomOverSampler
    import numpy as np

    oversample = RandomOverSampler(sampling_strategy='minority')
    

    X can be time-stepped 3D data indexed as X[sample, time, feature], with one binary label in y per sample. For example, the sequence (1,1),(2,1),(3,1) has label 1:

    X = np.array([[[1,1],[2,1],[3,1]],
                 [[2,1],[3,1],[4,1]],
                 [[5,1],[6,1],[7,1]],
                 [[8,1],[9,1],[10,1]],
                 [[11,1],[12,1],[13,1]]
                 ])
    
    y = np.array([1,0,1,1,0])
    

    The oversampler cannot be fitted on 3D X values directly, and if you pass it a 2D slice you get 2D data back:

    Xo,yo = oversample.fit_resample(X[:,:,0], y)
    Xo:
    [[ 1  2  3]
     [ 2  3  4]
     [ 5  6  7]
     [ 8  9 10]
     [11 12 13]
     [ 2  3  4]]
    
    yo:
    [1 0 1 1 0 0]
    

    However, fitting on a 2D slice such as X[:, :, 0] also stores the indices of the selected samples in sample_indices_, and those indices are enough to build the oversampled 3D data:

    oversample.fit_resample(X[:,:,0], y)
    Xo = X[oversample.sample_indices_]
    yo = y[oversample.sample_indices_]
    
    Xo:
    [[[ 1  1][ 2  1][ 3  1]]
     [[ 2  1][ 3  1][ 4  1]]
     [[ 5  1][ 6  1][ 7  1]]
     [[ 8  1][ 9  1][10  1]]
     [[11  1][12  1][13  1]]
     [[ 2  1][ 3  1][ 4  1]]]
    yo:
    [1 0 1 1 0 0]
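
The index-based idea above does not depend on the feature values at all, so it can be sketched with NumPy alone. The block below is a manual stand-in for random minority oversampling, not the imblearn implementation; `oversample_minority_indices` and its `seed` parameter are hypothetical names introduced for illustration. Which minority sample gets duplicated depends on the seed.

```python
import numpy as np

def oversample_minority_indices(y, seed=0):
    """Return row indices that duplicate minority samples up to the majority count."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    minority = classes[np.argmin(counts)]
    n_extra = counts.max() - counts.min()
    minority_idx = np.flatnonzero(y == minority)
    # Draw the extra rows with replacement from the minority class only.
    extra = rng.choice(minority_idx, size=n_extra, replace=True)
    return np.concatenate([np.arange(len(y)), extra])

X = np.array([[[1, 1], [2, 1], [3, 1]],
              [[2, 1], [3, 1], [4, 1]],
              [[5, 1], [6, 1], [7, 1]],
              [[8, 1], [9, 1], [10, 1]],
              [[11, 1], [12, 1], [13, 1]]])
y = np.array([1, 0, 1, 1, 0])

idx = oversample_minority_indices(y)
Xo, yo = X[idx], y[idx]  # indexing the first axis keeps each sample 3D-intact

print(Xo.shape)         # (6, 3, 2)
print(np.bincount(yo))  # [3 3]
```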
    