NumPy: How to randomly split/select a matrix into n different matrices

攒了一身酷 2021-02-05 16:00
  • I have a NumPy matrix with shape (4601, 58).
  • I want to split the matrix randomly by rows into three parts holding 60%, 20%, and 20% of the rows.
  • This is for machine learning.
4 Answers
  • 2021-02-05 16:20

    If you want to randomly select rows, you could just use random.sample from the standard Python library:

    import random
    
    n_rows = 4601  # your number of rows
    k = 920  # number of rows to sample; example value, choose your own
    choice = random.sample(range(n_rows), k)  # k distinct row indices
    

    random.sample samples without replacement, so you don't need to worry about repeated rows ending up in choice. Given a numpy array called matrix, you can select the rows by indexing it with the list of indices, like this: matrix[choice].

    Of course, k can equal the total number of rows in the population, in which case choice contains a random ordering of all your row indices. You can then partition choice as you please, if that's all you need.
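    As a minimal sketch of that last idea, the snippet below samples every row index once and cuts the resulting ordering into 60% / 20% / 20% parts. The stand-in matrix and the names first_cut, second_cut, part_a, part_b, and part_c are assumptions made for illustration, not part of the original answer.

```python
import random

import numpy as np

n_rows = 4601
# Stand-in data with the shape from the question; replace with your own matrix.
matrix = np.arange(n_rows * 58).reshape(n_rows, 58)

# Sampling all indices without replacement gives a random ordering of the rows.
choice = random.sample(range(n_rows), n_rows)

# Cut points for a 60% / 20% / 20% partition of the shuffled indices.
first_cut = int(n_rows * 0.6)
second_cut = int(n_rows * 0.8)

part_a = matrix[choice[:first_cut]]
part_b = matrix[choice[first_cut:second_cut]]
part_c = matrix[choice[second_cut:]]

print(part_a.shape, part_b.shape, part_c.shape)
```

    Because the percentages are truncated to whole rows, the last part absorbs any leftover row, so the three sizes always add up to the number of rows.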

  • 2021-02-05 16:20

    Since you need it for machine learning, here is a function I wrote:

    import numpy as np
    
    def split_random(matrix, percent_train=70, percent_test=15):
        """
        Splits matrix data into randomly ordered sets 
        grouped by provided percentages.
    
        Usage:
        rows = 100
        columns = 2
        matrix = np.random.rand(rows, columns)
        training, testing, validation = \
        split_random(matrix, percent_train=80, percent_test=10)
    
        percent_validation 10
        training (80, 2)
        testing (10, 2)
        validation (10, 2)
    
        Returns:
        - training_data: percent_train of the rows, e.g. 70%
        - testing_data: percent_test of the rows, e.g. 15%
        - validation_data: remainder from 100%, e.g. 15%
        Created by Uki D. Lucas on Feb. 4, 2017
        """
    
        percent_validation = 100 - percent_train - percent_test
    
        if percent_validation < 0:
            raise ValueError("The sum of the training and testing "
                             "percentages must be at most 100.")
        print("percent_validation", percent_validation)
    
        rows = matrix.shape[0]
        # Note: this shuffles the rows in place, so the caller's array is reordered too.
        np.random.shuffle(matrix)
    
        end_training = int(rows*percent_train/100)    
        end_testing = end_training + int((rows * percent_test/100))
    
        training = matrix[:end_training]
        testing = matrix[end_training:end_testing]
        validation = matrix[end_testing:]
        return training, testing, validation
    
    # TEST:
    rows = 100
    columns = 2
    matrix = np.random.rand(rows, columns)
    training, testing, validation = split_random(matrix, percent_train=80, percent_test=10) 
    
    print("training",training.shape)
    print("testing",testing.shape)
    print("validation",validation.shape)
    
    print(split_random.__doc__)
    
    Shapes printed by the test:

    training (80, 2)
    testing (10, 2)
    validation (10, 2)
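
    Since np.random.shuffle reorders the array it is given, here is a hedged sketch of a variant that leaves the input untouched by shuffling row indices instead. The name split_random_copy and the seed parameter are hypothetical additions for illustration, not part of the function above.

```python
import numpy as np


def split_random_copy(matrix, percent_train=70, percent_test=15, seed=None):
    """Return (training, testing, validation) without reordering `matrix`."""
    if percent_train < 0 or percent_test < 0 or percent_train + percent_test > 100:
        raise ValueError("percentages must be non-negative and sum to at most 100")

    rows = matrix.shape[0]
    rng = np.random.default_rng(seed)
    order = rng.permutation(rows)  # shuffled row indices; matrix itself is untouched

    end_training = int(rows * percent_train / 100)
    end_testing = end_training + int(rows * percent_test / 100)

    # Indexing with an index array returns copies, not views of matrix.
    return (matrix[order[:end_training]],
            matrix[order[end_training:end_testing]],
            matrix[order[end_testing:]])


matrix = np.arange(200).reshape(100, 2)
original = matrix.copy()
training, testing, validation = split_random_copy(matrix, 80, 10, seed=0)

print(training.shape, testing.shape, validation.shape)
print(np.array_equal(matrix, original))  # the input keeps its row order
```

    Passing a fixed seed makes the split repeatable, which helps when comparing models on the same partition.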
  • 2021-02-05 16:26

    This complements HYRY's answer for the case where you want to shuffle several arrays x, y, z consistently, all with the same first dimension: x.shape[0] == y.shape[0] == z.shape[0] == n_samples.

    You can do:

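    import numpy as np
    
    rng = np.random.RandomState(42)  # reproducible results with a fixed seed
    indices = np.arange(n_samples)  # n_samples == x.shape[0]
    rng.shuffle(indices)  # shuffles the indices in place
    # Index every array with the same permutation so rows stay aligned.
    x_shuffled = x[indices]
    y_shuffled = y[indices]
    z_shuffled = z[indices]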
    

    And then proceed with the split of each shuffled array as in HYRY's answer.
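
    To illustrate why the shared index array matters, here is a small sketch with made-up features x and labels y, built so that each label can be recovered from its row. The data, the 60% / 20% / 20% cut points, and the train/val/test names are assumptions for the example only.

```python
import numpy as np

n_samples = 10
x = np.arange(n_samples * 3).reshape(n_samples, 3)  # hypothetical features
y = np.arange(n_samples)  # hypothetical labels; y[i] belongs to x[i]

rng = np.random.RandomState(42)
indices = np.arange(n_samples)
rng.shuffle(indices)

# The same permutation is applied to both arrays.
x_shuffled, y_shuffled = x[indices], y[indices]

# Integer cut points for a 60% / 20% / 20% split.
a, b = (n_samples * 60) // 100, (n_samples * 80) // 100
x_train, x_val, x_test = x_shuffled[:a], x_shuffled[a:b], x_shuffled[b:]
y_train, y_val, y_test = y_shuffled[:a], y_shuffled[a:b], y_shuffled[b:]

# Row i of x starts with 3 * i, so dividing by 3 recovers its original label.
print(np.array_equal(x_train[:, 0] // 3, y_train))
```

    Shuffling x and y with two separate calls would give each array its own permutation and break the pairing between rows and labels.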

  • 2021-02-05 16:33

    You can use numpy.random.shuffle, which shuffles the rows of the array in place, and then slice the result:

    import numpy as np
    
    N = 4601
    data = np.arange(N*58).reshape(-1, 58)
    np.random.shuffle(data)
    
    a = data[:int(N*0.6)]
    b = data[int(N*0.6):int(N*0.8)]
    c = data[int(N*0.8):]
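
    The three slices can also be produced in one call with np.split, which takes the row indices at which to cut. This is a sketch of an equivalent alternative, not a change to the approach above.

```python
import numpy as np

N = 4601
data = np.arange(N * 58).reshape(-1, 58)
np.random.shuffle(data)  # shuffles the rows in place

# np.split cuts along axis 0 at the given row indices and returns the pieces in order.
a, b, c = np.split(data, [int(N * 0.6), int(N * 0.8)])

print(a.shape, b.shape, c.shape)
```

    With N = 4601 the pieces have 2760, 920, and 921 rows, the same sizes the explicit slices produce.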
    