How to split data into a training set and a test set randomly?

花落未央 2020-12-07 16:27

I have a large dataset and want to split it into a training set (50%) and a test set (50%).

Say I have 100 examples stored in the input file, with one example per line.

9 Answers
  • 2020-12-07 17:25

    You could also use NumPy. If your data is stored in a numpy.ndarray:

    import numpy as np
    from random import sample

    l = 100  # number of examples in the data
    f = 50   # number of examples you want in the training set
    indices = sample(range(l), f)  # pick f distinct row indices at random

    train_data = data[indices]                    # rows selected for training
    test_data = np.delete(data, indices, axis=0)  # remaining rows; axis=0 keeps 2-D data intact
    
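    A variant of the same idea, as a minimal sketch, uses np.random.permutation so the test rows are obtained by indexing rather than deletion (data is assumed to be the same ndarray as above, one example per row):

    import numpy as np

    perm = np.random.permutation(len(data))   # shuffled row indices
    train_data = data[perm[:50]]              # first 50 shuffled rows -> training set
    test_data = data[perm[50:]]               # remaining rows -> test set
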
  • 2020-12-07 17:25

    To answer @desmond.carros's question, I modified the best answer as follows:

     import random

     with open("datafile.txt", "r") as file:
         data = [line.split() for line in file]  # split on whitespace; substitute your preferred delimiter

     random.shuffle(data)                        # shuffle the examples in place
     split = int(len(data) * 0.80)               # index separating the 80% / 20% partitions
     train_data = data[:split]                   # first 80% of the shuffled data -> training set
     test_data = data[split:]                    # remaining 20% -> test set
    

    The code splits the entire dataset into 80% training and 20% test data.

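    If you need to reuse the split (for instance for the 50/50 split from the question), the same logic can be wrapped in a small helper. This is just a sketch; the function name split_data and the train_fraction and seed parameters are my own choices, not from the original answer:

     import random

     def split_data(lines, train_fraction=0.5, seed=None):
         """Shuffle the examples and return (train, test) lists."""
         data = list(lines)
         random.Random(seed).shuffle(data)       # seedable shuffle for reproducible splits
         cut = int(len(data) * train_fraction)   # boundary between the two partitions
         return data[:cut], data[cut:]

     # 50/50 split of the 100 examples from the question
     with open("datafile.txt", "r") as file:
         train_data, test_data = split_data(file.readlines(), train_fraction=0.5, seed=42)
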
  • 2020-12-07 17:26

    The following produces more general k-fold cross-validation splits. Your 50-50 partitioning would be achieved by setting k=2 below; all you would have to do is pick one of the two partitions produced. Note: I haven't tested the code, but I'm pretty sure it should work.

    import random, math

    def k_fold(myfile, myseed=11109, k=3):
        # Load data: one example per line
        data = open(myfile).readlines()

        # Shuffle input (random.seed must be called, not assigned to)
        random.seed(myseed)
        random.shuffle(data)

        # Compute partition size given input k
        len_part = int(math.ceil(len(data) / float(k)))

        # Create one partition per fold: fold ii's slice is the test set,
        # and everything outside that slice is the training set
        train = {}
        test = {}
        for ii in range(k):
            test[ii] = data[ii * len_part:(ii + 1) * len_part]
            train[ii] = data[:ii * len_part] + data[(ii + 1) * len_part:]

        return train, test
    
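    A usage sketch for the 50/50 split from the question (assuming the input file is called datafile.txt, as in the other answer): set k=2 and keep one of the two folds.

    # k=2 gives two halves; fold 0's train/test pair is one 50/50 split
    train, test = k_fold("datafile.txt", myseed=11109, k=2)
    train_data = train[0]   # roughly 50% of the lines
    test_data = test[0]     # the remaining lines
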