I have a large dataset and want to split it into a training set (50%) and a testing set (50%).
Say I have 100 examples stored in the input file; each line contains one example.
You could also use numpy. When your data is stored in a numpy.ndarray:
import numpy as np
from random import sample

l = 100  # length of the data
f = 50   # number of elements you need in the training set
indices = sample(range(l), f)  # f distinct indices, sampled without replacement

train_data = data[indices]                    # data is your numpy.ndarray of examples
test_data = np.delete(data, indices, axis=0)  # everything not sampled
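For completeness, a minimal end-to-end sketch of the same idea, assuming the examples live one per line in a hypothetical datafile.txt:

import numpy as np
from random import sample

# Hypothetical input file: one example per line
data = np.array(open("datafile.txt").read().splitlines())

l = len(data)                  # length of the data
f = l // 2                     # 50% for the training set
indices = sample(range(l), f)  # f distinct indices, sampled without replacement

train_data = data[indices]                    # the sampled half
test_data = np.delete(data, indices, axis=0)  # the remaining half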
To answer @desmond.carros's question, I modified the best answer as follows:
import random

file = open("datafile.txt", "r")
data = list()
for line in file:
    data.append(line.split())  # split on your preferred delimiter
file.close()

random.shuffle(data)

train_data = data[:int((len(data)+1)*.80)]  # first 80% goes to the training set
test_data = data[int((len(data)+1)*.80):]   # remaining 20% goes to the test set
The code splits the entire dataset into 80% training and 20% test data.
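If scikit-learn is available, the same kind of split can be done in one call. A minimal sketch, assuming data is the list built above and using the 50/50 ratio from the question (the test_size and random_state values are just examples):

from sklearn.model_selection import train_test_split

# data is assumed to be the list of examples built above
train_data, test_data = train_test_split(data, test_size=0.5, random_state=42)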
The following produces more general k-fold cross-validation splits. Your 50-50 partitioning would be achieved by setting k=2
below; all you would have to do is pick one of the two partitions produced. Note: I haven't tested the code, but I'm pretty sure it should work.
import random, math

def k_fold(myfile, myseed=11109, k=3):
    # Load data
    data = open(myfile).readlines()

    # Shuffle input
    random.seed(myseed)
    random.shuffle(data)

    # Compute partition size given input k
    len_part = int(math.ceil(len(data) / float(k)))

    # Create one partition per fold
    train = {}
    test = {}
    for ii in range(k):
        test[ii] = data[ii*len_part:ii*len_part + len_part]
        # everything not in the test fold (assumes the lines are unique)
        train[ii] = [jj for jj in data if jj not in test[ii]]

    return train, test
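An untested usage sketch (the file name is an assumption): with k=2 the function returns two complementary halves, so picking either fold gives the 50/50 split from the question.

train, test = k_fold("datafile.txt", myseed=11109, k=2)
train_data, test_data = train[0], test[0]  # fold 1 is the mirror split
print(len(train_data), len(test_data))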