I have a large dataset and want to split it into training(50%) and testing set(50%).
Say I have 100 examples stored the input file, each line contains one example.
Well first of all there's no such thing as "arrays" in Python, Python uses lists and that does make a difference, I suggest you use NumPy which is a pretty good library for Python and it adds a lot of Matlab-like functionality.You can get started here Numpy for Matlab users
You can try this approach
import pandas
import sklearn
csv = pandas.read_csv('data.csv')
train, test = sklearn.cross_validation.train_test_split(csv, train_size = 0.5)
UPDATE: train_test_split
was moved to model_selection
so the current way (scikit-learn 0.22.2) to do it is this:
import pandas
import sklearn
csv = pandas.read_csv('data.csv')
train, test = sklearn.model_selection.train_test_split(csv, train_size = 0.5)
This can be done similarly in Python using lists, (note that the whole list is shuffled in place).
import random
with open("datafile.txt", "rb") as f:
data = f.read().split('\n')
random.shuffle(data)
train_data = data[:50]
test_data = data[50:]
from sklearn.model_selection import train_test_split
import numpy
with open("datafile.txt", "rb") as f:
data = f.read().split('\n')
data = numpy.array(data) #convert array to numpy type array
x_train ,x_test = train_test_split(data,test_size=0.5) #test_size=0.5(whole_data)
sklearn.cross_validation
is deprecated since version 0.18, instead you should use sklearn.model_selection
as show below
from sklearn.model_selection import train_test_split
import numpy
with open("datafile.txt", "rb") as f:
data = f.read().split('\n')
data = numpy.array(data) #convert array to numpy type array
x_train ,x_test = train_test_split(data,test_size=0.5) #test_size=0.5(whole_data)
A quick note for the answer from @subin sahayam
import random
file=open("datafile.txt","r")
data=list()
for line in file:
data.append(line.split(#your preferred delimiter))
file.close()
random.shuffle(data)
train_data = data[:int((len(data)+1)*.80)] #Remaining 80% to training set
test_data = data[int(len(data)*.80+1):] #Splits 20% data to test set
If your list size is a even number, you should not add the 1 in the code below. Instead, you need to check the size of the list first and then determine if you need to add the 1.
test_data = data[int(len(data)*.80+1):]