How to split data into trainset and testset randomly?


花落未央

I have a large dataset and want to split it randomly into a training set (50%) and a testing set (50%).

Say I have 100 examples stored in the input file, with one example per line.


9 answers

广开言路

2020-12-07 16:59

First of all, Python has no built-in array type like MATLAB's. It uses lists, and that does make a difference. I suggest using NumPy, a good Python library that adds a lot of MATLAB-like functionality. You can get started with the guide "NumPy for MATLAB users".
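As a rough illustration of the NumPy route, here is a minimal sketch that shuffles row indices and cuts them in half. The helper name `split_indices`, the seed value, and the placeholder examples are assumptions made for this example, not part of the original answer.

```python
import numpy as np


def split_indices(n_examples, train_fraction=0.5, seed=0):
    """Return two disjoint index arrays covering range(n_examples)."""
    rng = np.random.default_rng(seed)      # seeded generator for repeatable splits
    order = rng.permutation(n_examples)    # random ordering of 0..n_examples-1
    cut = int(n_examples * train_fraction)
    return order[:cut], order[cut:]


# Hypothetical data standing in for the 100 lines of the input file.
examples = np.array([f"example {i}" for i in range(100)])

train_idx, test_idx = split_indices(len(examples))
train_set = examples[train_idx]
test_set = examples[test_idx]
```

Splitting indices rather than the data itself makes it easy to apply the same split to several parallel arrays, such as features and labels.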


隐瞒了意图╮

2020-12-07 17:10

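You can try this approach:

```
# Legacy API: sklearn.cross_validation was deprecated in 0.18 and removed in 0.20
import pandas
from sklearn.cross_validation import train_test_split

df = pandas.read_csv('data.csv')
train, test = train_test_split(df, train_size=0.5)
```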

UPDATE: train_test_split was moved to model_selection, so the current way to do it (as of scikit-learn 0.22.2) is:

```
import pandas
from sklearn.model_selection import train_test_split

df = pandas.read_csv('data.csv')
train, test = train_test_split(df, train_size=0.5)
```


深忆病人

2020-12-07 17:11
This can be done similarly in Python using lists. Note that the whole list is shuffled in place.
```
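import random

# Open in text mode so each line is a str (binary mode yields bytes in Python 3)
with open("datafile.txt", "r") as f:
    data = f.read().splitlines()

random.shuffle(data)  # shuffles the list in place

# With 100 examples, the first 50 shuffled lines train and the rest test
train_data = data[:50]
test_data = data[50:]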
```
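If the file does not hold exactly 100 lines, or you want the same split on every run, the idea above can be generalized. The following is only a sketch: the function name `split_examples`, its parameters, and the placeholder lines are assumptions made for illustration.

```python
import random


def split_examples(examples, train_fraction=0.5, seed=None):
    """Return (train, test) lists; the input sequence is left untouched."""
    rng = random.Random(seed)      # private generator, so a seed gives repeatable splits
    shuffled = list(examples)      # copy, so the caller's list keeps its order
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Hypothetical stand-in for the lines read from the input file.
lines = [f"example {i}" for i in range(100)]

train_data, test_data = split_examples(lines, train_fraction=0.5, seed=42)
print(len(train_data), len(test_data))  # prints: 50 50
```

Using one cut index for both slices guarantees that every line lands in exactly one of the two sets.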

伪装坚强ぢ

2020-12-07 17:14

```
from sklearn.model_selection import train_test_split
import numpy

# Open in text mode so each line is a str (binary mode yields bytes in Python 3)
with open("datafile.txt", "r") as f:
    data = f.read().splitlines()

data = numpy.array(data)  # convert the list to a NumPy array

# test_size=0.5 puts half of the data in the test set
x_train, x_test = train_test_split(data, test_size=0.5)
```


攒了一身酷

2020-12-07 17:21

sklearn.cross_validation has been deprecated since version 0.18 and was removed in 0.20. Use sklearn.model_selection instead, as shown below:

```
from sklearn.model_selection import train_test_split
import numpy

# Open in text mode so each line is a str (binary mode yields bytes in Python 3)
with open("datafile.txt", "r") as f:
    data = f.read().splitlines()

data = numpy.array(data)  # convert the list to a NumPy array

# test_size=0.5 puts half of the data in the test set
x_train, x_test = train_test_split(data, test_size=0.5)
```


猫巷女王i

2020-12-07 17:22
A quick note on the answer from @subin sahayam:
```
import random

data = []
with open("datafile.txt", "r") as file:
    for line in file:
        # "," is only a placeholder; use your preferred delimiter
        data.append(line.rstrip("\n").split(","))

random.shuffle(data)
train_data = data[:int((len(data)+1)*.80)]  # roughly the first 80% for the training set
test_data = data[int(len(data)*.80+1):]  # roughly the remaining 20% for the test set
```
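Be careful with the +1 in the line below. Depending on the list size, it can leave an example out of both sets: with 100 items (an even size), the training slice ends at index 80 but the test slice starts at index 81, so the item at index 80 is lost. Check the size of the list first to decide whether the +1 is needed, or compute one split index and use it for both slices.

test_data = data[int(len(data)*.80+1):]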
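One way to sidestep the +1 question entirely is to compute a single cut index and reuse it for both slices. This is a minimal sketch under that assumption; the helper name `split_at` is made up for the example, and the data is assumed to be shuffled already.

```python
def split_at(data, train_fraction=0.80):
    """Slice an already-shuffled list at one index, so no item is lost or repeated."""
    cut = int(len(data) * train_fraction)
    return data[:cut], data[cut:]


# Check a few even and odd sizes: the two parts always rebuild the original list.
for size in (99, 100, 101, 105):
    data = list(range(size))
    train_data, test_data = split_at(data)
    assert train_data + test_data == data
```

Because both slices share the same boundary, the split is exact for any list size, with no special case for even or odd lengths.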