How to split data into training and test sets randomly?

花落未央 · 2020-12-07 16:27

I have a large dataset and want to split it into a training set (50%) and a testing set (50%).

Say I have 100 examples stored in the input file, each line containing one example.

9 Answers
  • 2020-12-07 16:59

    Well, first of all, there's no such thing as "arrays" in Python; Python uses lists, and that does make a difference. I suggest you use NumPy, which is a pretty good library for Python that adds a lot of MATLAB-like functionality. You can get started here: NumPy for MATLAB users
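
    As a concrete illustration of that suggestion, here is a minimal sketch of a 50/50 split done with NumPy, assuming a file named "datafile.txt" (a placeholder) holding one example per line:

    import numpy as np

    with open("datafile.txt") as f:
        data = np.array(f.read().splitlines())

    rng = np.random.default_rng()         # NumPy's random number generator
    indices = rng.permutation(len(data))  # a random ordering of the row indices

    half = len(data) // 2
    train_data = data[indices[:half]]     # first half of the shuffled order
    test_data = data[indices[half:]]      # second half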

  • 2020-12-07 17:10

    You can try this approach:

    import pandas
    from sklearn.cross_validation import train_test_split  # pre-0.18 location (see update below)

    csv = pandas.read_csv('data.csv')
    train, test = train_test_split(csv, train_size=0.5)
    

    UPDATE: train_test_split was moved to model_selection, so the current way (as of scikit-learn 0.22.2) to do it is this:

    import pandas
    from sklearn.model_selection import train_test_split

    csv = pandas.read_csv('data.csv')
    train, test = train_test_split(csv, train_size=0.5)
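
    If you need the split to be reproducible across runs, train_test_split also accepts a random_state seed; a small sketch of the same call with a fixed seed:

    import pandas
    from sklearn.model_selection import train_test_split

    csv = pandas.read_csv('data.csv')
    # random_state fixes the shuffle seed, so the same rows land in each half every run
    train, test = train_test_split(csv, train_size=0.5, random_state=42)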
    
  • 2020-12-07 17:11

    This can be done similarly in Python using plain lists (note that random.shuffle shuffles the whole list in place).

    import random

    with open("datafile.txt", "r") as f:
        data = f.read().splitlines()  # one example per line

    random.shuffle(data)  # shuffles the list in place

    train_data = data[:50]
    test_data = data[50:]
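
    The hard-coded 50 assumes exactly 100 examples; reusing the shuffled data list from above, a sketch that derives the midpoint from the list length instead, so it works for any file size:

    half = len(data) // 2     # integer midpoint of however many examples were read
    train_data = data[:half]
    test_data = data[half:]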
    
  • 2020-12-07 17:14
    from sklearn.model_selection import train_test_split
    import numpy

    with open("datafile.txt", "r") as f:
        data = f.read().splitlines()

    data = numpy.array(data)  # convert the list to a NumPy array

    x_train, x_test = train_test_split(data, test_size=0.5)  # test_size=0.5 means half the data
    
  • 2020-12-07 17:21

    sklearn.cross_validation has been deprecated since version 0.18; you should use sklearn.model_selection instead, as shown below.

    from sklearn.model_selection import train_test_split
    import numpy

    with open("datafile.txt", "r") as f:
        data = f.read().splitlines()

    data = numpy.array(data)  # convert the list to a NumPy array

    x_train, x_test = train_test_split(data, test_size=0.5)  # test_size=0.5 means half the data
    
  • 2020-12-07 17:22

    A quick note on the answer from @subin sahayam:

    import random

    with open("datafile.txt", "r") as f:
        data = [line.split() for line in f]  # split each line on your preferred delimiter (whitespace here)

    random.shuffle(data)
    train_data = data[:int((len(data) + 1) * .80)]  # first 80% goes to the training set
    test_data = data[int(len(data) * .80 + 1):]     # intended remaining 20% for the test set (see note below)
    

    If your list size is an even number, you should not add the 1 in the code below. Instead, check the size of the list first and then decide whether you need to add the 1.

    test_data = data[int(len(data)*.80+1):]
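
    One way to sidestep the even/odd check entirely is to compute a single split index and use it on both sides of the slice, so no example is dropped or counted twice; a minimal sketch with a stand-in list:

    data = list(range(103))            # stand-in for the shuffled examples

    split_idx = int(len(data) * .80)   # floor of 80% of the list length
    train_data = data[:split_idx]      # first 80%
    test_data = data[split_idx:]       # remaining 20%, even or odd length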
