Split data into training and testing sets

Question

I want to replicate this tutorial to classify two groups, https://machinelearningmastery.com/develop-n-gram-multichannel-convolutional-neural-network-sentiment-analysis/, with a different dataset, but I have not managed to despite trying hard. I am new to programming, so I would appreciate any assistance or tips that could help.

My dataset is small (240 files for each group), and the files are named 01 - 0240.

I think the problem is around these lines of code:

    if is_trian and filename.startswith('cv9'):
        continue
    if not is_trian and not filename.startswith('cv9'):
        continue

and also these:

    trainy = [0 for _ in range(900)] + [1 for _ in range(900)]
    save_dataset([trainX,trainy], 'train.pkl')

    testY = [0 for _ in range(100)] + [1 for _ in range(100)]
    save_dataset([testX,testY], 'test.pkl')

Two errors were encountered so far:

Input arrays should have the same number of samples as target arrays. Found 483 input samples and 200 target samples.

Unable to open file (unable to open file: name = 'model.h5', errno = 2, error message = 'No such file or directory', flags = 0, o_flags = 0)

I would really appreciate any prompt help.

Thanks in advance.

Part of the code, for more clarity:

from os import listdir
from pickle import dump

# load all docs in a directory
def process_docs(directory, is_trian):
    documents = list()
    # walk through all files in the folder
    for filename in listdir(directory):
        # skip any transcript in the test set

I want to add an argument below that indicates whether to process the training or the testing files, just as the tutorial does. If there is another way, please share it.

        if is_trian and filename.startswith('----'):
            continue
        if not is_trian and not filename.startswith('----'):
            continue
        # create the full path of the file to open
        path = directory + '/' + filename
        # load the doc
        doc = load_doc(path)
        # clean doc
        tokens = clean_doc(doc)
        # add to list
        documents.append(tokens)
    return documents

# save a dataset to file
def save_dataset(dataset, filename):
    # use a context manager so the file is closed after writing
    with open(filename, 'wb') as f:
        dump(dataset, f)
    print('Saved: %s' % filename)

# load all training transcripts
healthy_docs = process_docs('PathToData/healthy', True)
sick_docs = process_docs('PathToData/sick', True)
trainX = healthy_docs + sick_docs
trainy = [0 for _ in range(len(healthy_docs))] + [1 for _ in range(len(sick_docs))]
save_dataset([trainX,trainy], 'train.pkl')

# load all test transcripts
healthy_docs = process_docs('PathToData/healthy', False)
sick_docs = process_docs('PathToData/sick', False)
testX = healthy_docs + sick_docs
testY = [0 for _ in range(len(healthy_docs))] + [1 for _ in range(len(sick_docs))]

save_dataset([testX,testY], 'test.pkl')

Answer 1:

You should post more of your code, but it sounds like your problem is curating the data. Say you have 240 files in a folder called 'healthy' and 240 files in a folder called 'sick'. Then you need to give every healthy sample the label 0 and every sick sample the label 1. Try something like:

from glob import glob
from sklearn.model_selection import train_test_split

# get the filenames for healthy people
xhealthy = glob('pathToData/healthy/*')

# give healthy people a label of 0
yhealthy = [0] * len(xhealthy)

# get the filenames of sick people
xsick = glob('pathToData/sick/*')

# give sick people a label of 1
ysick = [1] * len(xsick)

# combine the data
xdata = xhealthy + xsick
ydata = yhealthy + ysick

# create the training and test sets (10% of the files are held out for testing)
X_train, X_test, y_train, y_test = train_test_split(xdata, ydata, test_size=0.1)

Then train your model with X_train, y_train and test it with X_test, y_test, keeping in mind that X_train and X_test are just file names that still need processing. The more code you post, the more people can help with your question.
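Since the split lists hold only file names, one way to finish the job is to read each file into a list of tokens and then confirm that every set has as many samples as labels, which is the condition behind the "Found 483 input samples and 200 target samples" error. The following is a minimal sketch, not the tutorial's code: `load_docs` and `check_same_length` are hypothetical helpers, and the plain whitespace split stands in for whatever cleaning (such as the tutorial's `clean_doc`) you actually apply.

```python
from pathlib import Path


def load_docs(paths):
    """Read each file and return one list of whitespace-separated tokens per file."""
    docs = []
    for path in paths:
        # errors="ignore" drops undecodable bytes instead of raising
        text = Path(path).read_text(encoding="utf-8", errors="ignore")
        docs.append(text.split())
    return docs


def check_same_length(samples, labels, name):
    """Raise early if the number of samples and labels differ."""
    if len(samples) != len(labels):
        raise ValueError(
            f"{name}: {len(samples)} samples but {len(labels)} labels"
        )
```

With the split from the code above, this would be used as `trainX = load_docs(X_train)` followed by `check_same_length(trainX, y_train, 'train')`, and likewise for the test set.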

Answer 2:

I was able to solve the problem by separating the dataset into train and test sets manually and then labelling each set on its own. My current dataset is very small, so I will keep looking for a better solution for large datasets once I have the capacity. I am posting this answer to close the question.
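A manual split of this kind can also be scripted. The sketch below assumes each file name contains its number (for example `0217.txt`; the extension is an assumption) and sends files numbered at or above a cutoff to the test set. `file_number`, `manual_split`, and the cutoff of 217 are illustrative choices, not part of the original solution.

```python
import re


def file_number(filename):
    """Return the first run of digits in the file name as an int, or None."""
    match = re.search(r"\d+", filename)
    return int(match.group()) if match else None


def manual_split(filenames, first_test_number):
    """Put files numbered below first_test_number in train, the rest in test."""
    train, test = [], []
    for name in sorted(filenames):
        number = file_number(name)
        if number is None:
            continue  # skip names without a number, e.g. hidden system files
        if number >= first_test_number:
            test.append(name)
        else:
            train.append(name)
    return train, test


# Made-up names 0001.txt ... 0240.txt standing in for one group's folder
healthy = [f"{i:04d}.txt" for i in range(1, 241)]
healthy_train, healthy_test = manual_split(healthy, first_test_number=217)
print(len(healthy_train), len(healthy_test))  # 216 24
```

Running the same function on the second group and building each label list from the length of its own set keeps the sample and label counts equal.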

Source: https://stackoverflow.com/questions/56535955/split-data-into-training-and-testing

Tags

python

machine-learning

training-data