Question
I want to replicate this tutorial, https://machinelearningmastery.com/develop-n-gram-multichannel-convolutional-neural-network-sentiment-analysis/, to classify two groups using a different dataset, but I have not managed to despite trying hard. I am new to programming, so I would appreciate any assistance or tips that could help.
My dataset is small (240 files for each group), and the files are named 01 - 0240.
I think the problem is around these lines of code:
if is_trian and filename.startswith('cv9'):
    continue
if not is_trian and not filename.startswith('cv9'):
    continue
and also these:
trainy = [0 for _ in range(900)] + [1 for _ in range(900)]
save_dataset([trainX,trainy], 'train.pkl')
testY = [0 for _ in range(100)] + [1 for _ in range(100)]
save_dataset([testX,testY], 'test.pkl')
I have run into two errors so far:
Input arrays should have the same number of samples as target arrays. Found 483 input samples and 200 target samples.
Unable to open file (unable to open file: name = 'model.h5', errno = 2, error message = 'No such file or directory', flags = 0, o_flags = 0)
I would really appreciate any prompt help.
Thanks in advance.
Part of the code, for more clarity:
# load all docs in a directory
def process_docs(directory, is_trian):
    documents = list()
    # walk through all files in the folder
    for filename in listdir(directory):
        # skip any transcript in the test set
        # (I want to add an argument below to indicate whether to process the
        # training or testing files, just as mentioned in the tutorial. If
        # there is another way, please share it.)
        if is_trian and filename.startswith('----'):
            continue
        if not is_trian and not filename.startswith('----'):
            continue
        # create the full path of the file to open
        path = directory + '/' + filename
        # load the doc
        doc = load_doc(path)
        # clean doc
        tokens = clean_doc(doc)
        # add to list
        documents.append(tokens)
    return documents
# save a dataset to file
def save_dataset(dataset, filename):
    dump(dataset, open(filename, 'wb'))
    print('Saved: %s' % filename)
# load all training transcripts
healthy_docs = process_docs('PathToData/healthy', True)
sick_docs = process_docs('PathToData/sick', True)
trainX = healthy_docs + sick_docs
trainy = [0 for _ in range(len( healthy_docs ))] + [1 for _ in range(len( sick_docs ))]
save_dataset([trainX,trainy], 'train.pkl')
# load all test transcripts
healthy_docs = process_docs('PathToData/healthy', False)
sick_docs = process_docs('PathToData/sick', False)
testX = healthy_docs + sick_docs
testY = [0 for _ in range(len( healthy_docs ))] + [1 for _ in range(len( sick_docs ))]
save_dataset([testX,testY], 'test.pkl')
Answer 1:
You should post more of your code, but it sounds like your problem is curating the data. Say you have 240 files in a folder called 'healthy' and 240 files in a folder called 'sick'. Then you need to label all the healthy people with label 0 and all the sick people with label 1. Try something like:
from glob import glob
from sklearn.model_selection import train_test_split
#get the filenames for healthy people
xhealthy = [ fname for fname in glob( 'pathToData/healthy/*' )]
#give healthy people label of 0
yhealthy = [ 0 for i in range( len( xhealthy ))]
#get the filenames of sick people
xsick = [ fname for fname in glob( 'pathToData/sick/*')]
#give sick people label of 1
ysick = [ 1 for i in range( len( xsick ))]
#combine the data
xdata = xhealthy + xsick
ydata = yhealthy + ysick
#create the training and test set
X_train, X_test, y_train, y_test = train_test_split(xdata, ydata, test_size=0.1)
Then train your model with X_train, y_train and test it with X_test, y_test, keeping in mind that X_train and X_test are just file names that still need processing. The more code you post, the more people can help with your question.
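Since the split above returns file names rather than text, the files still have to be read before training. Here is a minimal sketch of that step, assuming the files are plain text encoded as UTF-8 (an assumption, adjust if yours differ). `load_texts` and `check_same_length` are hypothetical helpers, not part of the tutorial or of scikit-learn; the second one just fails early on the kind of sample/target mismatch reported in the question.

```python
from pathlib import Path


def load_texts(paths):
    """Read every file in `paths` and return the texts in the same order.

    Assumes UTF-8 plain-text files (an assumption, not from the question).
    """
    return [Path(p).read_text(encoding="utf-8") for p in paths]


def check_same_length(samples, labels):
    """Raise early if the inputs and the targets disagree in size."""
    if len(samples) != len(labels):
        raise ValueError(
            "Found %d input samples and %d target samples"
            % (len(samples), len(labels))
        )


# Hypothetical usage with the variables from the snippet above:
# X_train_text = load_texts(X_train)
# check_same_length(X_train_text, y_train)
```

The texts would then go through whatever cleaning and encoding the tutorial applies before they reach the model.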
Answer 2:
I was able to solve the problem by separating the dataset into train and test sets manually and then labelling each set on its own. My current dataset is very small, so I will keep looking for a better solution for large datasets once I have the capacity. Posting this to close the question.
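For anyone who wants to do the same manual split in code rather than by moving files, something along these lines might work. It is only a sketch: `manual_split` and `label_sets` are hypothetical helper names, the hold-out size of 24 files per group is an arbitrary choice, and the zero-padded file names in the demo are made up for illustration.

```python
def manual_split(filenames, n_test):
    """Split file names into (train, test), holding out the last n_test by sorted order."""
    ordered = sorted(filenames)
    if not 0 < n_test < len(ordered):
        raise ValueError("n_test must be between 1 and len(filenames) - 1")
    return ordered[:-n_test], ordered[-n_test:]


def label_sets(healthy, sick):
    """Concatenate the two groups and build matching labels (0 = healthy, 1 = sick)."""
    x = list(healthy) + list(sick)
    y = [0] * len(healthy) + [1] * len(sick)
    return x, y


# Demo with made-up names; in practice these would come from listing each folder.
names = [str(i).zfill(4) for i in range(1, 241)]
healthy_train, healthy_test = manual_split(names, 24)
sick_train, sick_test = manual_split(names, 24)

trainX, trainy = label_sets(healthy_train, sick_train)
testX, testY = label_sets(healthy_test, sick_test)
print(len(trainX), len(trainy), len(testX), len(testY))
```

Because the labels are built from the length of each list instead of hard-coded counts such as 900 and 100, the number of samples and the number of labels always agree.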
Source: https://stackoverflow.com/questions/56535955/split-data-into-training-and-testing