
Posted by 本小妞迷上赌 on 2020-02-19 13:29:48


1 First training (first.py)



  • Data file for the first training (trec06.csv)


  • Dictionary file
  • Training set and test set of the first training
  • Base (parent) model produced by the first training


#============================== load data ===============================
firstTrainingData <- read the CSV file at the path of the first training file
firstTrainingData['label'] <- replace label "spam" with 1 and label "ham" with 0
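
A minimal sketch of the loading step, assuming pandas and a CSV with "label" and "message" columns. The inline sample rows are hypothetical stand-ins for trec06.csv, whose path and contents the post does not show.

```python
import io

import pandas as pd

# Hypothetical stand-in for trec06.csv; the real file is not shown in the post.
SAMPLE_CSV = io.StringIO(
    "label,message\n"
    "spam,win a free prize now\n"
    "ham,meeting moved to friday\n"
)

first_training_data = pd.read_csv(SAMPLE_CSV)

# Map the text labels to integers: spam -> 1, ham -> 0.
first_training_data["label"] = first_training_data["label"].map({"spam": 1, "ham": 0})

print(first_training_data)
```

Any label other than "spam" or "ham" would become NaN with this mapping, which makes unexpected values easy to spot.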

#======================= split data =====================================
messages <- the values of column "message" in firstTrainingData
y <- the values of column "label" in firstTrainingData
messages_train, messages_test, y_train, y_test <- split messages and y with a fixed test ratio and a fixed random seed
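
The split can be sketched with scikit-learn's train_test_split. The post gives neither the ratio nor the seed, so TEST_RATIO and RANDOM_SEED below are assumed values, and the messages are made up for illustration.

```python
from sklearn.model_selection import train_test_split

# Made-up sample data; the post uses the "message" and "label" columns of trec06.csv.
messages = [
    "win a free prize now", "meeting moved to friday",
    "cheap pills online", "lunch at noon?",
    "claim your reward today", "see you at the gym",
    "free entry in a contest", "call me when you land",
]
y = [1, 0, 1, 0, 1, 0, 1, 0]

TEST_RATIO = 0.25   # assumption: not stated in the post
RANDOM_SEED = 1000  # assumption: not stated in the post

messages_train, messages_test, y_train, y_test = train_test_split(
    messages, y, test_size=TEST_RATIO, random_state=RANDOM_SEED
)

print(len(messages_train), len(messages_test))
```

Fixing random_state is what makes the split reproducible, so the saved test set stays the same between runs.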

#============= save train set and test set to csv file ==================
trainData <- combine y_train and messages_train as two columns
testData <- combine y_test and messages_test as two columns
save trainData and testData each to a local csv file
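
One possible way to write the two sets to disk, assuming pandas. The file names train.csv and test.csv and the temporary output directory are hypothetical; the post does not name the files.

```python
import os
import tempfile

import pandas as pd

# Made-up split results standing in for the output of the previous step.
messages_train = ["win a free prize now", "meeting moved to friday"]
y_train = [1, 0]
messages_test = ["cheap pills online"]
y_test = [1]

# Put the label and the message side by side as two columns.
train_data = pd.DataFrame({"label": y_train, "message": messages_train})
test_data = pd.DataFrame({"label": y_test, "message": messages_test})

out_dir = tempfile.mkdtemp()  # hypothetical output location
train_path = os.path.join(out_dir, "train.csv")
test_path = os.path.join(out_dir, "test.csv")

# index=False keeps the row index out of the file.
train_data.to_csv(train_path, index=False)
test_data.to_csv(test_path, index=False)
```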

#====================== CountVectorizer ================================
dictionary <- fit a CountVectorizer on messages to build the dictionary (vocabulary)
save dictionary to a local file

#======================== convert ===============================
x_train <- use the dictionary to transform messages_train into bag-of-words features
x_test <- use the dictionary to transform messages_test into bag-of-words features
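
A sketch of the dictionary and conversion steps with scikit-learn's CountVectorizer. Saving with pickle and the file name dictionary.pkl are assumptions; the post does not say how the dictionary is stored.

```python
import os
import pickle
import tempfile

from sklearn.feature_extraction.text import CountVectorizer

# Made-up messages for illustration.
messages_train = ["win a free prize now", "meeting moved to friday"]
messages_test = ["free prize meeting"]
messages = messages_train + messages_test

# Fit on all messages, as the pseudocode does, to build the vocabulary.
dictionary = CountVectorizer()
dictionary.fit(messages)

# Assumption: the dictionary is persisted with pickle.
dictionary_path = os.path.join(tempfile.mkdtemp(), "dictionary.pkl")
with open(dictionary_path, "wb") as f:
    pickle.dump(dictionary, f)
with open(dictionary_path, "rb") as f:
    restored = pickle.load(f)

# Each row is a message, each column the count of one vocabulary word.
x_train = dictionary.transform(messages_train)
x_test = dictionary.transform(messages_test)

print(x_train.shape, x_test.shape)
```

With the default tokenizer, single-character tokens such as "a" are dropped, so the vocabulary here has 8 words. The result of transform is a SciPy sparse matrix.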

input_dim <- the number of features (columns) of x_train
sgd <- create an SGD optimizer from keras with a learning rate of 0.2

model <- create a Sequential model from keras
add a Dense layer to the model with input dimension input_dim, output dimension 10, and relu activation
add a Dense layer to the model with input dimension 10, output dimension 1, and sigmoid activation
compile the model with sgd as the optimizer

train the model on x_train and y_train with epochs as 5 and batch_size as 10
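
The model described above might look like the following with tensorflow.keras. The post does not name the loss function, so binary_crossentropy is an assumption (a common choice for a single sigmoid output), as is metrics=["accuracy"], inferred from the later evaluation step. The training data is random and only serves to make the sketch runnable.

```python
import numpy as np
from tensorflow import keras

# Random stand-in for the bag-of-words features and 0/1 labels.
rng = np.random.default_rng(0)
x_train = rng.integers(0, 3, size=(40, 20)).astype("float32")
y_train = rng.integers(0, 2, size=(40,)).astype("float32")

input_dim = x_train.shape[1]
sgd = keras.optimizers.SGD(learning_rate=0.2)

model = keras.Sequential([
    keras.Input(shape=(input_dim,)),
    keras.layers.Dense(10, activation="relu"),    # hidden layer: input_dim -> 10
    keras.layers.Dense(1, activation="sigmoid"),  # output layer: 10 -> 1
])

# Assumption: loss and metrics are not given in the post.
model.compile(loss="binary_crossentropy", optimizer=sgd, metrics=["accuracy"])

history = model.fit(x_train, y_train, epochs=5, batch_size=10, verbose=0)
```

Older Keras releases spelled the learning-rate argument lr, which is the name the pseudocode uses.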

#========================= save model ===============================
save the model to a local file, including its structure and weights

#========================= evaluate ===============================
x_test_file, y_test_file <- load the saved test set file and transform it with the dictionary (see getDataAfterDect in section 2)
loss, accuracy <- evaluate the model on x_test_file and y_test_file
print loss and accuracy
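
Saving and evaluating can be sketched as follows, again with tensorflow.keras and random stand-in data. The file name first_model.keras is hypothetical, and the loss and metric are the same assumptions as in the pseudocode's compile step, which does not specify them.

```python
import os
import tempfile

import numpy as np
from tensorflow import keras

# Random stand-in for the dictionary-transformed data.
rng = np.random.default_rng(1)
x_train = rng.integers(0, 3, size=(40, 20)).astype("float32")
y_train = rng.integers(0, 2, size=(40,)).astype("float32")
x_test = rng.integers(0, 3, size=(10, 20)).astype("float32")
y_test = rng.integers(0, 2, size=(10,)).astype("float32")

model = keras.Sequential([
    keras.Input(shape=(20,)),
    keras.layers.Dense(10, activation="relu"),
    keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(
    loss="binary_crossentropy",  # assumption: not stated in the post
    optimizer=keras.optimizers.SGD(learning_rate=0.2),
    metrics=["accuracy"],
)
model.fit(x_train, y_train, epochs=1, batch_size=10, verbose=0)

# One file holds both the structure and the weights.
model_path = os.path.join(tempfile.mkdtemp(), "first_model.keras")  # hypothetical name
model.save(model_path)
restored = keras.models.load_model(model_path)

# With metrics=["accuracy"], evaluate returns the loss followed by the accuracy.
loss, accuracy = model.evaluate(x_test, y_test, verbose=0)
print("loss:", loss, "accuracy:", accuracy)
```

Reloading the file right after saving is a cheap check that the saved model reproduces the in-memory model's predictions.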

2 Load data transformed by the dictionary (loadtestdata.py)


  • File path parameter


  • Result after dictionary processing


function getDataAfterDect(data_path):
    data <- read the CSV file at data_path
    messages <- the values of column "message" in data
    label <- the values of column "label" in data
    dictionary <- load the saved dictionary from its local file path
    features <- use the dictionary to transform messages into bag-of-words features
    return features, label 
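
A runnable sketch of this function, assuming pandas for the CSV and pickle for the saved dictionary. DICTIONARY_PATH, WORK_DIR, and the demo files are hypothetical; the post gives no real paths or storage format.

```python
import os
import pickle
import tempfile

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical working directory and dictionary location.
WORK_DIR = tempfile.mkdtemp()
DICTIONARY_PATH = os.path.join(WORK_DIR, "dictionary.pkl")


def getDataAfterDect(data_path):
    """Read a CSV with "label" and "message" columns and vectorize the messages."""
    data = pd.read_csv(data_path)
    messages = data["message"].values
    label = data["label"].values
    # Assumption: the dictionary was saved with pickle.
    with open(DICTIONARY_PATH, "rb") as f:
        dictionary = pickle.load(f)
    features = dictionary.transform(messages)
    return features, label


# Tiny demo: create a dictionary file and a test set file, then load them back.
demo_messages = ["win a free prize now", "meeting moved to friday"]
with open(DICTIONARY_PATH, "wb") as f:
    pickle.dump(CountVectorizer().fit(demo_messages), f)

test_path = os.path.join(WORK_DIR, "test.csv")
pd.DataFrame({"label": [1, 0], "message": demo_messages}).to_csv(test_path, index=False)

features, label = getDataAfterDect(test_path)
print(features.shape, label)
```

Loading the same saved dictionary for every data file keeps the feature columns aligned with the ones the model was trained on.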