This article presents the pseudocode for the experimental part of the paper. The experiment is implemented in Python, using the Keras deep learning framework, and is divided into the following parts:
1 First training (first.py)
Functionality:
Given the input data file, the data is processed and split into a training set and a test set, which are written to local files. CountVectorizer is applied to the full set of email texts to vectorize them and build a dictionary. The bag-of-words model then converts the training-set email texts into bag-of-words features, which are used to train the model; the trained model is saved to a local file. Finally, the saved test set file is loaded to evaluate the model, after which the script finishes.
Input:
- Data file for the first training run (trec06.csv)
Output:
- Dictionary file
- Training set and test set from the first training run
- Parent model from the first training run
Pseudocode:
#============================== load data ===============================
firstTrainingData <- read the file based on the path of the first training file
change label "spam" to 1 and label "ham" to 0 in firstTrainingData
#======================= split data =====================================
messages <- the values of the "message" column in firstTrainingData
y <- the values of the "label" column in firstTrainingData
messages_train,messages_test,y_train,y_test <- split messages and y into training and test parts at a fixed ratio with a fixed random seed
#============= save train set and test set to csv file ==================
trainData <- merge y_train and messages_train columns
testData <- merge y_test and messages_test columns
save trainData and testData each to a local csv file
#====================== CountVectorizer ================================
dictionary <- fit CountVectorizer on messages to build a dictionary (vocabulary)
save dictionary to a local file
#======================== convert ===============================
x_train <- use the dictionary to transform messages_train into bag-of-words features
x_test <- use the dictionary to transform messages_test into bag-of-words features
#===================== train with SGD using keras ===================================
input_dim <- the number of features of x_train
sgd <- call SGD from keras with lr as 0.2
model <- call Sequential from keras to create a sequential model
add a Dense layer to the model with input dimension as input_dim, output dimension as 10, and the activation function as relu
add a Dense layer to the model with input dimension as 10, output dimension as 1, and the activation function as sigmoid
the model is compiled with sgd
the model is trained using x_train and y_train with epochs as 5 and batch_size as 10
#========================= save model ===============================
save the model to a local file, including the model structure and weights
#========================= evaluate ===============================
x_test_file,y_test_file <- load the saved test set file and transform its messages with the dictionary
loss, accuracy <- evaluate the model on x_test_file and y_test_file
print loss and accuracy
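A minimal runnable Python sketch corresponding to the pseudocode above is given below. It is an illustrative reconstruction, not the paper's original code: the file names (trec06.csv, train.csv, test.csv, dictionary.pkl, first_model.h5), the 80/20 split ratio, the random seed, the use of pickle for the dictionary, and the binary cross-entropy loss with an accuracy metric are assumptions; the SGD learning rate, layer sizes, activations, epochs, and batch size follow the pseudocode.

# first.py -- illustrative sketch; file names, split ratio, seed, and loss are assumptions
import pickle
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import SGD

# load data and map labels: spam -> 1, ham -> 0
firstTrainingData = pd.read_csv('trec06.csv')
firstTrainingData['label'] = firstTrainingData['label'].map({'spam': 1, 'ham': 0})

# split data at a fixed ratio with a fixed random seed (both assumed)
messages = firstTrainingData['message'].values
y = firstTrainingData['label'].values
messages_train, messages_test, y_train, y_test = train_test_split(
    messages, y, test_size=0.2, random_state=42)

# save train set and test set to csv files
pd.DataFrame({'label': y_train, 'message': messages_train}).to_csv('train.csv', index=False)
pd.DataFrame({'label': y_test, 'message': messages_test}).to_csv('test.csv', index=False)

# fit CountVectorizer on all messages and save the dictionary (pickle is assumed)
dictionary = CountVectorizer()
dictionary.fit(messages)
with open('dictionary.pkl', 'wb') as f:
    pickle.dump(dictionary, f)

# convert texts into bag-of-words features (dense arrays for Keras)
x_train = dictionary.transform(messages_train).toarray()
x_test = dictionary.transform(messages_test).toarray()

# train with SGD using keras
input_dim = x_train.shape[1]
sgd = SGD(lr=0.2)  # 'learning_rate' in newer Keras versions
model = Sequential()
model.add(Dense(10, input_dim=input_dim, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
# loss and metric are assumptions consistent with the sigmoid output and the printed accuracy
model.compile(optimizer=sgd, loss='binary_crossentropy', metrics=['accuracy'])
model.fit(x_train, y_train, epochs=5, batch_size=10)

# save model (structure and weights) to a local file
model.save('first_model.h5')

# evaluate on the saved test set file
testData = pd.read_csv('test.csv')
x_test_file = dictionary.transform(testData['message'].values).toarray()
y_test_file = testData['label'].values
loss, accuracy = model.evaluate(x_test_file, y_test_file)
print('loss:', loss, 'accuracy:', accuracy)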
2 Loading data transformed by the dictionary (loadtestdata.py)
Functionality:
Given a file path, the function reads the file, loads the dictionary to transform the data that was read, and returns the transformed result.
Input:
- File path parameter
Output:
- Result after dictionary transformation
Pseudocode:
#======================= preprocess: load the second round's test_data for model evaluation ===========
function getDataAfterDect(data_path):
    data <- read the file based on data_path
    messages <- the values of the "message" column in data
    label <- the values of the "label" column in data
    dictionary <- load the dictionary from its local file path
    features <- use the dictionary to transform messages into bag-of-words features
    return features, label
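loadtestdata.py can be sketched in the same way. The dictionary file name dictionary.pkl and the csv column names 'label' and 'message' are assumptions carried over from the sketch above.

# loadtestdata.py -- illustrative sketch; 'dictionary.pkl' is an assumed file name
import pickle
import pandas as pd

def getDataAfterDect(data_path):
    # read the csv file at data_path
    data = pd.read_csv(data_path)
    messages = data['message'].values
    label = data['label'].values
    # load the saved CountVectorizer dictionary
    with open('dictionary.pkl', 'rb') as f:
        dictionary = pickle.load(f)
    # transform the texts into bag-of-words features
    features = dictionary.transform(messages).toarray()
    return features, label

Called as features, label = getDataAfterDect('test.csv') (using the assumed file name from the sketch above), it returns the bag-of-words matrix and the labels, ready to be passed to model.evaluate.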
Source: CSDN
Author: 爱学习,多喝水,要听话
Link: https://blog.csdn.net/qq_37195179/article/details/104380670