------------------邮件数据预处理------------------
一:邮件数据读取
with open('emailSample1.txt','r') as fp:
content = fp.read() #一次读取了全部数据
print(content)
二:预处理操作
(一)预处理内容
预处理主要包括以下9个部分:
1. 将大小写统一成小写字母;
2. 移除所有HTML标签,只保留内容。
3. 将所有的网址替换为字符串 “httpaddr”.
4. 将所有的邮箱地址替换为 “emailaddr”
5. 将所有dollar符号($)替换为“dollar”.
6. 将所有数字替换为“number”
7. 将所有单词还原为词源,词干提取
8. 移除所有非文字类型
9. 去除空字符串‘’
(二)预处理实现读取邮件
import re
import nltk.stem as ns
def preprocessing(email):
#1. 将大小写统一成小写字母;
email = email.lower()
#2. 移除所有HTML标签,只保留内容
email = re.sub("<[^<>]>"," ",email) #找到<>标签进行替换,注意:我们匹配的<>标签中内部不能含有<>---<<>>---最小匹配,其他方法也可以实现
#3. 将所有的网址替换为字符串 “httpaddr”.
email = re.sub("(http|https)://[^\s]*","httpaddr",email) #匹配网址,直到遇到空格
#4. 将所有的邮箱地址替换为 “emailaddr”
email = re.sub('[^\s]+@[^\s]+', 'emailaddr', email) #匹配邮箱 @前后为空为截止
#5. 将所有dollar符号($)替换为“dollar”
email = re.sub('[\$]+', 'dollar', email)
#6. 将所有数字替换为“number”
email = re.sub('[0-9]+', 'number', email)
#7. 将所有单词还原为词源,词干提取 https://www.jb51.net/article/63732.htm
tokens = re.split('[ \@\$\/\#\.\-\:\&\*\+\=\[\]\?\!\(\)\{\}\,\'\"\>\_\<\;\%]', email) #使用以上字符进行切分内容为单词
tokenlist = []
s = ns.SnowballStemmer('english') #一个词干提取器对象
for token in tokens:
#8. 移除所有非文字类型
token = re.sub("[^a-zA-Z0-9]",'',token)
#9. 去除空字符串‘’
if not len(token):
continue
# print("---: ",token)
stemmed = s.stem(token) #获取词根 costs变为cost expecting变为expect
# print("++++++: ",stemmed)
tokenlist.append(stemmed)
return tokenlist
with open('emailSample1.txt','r') as fp:
content = fp.read()
email = preprocessing(content)
(三)将Email转化为词向量
将提取的单词转换成特征向量:
def email2VocabIndices(email,vocab):
"""
提取我们邮件email中的单词,在vocab单词表中出现的索引
"""
token = preprocessing(email) #获取预处理后的邮件单词列表
#获取邮件单词在垃圾邮件单词表的索引位置
index_list = [i for i in range(len(vocab)) if vocab[i] in token]
return index_list
def email2FeatureVector(email):
"""
将email转为词向量
"""
#读取提供给我们的单词(被认为是垃圾邮件的单词)
df = pd.read_table("vocab.txt",names=['words']) #读取数据
vocab = df.values #array类型
# print(vocab)
vector = np.zeros(vocab.shape[0])
#查看我们我们的邮件中是否存在这些单词,如果存在则返回索引(是在上面vocab中的索引)
index_list = email2VocabIndices(email,vocab)
print(index_list)
#将单词索引转向量
for i in index_list:
vector[i] = 1
return vector
with open('emailSample1.txt','r') as fp:
content = fp.read()
vector = email2FeatureVector(content)
print(vector)
[70, 85, 88, 161, 180, 237, 369, 374, 430, 478, 529, 530, 591, 687, 789, 793, 798, 809, 882, 915, 944, 960, 991, 1001, 1061, 1076, 1119, 1161, 1170, 1181, 1236, 1363, 1439, 1476, 1509, 1546, 1662, 1698, 1757, 1821, 1830, 1892, 1894, 1895]
[0. 0. 0. ... 0. 0. 0.]
print(vector.shape)
------------------垃圾邮件过滤参数问题(线性核函数)------------------
注意:data/spamTrain.mat是对邮件进行预处理后(自然语言处理)获得的向量。
一:数据读取
data1 = sio.loadmat("spamTrain.mat")
X,y = data1['X'],data1['y'].flatten()
data2 = sio.loadmat("spamTest.mat")
Xtest,ytest = data2['Xtest'],data2['ytest'].flatten()
print(X.shape,y.shape)
print(X) #每一行代表一个邮件样本,每个样本有1899个特征,特征为1表示在跟垃圾邮件有关的语义库中找到相关单词
print(y) # 每一行代表一个邮件样本,等于1表示为垃圾邮件
这里用于判断垃圾邮件的词汇,还是我们预处理中的vocab.txt的单词。
二:获取最佳准确率和参数C(线性核只有C)
def get_best_params(X,y,Xval,yval):
# C 和 σ 的候选值
Cvalues = [3, 0.03, 100, 0.01, 0.1, 0.3, 1, 10, 30] # 9
best_score = 0 #用于存放最佳准确率
best_params = 0 #用于存放参数C
for c in Cvalues:
clf = svm.SVC(c,kernel='linear')
clf.fit(X,y) # 用训练集数据拟合模型
score = clf.score(Xval,yval) # 用验证集数据进行评分
if score > best_score:
best_score = score
best_params = c
return best_score,best_params
best_score,best_params = get_best_params(X,y,Xtest,ytest)
print(best_score,best_params)
clf = svm.SVC(best_params,kernel="linear")
clf.fit(X,y)
score_train = clf.score(X,y)
score_test = clf.score(Xtest,ytest)
print(score_train,score_test)
三:根据我们训练的结果,判断最有可能被判为垃圾邮件的单词
best_params = 0.3
clf = svm.SVC(best_params,kernel="linear")
clf.fit(X,y) #进行训练
# 读取提供给我们的单词(被认为是垃圾邮件的单词)
df = pd.read_table("vocab.txt", names=['words']) # 读取数据
vocab = df.values # array类型
#先获取训练结果中,各个特征的重要性 ---- coef_:各特征的系数(重要性)https://www.cnblogs.com/xxtalhr/p/11166848.html
print(clf.coef_)
indices = np.argsort(clf.coef_).flatten()[::-1] #因为argsort是小到大排序,提取其对应的index(索引),我们设置步长为-1,倒序获取特征系数
print(indices) #打印索引
for i in range(15): #我们获取前15个最有可能判为垃圾邮件的单词
print("{}---{:0.6f}".format(vocab[indices[i]],clf.coef_.flatten()[indices[i]]))
------------------垃圾邮件判断------------------
一:邮件数据
(一)正常邮件
> Anyone knows how much it costs to host a web portal ?
>
Well, it depends on how many visitors you're expecting.
This can be anywhere from less than 10 bucks a month to a couple of $100.
You should checkout http://www.rackspace.com/ or perhaps Amazon EC2
if youre running something big..
To unsubscribe yourself from this mailing list, send an email to:
groupname-unsubscribe@egroups.com
Folks,
my first time posting - have a bit of Unix experience, but am new to Linux.
Just got a new PC at home - Dell box with Windows XP. Added a second hard disk
for Linux. Partitioned the disk and have installed Suse 7.2 from CD, which went
fine except it didn't pick up my monitor.
I have a Dell branded E151FPp 15" LCD flat panel monitor and a nVidia GeForce4
Ti4200 video card, both of which are probably too new to feature in Suse's default
set. I downloaded a driver from the nVidia website and installed it using RPM.
Then I ran Sax2 (as was recommended in some postings I found on the net), but
it still doesn't feature my video card in the available list. What next?
Another problem. I have a Dell branded keyboard and if I hit Caps-Lock twice,
the whole machine crashes (in Linux, not Windows) - even the on/off switch is
inactive, leaving me to reach for the power cable instead.
If anyone can help me in any way with these probs., I'd be really grateful -
I've searched the 'net but have run out of ideas.
Or should I be going for a different version of Linux such as RedHat? Opinions
welcome.
Thanks a lot,
Peter
--
Irish Linux Users' Group: ilug@linux.ie
http://www.linux.ie/mailman/listinfo/ilug for (un)subscription information.
List maintainer: listmaster@linux.ie
(二)垃圾邮件
Do You Want To Make $1000 Or More Per Week?
If you are a motivated and qualified individual - I
will personally demonstrate to you a system that will
make you $1,000 per week or more! This is NOT mlm.
Call our 24 hour pre-recorded number to get the
details.
000-456-789
I need people who want to make serious money. Make
the call and get the facts.
Invest 2 minutes in yourself now!
000-456-789
Looking forward to your call and I will introduce you
to people like yourself who
are currently making $10,000 plus per week!
000-456-789
3484lJGv6-241lEaN9080lRmS6-271WxHo7524qiyT5-438rjUv5615hQcf0-662eiDB9057dMtVl72
Best Buy Viagra Generic Online
Viagra 100mg x 60 Pills $125, Free Pills & Reorder Discount, Top Selling 100% Quality & Satisfaction guaranteed!
We accept VISA, Master & E-Check Payments, 90000+ Satisfied Customers!
http://medphysitcstech.ru
二:获取各个邮件的词向量特征(预处理获得)
with open('emailSample1.txt','r') as fp:
content = fp.read()
with open('emailSample2.txt','r') as fp:
content2 = fp.read()
with open('spamSample1.txt','r') as fp:
content3 = fp.read()
with open('spamSample2.txt','r') as fp:
content4 = fp.read()
vector = email2FeatureVector(content)
vector2 = email2FeatureVector(content2)
vector3 = email2FeatureVector(content3)
vector4 = email2FeatureVector(content4)
三:提供训练的参数,进行预测邮件是否为垃圾邮件
res = clf.predict(np.array([vector]))
res2 = clf.predict(np.array([vector2]))
res3 = clf.predict(np.array([vector3]))
res4 = clf.predict(np.array([vector4]))
print(res,res2,res3,res4)
来源:oschina
链接:https://my.oschina.net/u/4419131/blog/4287525