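3. Feature Extraction
We will use a feature extraction function whose code is similar to the earlier version. The function is as follows: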
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

def build_feature_matrix(documents, feature_type='frequency',
                         ngram_range=(1, 1), min_df=0.0, max_df=1.0):
    # Normalize the requested feature type before comparing it
    feature_type = feature_type.lower().strip()
    if feature_type == 'binary':
        # Bag of words with 0/1 occurrence flags
        vectorizer = CountVectorizer(binary=True, min_df=min_df,
                                     max_df=max_df, ngram_range=ngram_range)
    elif feature_type == 'frequency':
        # Bag of words with raw term counts
        vectorizer = CountVectorizer(binary=False, min_df=min_df,
                                     max_df=max_df, ngram_range=ngram_range)
    elif feature_type == 'tfidf':
        # TF-IDF weighted features
        vectorizer = TfidfVectorizer(min_df=min_df, max_df=max_df,
                                     ngram_range=ngram_range)
    else:
        raise Exception("Wrong feature type entered. Possible values: 'binary', 'frequency', 'tfidf'")
    # Learn the vocabulary and build the document-term matrix as floats
    feature_matrix = vectorizer.fit_transform(documents).astype(float)
    return vectorizer, feature_matrix
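As the function definition shows, it can extract bag-of-words features based on term frequency, binary occurrence, and TF-IDF. New in this version are the min_df, max_df, and ngram_range parameters, all of which are optional. The ngram_range parameter is useful when bigrams, trigrams, and so on should be added as additional features. The min_df parameter is a threshold in the range [0.0, 1.0], and features whose document frequency falls below it are ignored. The max_df parameter works the same way in the other direction and ignores features whose document frequency exceeds the threshold. The reasoning behind the upper bound is that words occurring in nearly all documents usually have little value for distinguishing between document types.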
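To make the effect of these three parameters concrete, the following is a minimal pure-Python sketch of n-gram generation and document-frequency filtering. It is an illustration only and not scikit-learn's implementation: the helper names word_ngrams and filter_by_document_frequency, the whitespace tokenization, and the three sample documents are all assumptions made for this example.

```python
from collections import Counter

def word_ngrams(tokens, ngram_range=(1, 1)):
    """Return all word n-grams for every n in the inclusive ngram_range."""
    low, high = ngram_range
    grams = []
    for n in range(low, high + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams

def filter_by_document_frequency(documents, ngram_range=(1, 1),
                                 min_df=0.0, max_df=1.0):
    """Keep terms whose share of documents lies within [min_df, max_df]."""
    n_docs = len(documents)
    doc_freq = Counter()
    for doc in documents:
        tokens = doc.lower().split()  # simplistic tokenizer (assumption)
        # Use a set so each term is counted at most once per document
        doc_freq.update(set(word_ngrams(tokens, ngram_range)))
    return sorted(term for term, df in doc_freq.items()
                  if min_df <= df / n_docs <= max_df)

# Hypothetical toy corpus
docs = ["the sky is blue", "the sun is bright", "the sun in the sky"]

vocab_all = filter_by_document_frequency(docs)
vocab_pruned = filter_by_document_frequency(docs, max_df=0.9)
vocab_bigrams = filter_by_document_frequency(docs, ngram_range=(1, 2), min_df=0.5)

print(vocab_all)
print(vocab_pruned)
print(vocab_bigrams)
```

In this toy corpus, "the" appears in all three documents, so max_df=0.9 drops it, while min_df=0.5 together with ngram_range=(1, 2) keeps only the unigrams and bigrams found in at least two of the three documents.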