What is Word2Vec
Word2Vec is a model that learns semantic knowledge from large amounts of text in an unsupervised way, and it is used heavily in natural language processing (NLP).
Word2Vec represents the semantic information of words as word vectors learned from text: it builds an embedding space in which semantically similar words lie close to each other. Intuitively, cat and kitten are semantically very close, dog is not quite as close to kitten, and iphone is further away from kitten still.
A Word2Vec pipeline really has two parts: first build and train the model, then use the trained model to obtain the embedded word vectors. The overall modelling process is very similar in spirit to an auto-encoder: we build a neural network on the training data, but once it is trained we do not use it to handle new tasks. What we actually want are the parameters the model has learned from the training data, for example the weight matrix of the hidden layer.
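To see why the hidden-layer weights are exactly the word vectors we are after, here is a tiny numpy sketch with toy, made-up sizes (W_hidden and one_hot are illustrative names, not part of the model code later in this post):
import numpy as np

vocab_size, embedding_size = 5, 3  # toy sizes for illustration
W_hidden = np.random.rand(vocab_size, embedding_size)  # hidden-layer weight matrix

one_hot = np.zeros(vocab_size)
one_hot[2] = 1.0  # one-hot vector for the word with id 2

# Multiplying the one-hot input by the weight matrix just selects row 2,
# i.e. the embedding vector of that word
print(np.allclose(one_hot @ W_hidden, W_hidden[2]))  # True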
Data preprocessing
The data preprocessing stage mainly includes:
- removing low-frequency words
- replacing special tokens (mapping them to UNK)
- building the word-to-id mapping table
Neural networks take numeric input, so to let the model compute efficiently each input word first has to be one-hot encoded. The resulting input vectors form a very sparse matrix; to avoid wasting resources, words that occur too rarely are removed and very frequent words are subsampled.
Word2Vec handles this high-frequency-word problem with subsampling. The basic idea is: every word we encounter in the original training text has some probability of being deleted from the text, and that probability is tied to the word's frequency.
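The code in this post only truncates the vocabulary and maps rare words to UNK; as a rough sketch of the subsampling rule just described (using the discard probability 1 - sqrt(t / f(w)) from the original word2vec paper, with an assumed threshold t = 1e-5; keep_word is a hypothetical helper, not used below):
import random

def keep_word(word_frequency, t=1e-5):
    # word_frequency: the word's share of the corpus (count / total number of words)
    # The more frequent the word, the more likely it is to be discarded
    p_discard = max(0.0, 1.0 - (t / word_frequency) ** 0.5)
    return random.random() >= p_discard

# A word that makes up 1% of the corpus is discarded about 97% of the time:
# 1 - sqrt(1e-5 / 0.01) ≈ 0.97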
from __future__ import division,print_function,absolute_import
import collections
import os
import random
import urllib.request
import zipfile
import numpy as np
import tensorflow as tf
max_vocabulary_size = 50000 # number of distinct words kept in the vocabulary
min_occurrence = 10 # remove every word that occurs fewer times than this
data_path = 'text8.zip'
# Unzip the file; the text has already been preprocessed
with zipfile.ZipFile(data_path) as f:
    text_words = f.read(f.namelist()[0]).lower().split()
# Build the word mapping table; out-of-vocabulary words will be replaced by UNK
count = [('UNK', -1)]
# Keep the most frequent words
count.extend(collections.Counter(text_words).most_common(max_vocabulary_size - 1))
# Remove words whose count is below min_occurrence
for i in range(len(count) - 1, -1, -1):
    if count[i][1] < min_occurrence:
        count.pop(i)
    else:
        break
# Compute the vocabulary size
vocabulary_size = len(count)
# Assign an id to every word
word2id = dict()
for i, (word, _) in enumerate(count):
    word2id[word] = i
data = list()
unk_count = 0
for word in text_words:
    # Words that are not in the dictionary are mapped to UNK (id 0)
    index = word2id.get(word, 0)
    if index == 0:
        unk_count += 1
    data.append(index)
count[0] = ('UNK', unk_count)
id2word = dict(zip(word2id.values(), word2id.keys()))
print("Words count:", len(text_words))
print("Unique words:", len(set(text_words)))
print("Vocabulary size:", vocabulary_size)
print("Most common words:", count[:10])
Output:
Words count: 17005207
Unique words: 253854
Vocabulary size: 50000
Most common words: [('UNK', 418391), (b'the', 1061396), (b'of', 593677), (b'and', 416629), (b'one', 411764), (b'in', 372201), (b'a', 325873), (b'to', 316376), (b'zero', 264975), (b'nine', 250430)]
Let's go over parts of this code in more detail:
1. The division, print_function and absolute_import imports from __future__ exist so that Python 2 code behaves like Python 3; if you are already running Python 3 they are not needed.
Reference: the purpose of from __future__ import absolute_import, division, print_function
from __future__ import division,print_function,absolute_import
2. To extract a single file from the archive we use ZipFile's read method: f.read(f.namelist()[0]) reads the first file in f.namelist(). lower() returns a copy of the string with every uppercase character converted to lowercase, and split() splits the string on whitespace (including \n) and returns the list of pieces.
References:
the Python zipfile module
an introduction to the split() function
with zipfile.ZipFile(data_path) as f:
    text_words = f.read(f.namelist()[0]).lower().split()
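For intuition, here is what those two calls do on a small byte string (f.read() returns bytes, so the resulting tokens are byte strings such as b'the'):
raw = b"The quick Brown fox\njumps over"
print(raw.lower().split())
# [b'the', b'quick', b'brown', b'fox', b'jumps', b'over']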
3. enumerate() wraps an iterable object (such as a list, tuple or string) into an indexed sequence, yielding each element together with its index; it is typically used in a for loop.
Reference: the Python enumerate() function
word2id = dict()
for i, (word, _) in enumerate(count):  # i becomes the word's id, (word, _) unpacks the (word, count) pair
    word2id[word] = i
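A minimal illustration of this loop on a tiny hand-written count list:
count_demo = [('UNK', -1), (b'the', 1061396), (b'of', 593677)]
for i, (word, _) in enumerate(count_demo):
    print(i, word)
# 0 UNK
# 1 b'the'
# 2 b'of'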
Building training samples
Skip-gram predicts the context from an input word, so one input word corresponds to multiple context words.
Suppose we have the sentence "The dog barked at the mailman".
First we pick a word in the middle of the sentence as our input word, for example "dog".
With the input word fixed, we define a parameter called skip_window, the number of words to take from each side (left or right) of the input word. If we set skip_window=2, we take up to 2 words to the left and 2 to the right of the input word, so the window we end up with (including the input word itself) is ['The', 'dog', 'barked', 'at'] here, because 'dog' only has one word to its left; in the code the full window size is span = 2 * skip_window + 1. Another parameter, num_skips, is how many different words we draw from the window as output words. With skip_window=2 and num_skips=2 we obtain two (input word, output word) training pairs, e.g. ('dog', 'barked') and ('dog', 'the').
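As a standalone sketch of this idea (it is not the next_batch function used later; it simply pairs the input word with every context word instead of sampling num_skips of them):
sentence = "The dog barked at the mailman".split()
skip_window = 2  # toy value for this sketch

pairs = []
for pos, input_word in enumerate(sentence):
    left = max(0, pos - skip_window)
    right = min(len(sentence), pos + skip_window + 1)
    for ctx in range(left, right):
        if ctx != pos:
            pairs.append((input_word, sentence[ctx]))

print(pairs[:4])
# [('The', 'dog'), ('The', 'barked'), ('dog', 'The'), ('dog', 'barked')]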
We define a next_batch function to produce the next batch of training data. Implementation:
data_index = 0
# Generate a training batch for the skip-gram model
def next_batch(batch_size, num_skips, skip_window):
    global data_index
    assert batch_size % num_skips == 0
    assert num_skips <= 2 * skip_window
    batch = np.ndarray(shape=(batch_size), dtype=np.int32)
    labels = np.ndarray(shape=(batch_size, 1), dtype=np.int32)
    # Window size (skip_window words on each side + the current word)
    span = 2 * skip_window + 1
    buffer = collections.deque(maxlen=span)
    if data_index + span > len(data):
        data_index = 0
    buffer.extend(data[data_index:data_index + span])
    data_index += span
    for i in range(batch_size // num_skips):
        context_words = [w for w in range(span) if w != skip_window]
        words_to_use = random.sample(context_words, num_skips)
        for j, context_word in enumerate(words_to_use):
            batch[i * num_skips + j] = buffer[skip_window]
            labels[i * num_skips + j, 0] = buffer[context_word]
        if data_index == len(data):
            buffer.extend(data[0:span])
            data_index = span
        else:
            buffer.append(data[data_index])
            data_index += 1
    # Back-track the index a little so that words at the end of the corpus are not skipped
    data_index = (data_index + len(data) - span) % len(data)
    return batch, labels
A few lines in this function are worth a closer look. global data_index lets the function update the module-level position in the corpus between calls; the two asserts guarantee that batch_size is a multiple of num_skips and that num_skips never asks for more than the 2 * skip_window context words available; and collections.deque(maxlen=span) is a fixed-size buffer that automatically drops the oldest word as the window slides forward.
global data_index
assert batch_size % num_skips == 0
assert num_skips <= 2 * skip_window
buffer = collections.deque(maxlen=span)
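Assuming the preprocessing code above has already been run (so data and id2word exist), a quick sanity check of one small batch might look like this:
batch, labels = next_batch(batch_size=8, num_skips=2, skip_window=1)
for i in range(8):
    print(id2word[batch[i]], '->', id2word[labels[i, 0]])
# prints 8 (input word -> context word) pairs drawn from the start of the corpus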
Building the model
To speed up training and improve the quality of the word vectors, the model updates its weights with negative sampling. Instead of updating every output weight for every training sample, negative sampling lets each sample update only a small fraction of the weights, which cuts down the work per gradient-descent step. Take the common illustration of a 10,000-word vocabulary with 300-dimensional vectors, i.e. 3 million output weights: with negative sampling we only update the weights for the positive word, say "quick", plus 5 sampled negative words, so 6 output neurons and 300 * 6 = 1800 weights per sample. That is roughly 0.06% of the 3 million weights, a large gain in computational efficiency.
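The TensorFlow code below leaves the actual sampling of negatives to tf.nn.nce_loss. Purely for intuition, a hand-rolled sketch of drawing negative word ids from the unigram distribution raised to the 3/4 power (the noise distribution used in the original word2vec paper; sample_negatives is a hypothetical helper, not part of the model below) might look like this:
import numpy as np

def sample_negatives(word_counts, num_negative, rng=np.random):
    # word_counts: array holding the raw count of each word id
    # Draw ids with probability proportional to count(w) ** 0.75
    probs = word_counts ** 0.75
    probs = probs / probs.sum()
    return rng.choice(len(word_counts), size=num_negative, p=probs)

# e.g. draw 5 negative ids from a toy 10-word vocabulary:
print(sample_negatives(np.array([50., 40., 30., 20., 10., 5., 5., 3., 2., 1.]), 5))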
# Word2Vec parameters
learning_rate = 0.1 # Gradient-descent learning rate (needed below when the optimizer is built)
embedding_size = 200 # Dimension of the embedding vector
skip_window = 3 # How many words to consider left and right
num_skips = 2 # How many times to reuse an input to generate a label
num_sampled = 64 # Number of negative examples to sample
# Input data (a batch of word ids)
X = tf.placeholder(tf.int32, shape=[None])
# Input labels (the context word id for each input)
Y = tf.placeholder(tf.int32, shape=[None, 1])
# Build the embedding ops on the CPU
with tf.device('/cpu:0'):
    # Create the embedding variable (each row represents one word's embedding vector)
    embedding = tf.Variable(tf.random_normal([vocabulary_size, embedding_size]))
    # Look up the embedding vector for each word id in X
    X_embed = tf.nn.embedding_lookup(embedding, X)
    # Variables for the NCE loss
    nce_weights = tf.Variable(tf.random_normal([vocabulary_size, embedding_size]))
    nce_biases = tf.Variable(tf.zeros([vocabulary_size]))
# Compute the average NCE loss for the batch
loss_op = tf.reduce_mean(
    tf.nn.nce_loss(weights=nce_weights,
                   biases=nce_biases,
                   labels=Y,
                   inputs=X_embed,
                   num_sampled=num_sampled,
                   num_classes=vocabulary_size))
# Define the optimizer
optimizer = tf.train.GradientDescentOptimizer(learning_rate)
train_op = optimizer.minimize(loss_op)
# Evaluation: compute the cosine similarity between the input embeddings and every word embedding
X_embed_norm = X_embed / tf.sqrt(tf.reduce_sum(tf.square(X_embed)))
embedding_norm = embedding / tf.sqrt(tf.reduce_sum(tf.square(embedding), 1, keepdims=True))
cosine_sim_op = tf.matmul(X_embed_norm, embedding_norm, transpose_b=True)
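Once training is done, the same nearest-neighbour ranking can also be reproduced offline with numpy on the fetched embedding matrix. This is only a hedged sketch: nearest_neighbors is a hypothetical helper, and embedding_matrix would be obtained with sess.run(embedding).
import numpy as np

def nearest_neighbors(embedding_matrix, word_id, top_k=8):
    # L2-normalize every row, then rank all words by cosine similarity to word_id
    norms = np.sqrt((embedding_matrix ** 2).sum(axis=1, keepdims=True))
    normed = embedding_matrix / norms
    sims = normed @ normed[word_id]
    return (-sims).argsort()[1:top_k + 1]  # position 0 is the word itself, so skip it
For example, after training it could be called as nearest_neighbors(sess.run(embedding), word2id[b'five']).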
Validating the model
The steps above have built the model graph, so now let's train it. To observe each stage of training more intuitively, we pick a few words and watch how their nearest neighbours change as training progresses. Do not print the neighbours of the validation words too often: the extra similarity computation is expensive and slow, so the code sets eval_step = 200000 and only runs this evaluation every 200,000 steps (the loss is printed every 10,000 steps).
Implementation:
# Training parameters (learning_rate = 0.1 was defined earlier, before the optimizer was built)
batch_size = 128
num_steps = 3000000
display_step = 10000
eval_step = 200000
# Words whose nearest neighbours we inspect during training
eval_words = ['five', 'of', 'going', 'hardware', 'american', 'britain']
# Variable initializer
init = tf.global_variables_initializer()
with tf.Session() as sess:
    # Initialize the variables
    sess.run(init)
    # Build the evaluation data (the vocabulary keys are byte strings, so encode the eval words)
    x_test = np.array([word2id[w.encode('utf-8')] for w in eval_words])
    average_loss = 0
    for step in range(1, num_steps + 1):
        # Get a new batch of data
        batch_x, batch_y = next_batch(batch_size, num_skips, skip_window)
        # Run one training step
        _, loss = sess.run([train_op, loss_op], feed_dict={X: batch_x, Y: batch_y})
        average_loss += loss
        # Print the average loss every display_step (10000) steps
        if step % display_step == 0 or step == 1:
            if step > 1:
                average_loss /= display_step
            print("Step " + str(step) + ", Average Loss= " + \
                  "{:.4f}".format(average_loss))
            average_loss = 0
        # Print the current nearest neighbours every eval_step (200000) steps
        if step % eval_step == 0 or step == 1:
            print("Evaluation...")
            sim = sess.run(cosine_sim_op, feed_dict={X: x_test})
            for i in range(len(eval_words)):
                top_k = 8  # number of nearest neighbors
                nearest = (-sim[i, :]).argsort()[1:top_k + 1]
                log_str = '"%s" nearest neighbors:' % eval_words[i]
                for k in range(top_k):
                    log_str = '%s %s,' % (log_str, id2word[nearest[k]])
                print(log_str)
Full code
# -*- coding: utf-8 -*-
"""
Created on Wed Mar 4 09:26:12 2020
@author: Administrator
"""
from __future__ import division,print_function,absolute_import
import collections
import os
import random
import urllib.request
import zipfile
import numpy as np
import tensorflow as tf
max_vocabulary_size = 50000 # number of distinct words kept in the vocabulary
min_occurrence = 10 # remove every word that occurs fewer times than this
data_path = 'text8.zip'
# Unzip the file; the text has already been preprocessed
with zipfile.ZipFile(data_path) as f:
    text_words = f.read(f.namelist()[0]).lower().split()
# Build the word mapping table; out-of-vocabulary words will be replaced by UNK
count = [('UNK', -1)]
# Keep the most frequent words
count.extend(collections.Counter(text_words).most_common(max_vocabulary_size - 1))
# Remove words whose count is below min_occurrence
for i in range(len(count) - 1, -1, -1):
    if count[i][1] < min_occurrence:
        count.pop(i)
    else:
        break
# Compute the vocabulary size
vocabulary_size = len(count)
# Assign an id to every word
word2id = dict()
for i, (word, _) in enumerate(count):
    word2id[word] = i
data = list()
unk_count = 0
for word in text_words:
    # Words that are not in the dictionary are mapped to UNK (id 0)
    index = word2id.get(word, 0)
    if index == 0:
        unk_count += 1
    data.append(index)
count[0] = ('UNK', unk_count)
id2word = dict(zip(word2id.values(), word2id.keys()))
print("Words count:", len(text_words))
print("Unique words:", len(set(text_words)))
print("Vocabulary size:", vocabulary_size)
print("Most common words:", count[:10])
data_index = 0
# Generate a training batch for the skip-gram model
def next_batch(batch_size, num_skips, skip_window):
    global data_index
    assert batch_size % num_skips == 0
    assert num_skips <= 2 * skip_window
    batch = np.ndarray(shape=(batch_size), dtype=np.int32)
    labels = np.ndarray(shape=(batch_size, 1), dtype=np.int32)
    # Window size (skip_window words on each side + the current word)
    span = 2 * skip_window + 1
    buffer = collections.deque(maxlen=span)
    if data_index + span > len(data):
        data_index = 0
    buffer.extend(data[data_index:data_index + span])
    data_index += span
    for i in range(batch_size // num_skips):
        context_words = [w for w in range(span) if w != skip_window]
        words_to_use = random.sample(context_words, num_skips)
        for j, context_word in enumerate(words_to_use):
            batch[i * num_skips + j] = buffer[skip_window]
            labels[i * num_skips + j, 0] = buffer[context_word]
        if data_index == len(data):
            buffer.extend(data[0:span])
            data_index = span
        else:
            buffer.append(data[data_index])
            data_index += 1
    # Back-track the index a little so that words at the end of the corpus are not skipped
    data_index = (data_index + len(data) - span) % len(data)
    return batch, labels
# Word2Vec parameters
learning_rate = 0.1 # Gradient-descent learning rate (needed below when the optimizer is built)
embedding_size = 200 # Dimension of the embedding vector
skip_window = 3 # How many words to consider left and right
num_skips = 2 # How many times to reuse an input to generate a label
num_sampled = 64 # Number of negative examples to sample
# Input data (a batch of word ids)
X = tf.placeholder(tf.int32, shape=[None])
# Input labels (the context word id for each input)
Y = tf.placeholder(tf.int32, shape=[None, 1])
# Build the embedding ops on the CPU
with tf.device('/cpu:0'):
    # Create the embedding variable (each row represents one word's embedding vector)
    embedding = tf.Variable(tf.random_normal([vocabulary_size, embedding_size]))
    # Look up the embedding vector for each word id in X
    X_embed = tf.nn.embedding_lookup(embedding, X)
    # Variables for the NCE loss
    nce_weights = tf.Variable(tf.random_normal([vocabulary_size, embedding_size]))
    nce_biases = tf.Variable(tf.zeros([vocabulary_size]))
# Compute the average NCE loss for the batch
loss_op = tf.reduce_mean(
    tf.nn.nce_loss(weights=nce_weights,
                   biases=nce_biases,
                   labels=Y,
                   inputs=X_embed,
                   num_sampled=num_sampled,
                   num_classes=vocabulary_size))
# Define the optimizer
optimizer = tf.train.GradientDescentOptimizer(learning_rate)
train_op = optimizer.minimize(loss_op)
# Evaluation: compute the cosine similarity between the input embeddings and every word embedding
X_embed_norm = X_embed / tf.sqrt(tf.reduce_sum(tf.square(X_embed)))
embedding_norm = embedding / tf.sqrt(tf.reduce_sum(tf.square(embedding), 1, keepdims=True))
cosine_sim_op = tf.matmul(X_embed_norm, embedding_norm, transpose_b=True)
# Training parameters (learning_rate = 0.1 was defined earlier, before the optimizer was built)
batch_size = 128
num_steps = 3000000
display_step = 10000
eval_step = 200000
# Words whose nearest neighbours we inspect during training
eval_words = ['five', 'of', 'going', 'hardware', 'american', 'britain']
# Variable initializer
init = tf.global_variables_initializer()
with tf.Session() as sess:
    # Initialize the variables
    sess.run(init)
    # Build the evaluation data (the vocabulary keys are byte strings, so encode the eval words)
    x_test = np.array([word2id[w.encode('utf-8')] for w in eval_words])
    average_loss = 0
    for step in range(1, num_steps + 1):
        # Get a new batch of data
        batch_x, batch_y = next_batch(batch_size, num_skips, skip_window)
        # Run one training step
        _, loss = sess.run([train_op, loss_op], feed_dict={X: batch_x, Y: batch_y})
        average_loss += loss
        # Print the average loss every display_step (10000) steps
        if step % display_step == 0 or step == 1:
            if step > 1:
                average_loss /= display_step
            print("Step " + str(step) + ", Average Loss= " + \
                  "{:.4f}".format(average_loss))
            average_loss = 0
        # Print the current nearest neighbours every eval_step (200000) steps
        if step % eval_step == 0 or step == 1:
            print("Evaluation...")
            sim = sess.run(cosine_sim_op, feed_dict={X: x_test})
            for i in range(len(eval_words)):
                top_k = 8  # number of nearest neighbors
                nearest = (-sim[i, :]).argsort()[1:top_k + 1]
                log_str = '"%s" nearest neighbors:' % eval_words[i]
                for k in range(top_k):
                    log_str = '%s %s,' % (log_str, id2word[nearest[k]])
                print(log_str)
Source: CSDN
Author: 沉迷游戏的鱼
Link: https://blog.csdn.net/qq_40499451/article/details/104661957