[Common NLP Techniques] Text Similarity (IV): Building Your Own Word2vec Model

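Preface

I originally thought the text-similarity topic would wrap up in two days; it ended up taking four. Today is the last post introducing the fundamentals of natural language processing, so let's close it out by building a custom embedding model.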

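Word2vec Model, Continued

Continuing yesterday's discussion and implementation of CBoW, today we let the neural network learn in order to build two-dimensional word embeddings.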

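Continuous Bag-of-Words (CBoW) Architecture, Continued

Because I ended up training the model the Tensorflow 1.X way, today's model definition differs somewhat from yesterday's. We first list every training pair (context, target) produced by the CBoW algorithm: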

# Build a CBoW (context, target) generator

from sklearn.feature_extraction.text import CountVectorizer

# set context_length
context_length = 2

# function to get cbows
def get_cbow_datapairs(tokens, context_length):
    cbows = list()
    for i, target in enumerate(tokens):
        if i < context_length:
            pass
        elif i < len(tokens) - context_length:
            context = tokens[i - context_length : i] + tokens[i + 1 : i + context_length + 1]
            vectoriser = CountVectorizer()
            vectoriser.fit_transform(context)
            # unique context words, sorted (get_feature_names() was removed in scikit-learn 1.2)
            context_no_order = vectoriser.get_feature_names_out()
            for word in context_no_order:
                cbows.append([word, target])
    return cbows

# generate data pairs (tokens is assumed to be the token list prepared earlier in this series)
cbows_data = get_cbow_datapairs(tokens, context_length)

# prints out dataset
for cbow in cbows_data:
    print(cbow)

In total we get 33 [context word, target word] pairs (the context word is the feature and the target word is the label):

Next, we one-hot encode the words in every training pair:

import numpy as np

# vocab maps each word to its index in the vocabulary
def get_onehot_list(word, vocab):
    onehot_encoded = [0] * len(vocab)
    if word in vocab:
        onehot_encoded[vocab[word]] = 1
    return onehot_encoded


X_train = list()
y_train = list()

# one-hot encode each data pair
for i in range(len(cbows_data)):
    X_train.append(get_onehot_list(cbows_data[i][0], vocab))
    y_train.append(get_onehot_list(cbows_data[i][1], vocab))
X_train = np.asarray(X_train)
y_train = np.asarray(y_train)
print("X_train: ", X_train, ", size: ", X_train.shape) # (33, 8)
print("y_train:", y_train, ", size: ", y_train.shape) # (33, 8)
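To make the encoding step concrete, here is a minimal sketch on a toy token list. The toy sentence and the helper `build_vocab` are assumptions made for illustration; the `vocab` used in this series may be built differently.

```python
def build_vocab(tokens):
    """Hypothetical helper: map each distinct token to an index (sorted order)."""
    return {word: idx for idx, word in enumerate(sorted(set(tokens)))}


def get_onehot_list(word, vocab):
    """Return a list of zeros with a single 1 at the word's index."""
    onehot_encoded = [0] * len(vocab)
    if word in vocab:
        onehot_encoded[vocab[word]] = 1
    return onehot_encoded


# toy data, not the corpus used in the post
toy_tokens = ["king", "is", "a", "man", "queen", "is", "a", "woman"]
toy_vocab = build_vocab(toy_tokens)
king_onehot = get_onehot_list("king", toy_vocab)

print(toy_vocab)
print(king_onehot)
```

A word that is missing from the vocabulary is encoded as an all-zero vector, which is the behaviour of `get_onehot_list` as written above.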

We use Tensorflow as the framework for building the network. Because I wrote today's shallow Word2vec network with Tensorflow 1.X syntax, readers on Tensorflow 2.X can add the following code:

import tensorflow.compat.v1 as tf
tf.disable_v2_behavior()

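Yesterday we fed all the contexts of a target word in together, so the input layer had dimension C x V, where C is twice the context length and V is the vocabulary size (8 in our example). Today we played a small trick when preparing the training data and "flattened" the original (context words, target word) groups into individual pairs, so the input layer simply has dimension V.
Next, we build the weights W1 and the bias b1 between the input layer and the hidden layer. In our example, a word embedding is the two-dimensional vector that a one-hot encoded word becomes once it is passed into the hidden layer. The part of the network from the input layer to the hidden layer is also called the encoder. The other part is the decoder, which maps the two-dimensional hidden vector back to a vector of the original dimension V and applies softmax to estimate a probability for each dimension, so that the output approaches the one-hot encoded target word.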

vocab_size = len(vocab) # number of distinct words (8 in our example)

x = tf.placeholder(tf.float32, shape = (None, vocab_size))
y_label = tf.placeholder(tf.float32, shape = (None, vocab_size))

# Build our model- Embedding Part
embed_dim = 2 # you can choose your own number
W1 = tf.Variable(tf.random_normal([vocab_size, embed_dim]))
b1 = tf.Variable(tf.random_normal([embed_dim])) #bias
hidden_repre = tf.add(tf.matmul(x, W1), b1)


W2 = tf.Variable(tf.random_normal([embed_dim, vocab_size]))
b2 = tf.Variable(tf.random_normal([vocab_size]))
predict = tf.nn.softmax(tf.add( tf.matmul(hidden_repre, W2), b2))
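The same forward pass can be traced in plain NumPy to check the shapes. This is only a sketch: the random weights and the chosen `word_id` are assumptions for illustration, not trained values. It also shows that a one-hot input simply selects one row of W1, so the hidden representation of a word is that row plus b1.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embed_dim = 8, 2

# randomly initialised parameters (illustrative only, nothing is trained here)
W1 = rng.normal(size=(vocab_size, embed_dim))
b1 = rng.normal(size=embed_dim)
W2 = rng.normal(size=(embed_dim, vocab_size))
b2 = rng.normal(size=vocab_size)


def softmax(z):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)


word_id = 3                        # arbitrary word index for the demo
x = np.zeros((1, vocab_size))
x[0, word_id] = 1.0                # one-hot input of shape (1, V)

hidden = x @ W1 + b1               # encoder: shape (1, embed_dim)
probs = softmax(hidden @ W2 + b2)  # decoder: shape (1, V), rows sum to 1

print(hidden.shape, probs.shape)
```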

Now it is time to train. The whole training process runs for 5000 epochs:

# Start training
import time

sess = tf.Session()
init = tf.global_variables_initializer()
sess.run(init)

# mean over the batch of the per-example cross-entropy (axis replaces the deprecated reduction_indices)
cross_entropy_loss = tf.reduce_mean(-tf.reduce_sum(y_label * tf.log(predict), axis = 1))
train_step = tf.train.GradientDescentOptimizer(0.1).minimize(cross_entropy_loss)

n_epochs = 5000
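# train for n_epochs epochs
# note: tf.device only affects ops created inside the block; the graph above is already built,
# so device placement here is left to Tensorflow's defaults
with tf.device("GPU:0"):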
    print("Training with GPU:")
    start = time.time()
    for n in range(n_epochs):
        sess.run(train_step, feed_dict = {x: X_train, y_label: y_train})
        # print("epoch {}: loss is {}".format(n, sess.run(cross_entropy_loss, feed_dict = {x: X_train, y_label: y_train})))
    print("Training is done! Time spent: {} s".format(time.time() - start))

Training finished in 12 seconds! Next, let's look at the word embedding of the word "king":

# predict word
vectors = sess.run(W1 + b1)
text_word = "king"
word_id = vocab[text_word]
print("word embedding of {} is {}".format(text_word, vectors[word_id]))

Its two-dimensional word embedding is as follows:

Next, we use t-SNE from the scikit-learn toolkit to plot every word in the vocabulary on a two-dimensional plane:

from sklearn.manifold import TSNE
from sklearn import preprocessing
import matplotlib.pyplot as plt


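# recent scikit-learn versions require perplexity to be smaller than the number of samples (8 words here)
model = TSNE(n_components = 2, perplexity = 5, random_state = 0)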
np.set_printoptions(suppress = True)
vectors = model.fit_transform(vectors)


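# the norm is a constructor argument; the second argument of fit_transform is y and is ignored
normalizer = preprocessing.Normalizer(norm = "l2")
vectors = normalizer.fit_transform(vectors)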

fig, ax = plt.subplots(figsize = (10, 8))
fig.suptitle("My Word Embeddings", fontsize = 20)
ax.set_xlim([-1.5, 1.5])
ax.set_ylim([-1.5, 1.5])
for token in tokens:
    print(token, vectors[vocab[token]][1])
    ax.annotate(token, (vectors[vocab[token]][0], vectors[vocab[token]][1] ))
plt.show()

From the plot we can see how the words are distributed, and we can also use cosine distance to find the closest word:

# idx2word and find_closest_cosine are helper functions that are not defined in this post
text_word = "queen"
closest_word_cos = idx2word(find_closest_cosine(vocab[text_word], vectors), vocab)
print("using cosine distance:", end = ' ')
print("'{}' is closest to '{}'".format(text_word, closest_word_cos))
# using cosine distance: 'queen' is closest to 'woman'
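The two helpers called above are not defined in this post. As a rough sketch of what they might look like, the versions below are one possible implementation: the author's actual functions may differ, and the toy vocabulary and vectors are made up for the demo.

```python
import numpy as np


def find_closest_cosine(word_index, vectors):
    """Index of the row with the highest cosine similarity to vectors[word_index], excluding itself."""
    vectors = np.asarray(vectors, dtype=float)
    query = vectors[word_index]
    norms = np.linalg.norm(vectors, axis=1) * np.linalg.norm(query)
    # guard against division by zero for all-zero rows
    sims = (vectors @ query) / np.where(norms == 0, 1.0, norms)
    sims[word_index] = -np.inf  # never return the query word itself
    return int(np.argmax(sims))


def idx2word(index, vocab):
    """Reverse lookup in a word -> index dictionary; returns None if the index is absent."""
    for word, idx in vocab.items():
        if idx == index:
            return word
    return None


# toy data, not the trained embeddings from the post
toy_vocab = {"king": 0, "queen": 1, "man": 2, "woman": 3}
toy_vectors = np.array([[1.0, 0.1], [0.2, 1.0], [0.9, 0.3], [0.1, 0.8]])
closest = idx2word(find_closest_cosine(toy_vocab["queen"], toy_vectors), toy_vocab)
print("'queen' is closest to '{}'".format(closest))
```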

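Skip-Gram Architecture

Another algorithm for extracting (context, target) pairs is the Skip-Gram model (SG), which uses the center word to infer the surrounding context sequence. Note that CBoW embeds each context word and then averages the results to obtain the word embedding held in the hidden layer, so the order of the context words does not matter. In Skip-Gram, by contrast, the order of the context words matters. For this algorithm we stay at the conceptual level and do not define the model and build word embeddings step by step as we did for CBoW.

The Skip-Gram model looks at the middle word to infer its context:

Image source: Practical Natural Language Processing by Sowmya Vajjala et al.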
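Although this post does not build a Skip-Gram model, generating its training pairs takes only a few lines. The sketch below uses a hypothetical helper `get_skipgram_datapairs` and a toy sentence, both of which are assumptions. Unlike the CBoW generator earlier, it also keeps words near the edges of the sentence.

```python
def get_skipgram_datapairs(tokens, context_length):
    """Hypothetical helper: emit one [center word, context word] pair per neighbour in the window."""
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - context_length)
        end = min(len(tokens), i + context_length + 1)
        for j in range(start, end):
            if j != i:  # skip the center word itself
                pairs.append([center, tokens[j]])
    return pairs


# toy sentence, not the corpus used in the post
toy_tokens = ["the", "king", "rules", "the", "land"]
skipgram_pairs = get_skipgram_datapairs(toy_tokens, context_length=1)
for pair in skipgram_pairs:
    print(pair)
```

Here the center word is the feature and each context word is a label, which is the reverse of the CBoW pairs.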

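Conclusion

Besides building a Word2vec model from scratch with frameworks such as Tensorflow or PyTorch, we can also use the Gensim package to customize our own word embedding models; interested readers can refer to the article link below. That is all for today, and the four-day introduction to text similarity is officially complete. Tomorrow we will quickly review deep learning concepts and important models, paving the way for building our own translator later on!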
