Day04 - End-to-End Speech Recognition: The Attention Mechanism

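If you have followed deep learning over the past few years, you have almost certainly heard of attention models and the well-known paper Attention Is All You Need. I won't go into the paper's details here, since thorough explanations and implementations are easy to find online. The focus here is how attention addresses the problem mentioned earlier: recognition quality degrades when the input sequence is too long.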

The core idea of attention can be summed up in one sentence:

Attention: At different steps, let a model "focus" on different parts of the input.

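With attention, the decoder works out at each decoding step which parts of the input sequence matter most, so the encoder no longer has to compress the whole input sequence into a single fixed-size context vector.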

Each decoder hidden state gets its own context vector. In other words, a decoder that runs for T steps produces T context vectors, each computed over all of the input frames. The process can be written mathematically as follows:

all encoder hidden states: $s_{1},s_{2},...,s_{m}$
decoder hidden state at time step t: $h_{t}$
Attention weights: apply a score function to the decoder hidden state $h_{t}$ at the current time step against every encoder hidden state, then pass the scores through a softmax function to get the importance of each encoder state $s_{i}$ to $h_{t}$
$\alpha_{k}(t)=\frac{\exp(score(h_{t},s_{k}))}{\sum_{i=1}^{m}\exp(score(h_{t},s_{i}))},\ k=1,\ldots,m$

Several score functions are in use, including dot-product, bilinear, and multi-layer perceptron.
Context vector: the weighted sum of the encoder hidden states $s_{k}$, using the attention weights as the weights
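
To make the score functions concrete, here is a minimal NumPy sketch of the three variants named above, followed by the softmax that turns scores into attention weights. The function names, the parameter matrices `W`, `W_h`, `W_s`, the vector `v`, and all dimensions are assumptions chosen for illustration. In a real model these parameters would be learned, and the exact form of the MLP score varies between papers.

```python
import numpy as np

def dot_score(h_t, S):
    # score(h_t, s_k) = h_t . s_k ; S has shape (m, d), h_t has shape (d,)
    return S @ h_t

def bilinear_score(h_t, S, W):
    # score(h_t, s_k) = h_t^T W s_k ; W has shape (d_h, d_s)
    return S @ (h_t @ W)

def mlp_score(h_t, S, W_h, W_s, v):
    # score(h_t, s_k) = v^T tanh(W_h h_t + W_s s_k) (one common additive form)
    return np.tanh(S @ W_s.T + W_h @ h_t) @ v

def softmax(scores):
    # Subtract the max before exponentiating for numerical stability
    e = np.exp(scores - np.max(scores))
    return e / e.sum()

# Hypothetical sizes: m encoder states of dimension d, MLP hidden size a
rng = np.random.default_rng(0)
m, d, a = 5, 4, 3
S = rng.normal(size=(m, d))      # encoder hidden states s_1..s_m
h_t = rng.normal(size=d)         # decoder hidden state at step t
W = rng.normal(size=(d, d))
W_h = rng.normal(size=(a, d))
W_s = rng.normal(size=(a, d))
v = rng.normal(size=a)

alpha_dot = softmax(dot_score(h_t, S))
alpha_bil = softmax(bilinear_score(h_t, S, W))
alpha_mlp = softmax(mlp_score(h_t, S, W_h, W_s, v))
```

Whichever score function is used, the result is one weight per encoder state, and the weights sum to 1.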

$c_{t}=\sum_{k=1}^{m}\alpha_{k}(t)s_{k}$
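
The two formulas can be combined into one attention step. The sketch below assumes a dot-product score and random placeholder states. `attention_context`, `S`, `H`, and the sizes are hypothetical names and values for illustration, not taken from any particular toolkit.

```python
import numpy as np

def attention_context(h_t, S):
    """Return (alpha, c_t) for one decoder state h_t over encoder states S."""
    scores = S @ h_t                      # dot-product score against every s_k
    scores = scores - scores.max()        # numerical stability for exp
    alpha = np.exp(scores) / np.exp(scores).sum()   # attention weights alpha_k(t)
    c_t = alpha @ S                       # c_t = sum_k alpha_k(t) * s_k
    return alpha, c_t

rng = np.random.default_rng(0)
m, d, T = 6, 4, 3                         # m encoder states, dimension d, T decoder steps
S = rng.normal(size=(m, d))               # encoder hidden states s_1..s_m
H = rng.normal(size=(T, d))               # decoder hidden states h_1..h_T

# One context vector per decoder step, each a different mixture of the same S
contexts = np.stack([attention_context(h, S)[1] for h in H])
print(contexts.shape)  # (3, 4)
```

Because each $c_{t}$ is recomputed from all encoder states, no single fixed-size vector has to carry the whole input.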