Day 29 Building your own translation program with a Transformer (Part 11): Decoder layer

Each decoder layer consists of the following sub-layers:

  1. Masked multi-head attention (with both a look-ahead mask and a padding mask); a short sketch of how the two masks are combined follows this list
  2. Multi-head attention (with a padding mask)
    V (value) and K (key) receive the encoder output as their inputs, while Q (query) receives the output of the masked multi-head attention sub-layer
  3. Point-wise feed-forward network

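As a quick reminder of how the two masks in the first sub-layer fit together, here is a minimal sketch. The create_* helpers follow the convention of the official TensorFlow tutorial this series is based on (a value of 1 in the mask means "do not attend to this position"):

import tensorflow as tf

def create_look_ahead_mask(size):
  # 1 above the diagonal: position i may not attend to any position j > i
  return 1 - tf.linalg.band_part(tf.ones((size, size)), -1, 0)

def create_padding_mask(seq):
  # 1 wherever the token id is 0 (padding); broadcastable to (batch, 1, 1, seq_len)
  seq = tf.cast(tf.math.equal(seq, 0), tf.float32)
  return seq[:, tf.newaxis, tf.newaxis, :]

tar = tf.constant([[5, 7, 3, 0, 0]])  # a toy target batch with two padding tokens
look_ahead_mask = create_look_ahead_mask(tf.shape(tar)[1])
combined_mask = tf.maximum(create_padding_mask(tar), look_ahead_mask)
# combined_mask blocks a position if it is either in the future or padding
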
Each of these sub-layers has a residual connection around it, followed by layer normalization.

The output of each sub-layer is LayerNorm(x + Sublayer(x)).

The normalization is done along the last axis, i.e. the d_model axis.
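
A minimal, self-contained sketch of this pattern (x and sublayer are placeholder names, not part of the DecoderLayer code below); note that tf.keras.layers.LayerNormalization normalizes over the last axis by default, which here is d_model:

import tensorflow as tf

d_model = 512
x = tf.random.uniform((64, 50, d_model))                      # (batch_size, seq_len, d_model)
sublayer = tf.keras.layers.Dense(d_model)                     # stand-in for attention / FFN
dropout = tf.keras.layers.Dropout(0.1)
layernorm = tf.keras.layers.LayerNormalization(epsilon=1e-6)  # normalizes along d_model

out = layernorm(x + dropout(sublayer(x), training=True))      # LayerNorm(x + Sublayer(x))
out.shape  # TensorShape([64, 50, 512]), same shape as x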

There are N decoder layers in the Transformer.

Since Q receives the output of the decoder's first attention block and K receives the encoder output, the attention weights represent the importance given to the decoder's input based on the encoder's output. In other words, the decoder predicts the next token by looking at the encoder output and self-attending to its own output.
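
A small shape sketch of this second attention block; the sizes 50 (target_seq_len), 43 (input_seq_len) and 64 (depth per head, for d_model=512 with 8 heads) are only illustrative values:

import tensorflow as tf

q = tf.random.uniform((64, 8, 50, 64))  # from the decoder: (batch, heads, target_seq_len, depth)
k = tf.random.uniform((64, 8, 43, 64))  # from the encoder: (batch, heads, input_seq_len, depth)
scores = tf.matmul(q, k, transpose_b=True) / tf.math.sqrt(64.0)
attention_weights = tf.nn.softmax(scores, axis=-1)
# attention_weights.shape == (64, 8, 50, 43): for every decoder position,
# a distribution over the encoder output, i.e. how important each input token is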

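The DecoderLayer below reuses the MultiHeadAttention class and the point_wise_feed_forward_network helper from the earlier days of this series. As a reminder, the feed-forward helper (as in the TensorFlow tutorial this series follows) is just two Dense layers, dff hidden units and then back to d_model:

import tensorflow as tf

def point_wise_feed_forward_network(d_model, dff):
  return tf.keras.Sequential([
      tf.keras.layers.Dense(dff, activation='relu'),  # (batch_size, seq_len, dff)
      tf.keras.layers.Dense(d_model)                  # (batch_size, seq_len, d_model)
  ])
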
class DecoderLayer(tf.keras.layers.Layer):
  def __init__(self, d_model, num_heads, dff, rate=0.1):
    super(DecoderLayer, self).__init__()

    self.mha1 = MultiHeadAttention(d_model, num_heads)
    self.mha2 = MultiHeadAttention(d_model, num_heads)

    self.ffn = point_wise_feed_forward_network(d_model, dff)

    self.layernorm1 = tf.keras.layers.LayerNormalization(epsilon=1e-6)
    self.layernorm2 = tf.keras.layers.LayerNormalization(epsilon=1e-6)
    self.layernorm3 = tf.keras.layers.LayerNormalization(epsilon=1e-6)

    self.dropout1 = tf.keras.layers.Dropout(rate)
    self.dropout2 = tf.keras.layers.Dropout(rate)
    self.dropout3 = tf.keras.layers.Dropout(rate)

  def call(self, x, enc_output, training,
           look_ahead_mask, padding_mask):
    # enc_output.shape == (batch_size, input_seq_len, d_model)

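    # Block 1: masked self-attention over the decoder input (look-ahead + padding mask)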
    attn1, attn_weights_block1 = self.mha1(x, x, x, look_ahead_mask)  # (batch_size, target_seq_len, d_model)
    attn1 = self.dropout1(attn1, training=training)
    out1 = self.layernorm1(attn1 + x)

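    # Block 2: attention over the encoder output; Q comes from out1, K and V from enc_output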
    attn2, attn_weights_block2 = self.mha2(
        enc_output, enc_output, out1, padding_mask)  # (batch_size, target_seq_len, d_model)
    attn2 = self.dropout2(attn2, training=training)
    out2 = self.layernorm2(attn2 + out1)  # (batch_size, target_seq_len, d_model)

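    # Block 3: point-wise feed-forward network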
    ffn_output = self.ffn(out2)  # (batch_size, target_seq_len, d_model)
    ffn_output = self.dropout3(ffn_output, training=training)
    out3 = self.layernorm3(ffn_output + out2)  # (batch_size, target_seq_len, d_model)

    return out3, attn_weights_block1, attn_weights_block2

sample_decoder_layer = DecoderLayer(512, 8, 2048)

sample_decoder_layer_output, _, _ = sample_decoder_layer(
    tf.random.uniform((64, 50, 512)), sample_encoder_layer_output,
    False, None, None)

sample_decoder_layer_output.shape  # (batch_size, target_seq_len, d_model)
TensorShape([64, 50, 512])
