Day 29 Building your own translation program with a Transformer (Part 11): Decoder layer

Each decoder layer consists of the following sub-layers:

  1. Masked multi-head attention (with both a look-ahead mask and a padding mask); a short sketch of how the two masks are combined follows this list
  2. Multi-head attention (with a padding mask)
    V (value) and K (key) receive the encoder output as their inputs, while Q (query) receives the output of the masked multi-head attention sub-layer
  3. Point-wise feed-forward network

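As a quick reminder of how the two masks in the first sub-layer fit together, here is a minimal sketch. The create_* helpers follow the convention of the official TensorFlow tutorial this series is based on (a value of 1 in the mask means "do not attend to this position"):

import tensorflow as tf

def create_look_ahead_mask(size):
  # 1 above the diagonal: position i may not attend to any position j > i
  return 1 - tf.linalg.band_part(tf.ones((size, size)), -1, 0)

def create_padding_mask(seq):
  # 1 wherever the token id is 0 (padding); broadcastable to (batch, 1, 1, seq_len)
  seq = tf.cast(tf.math.equal(seq, 0), tf.float32)
  return seq[:, tf.newaxis, tf.newaxis, :]

tar = tf.constant([[5, 7, 3, 0, 0]])  # a toy target batch with two padding tokens
look_ahead_mask = create_look_ahead_mask(tf.shape(tar)[1])
combined_mask = tf.maximum(create_padding_mask(tar), look_ahead_mask)
# combined_mask blocks a position if it is either in the future or padding
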
Each of these sub-layers has a residual connection around it, followed by layer normalization.

The output of each sub-layer is LayerNorm(x + Sublayer(x)).

The normalization is done along the last axis, i.e. the d_model axis.
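
A minimal, self-contained sketch of this pattern (x and sublayer are placeholder names, not part of the DecoderLayer code below); note that tf.keras.layers.LayerNormalization normalizes over the last axis by default, which here is d_model:

import tensorflow as tf

d_model = 512
x = tf.random.uniform((64, 50, d_model))                      # (batch_size, seq_len, d_model)
sublayer = tf.keras.layers.Dense(d_model)                     # stand-in for attention / FFN
dropout = tf.keras.layers.Dropout(0.1)
layernorm = tf.keras.layers.LayerNormalization(epsilon=1e-6)  # normalizes along d_model

out = layernorm(x + dropout(sublayer(x), training=True))      # LayerNorm(x + Sublayer(x))
out.shape  # TensorShape([64, 50, 512]), same shape as x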

There are N decoder layers in the Transformer.

Since Q receives the output of the decoder's first attention block and K receives the encoder output, the attention weights represent the importance given to the decoder's input based on the encoder's output. In other words, the decoder predicts the next token by looking at the encoder output and self-attending to its own output.
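
A small shape sketch of this second attention block; the sizes 50 (target_seq_len), 43 (input_seq_len) and 64 (depth per head, for d_model=512 with 8 heads) are only illustrative values:

import tensorflow as tf

q = tf.random.uniform((64, 8, 50, 64))  # from the decoder: (batch, heads, target_seq_len, depth)
k = tf.random.uniform((64, 8, 43, 64))  # from the encoder: (batch, heads, input_seq_len, depth)
scores = tf.matmul(q, k, transpose_b=True) / tf.math.sqrt(64.0)
attention_weights = tf.nn.softmax(scores, axis=-1)
# attention_weights.shape == (64, 8, 50, 43): for every decoder position,
# a distribution over the encoder output, i.e. how important each input token is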

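The DecoderLayer below reuses the MultiHeadAttention class and the point_wise_feed_forward_network helper from the earlier days of this series. As a reminder, the feed-forward helper (as in the TensorFlow tutorial this series follows) is just two Dense layers, dff hidden units and then back to d_model:

import tensorflow as tf

def point_wise_feed_forward_network(d_model, dff):
  return tf.keras.Sequential([
      tf.keras.layers.Dense(dff, activation='relu'),  # (batch_size, seq_len, dff)
      tf.keras.layers.Dense(d_model)                  # (batch_size, seq_len, d_model)
  ])
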
class DecoderLayer(tf.keras.layers.Layer):
  def __init__(self, d_model, num_heads, dff, rate=0.1):
    super(DecoderLayer, self).__init__()

    self.mha1 = MultiHeadAttention(d_model, num_heads)
    self.mha2 = MultiHeadAttention(d_model, num_heads)

    self.ffn = point_wise_feed_forward_network(d_model, dff)

    self.layernorm1 = tf.keras.layers.LayerNormalization(epsilon=1e-6)
    self.layernorm2 = tf.keras.layers.LayerNormalization(epsilon=1e-6)
    self.layernorm3 = tf.keras.layers.LayerNormalization(epsilon=1e-6)

    self.dropout1 = tf.keras.layers.Dropout(rate)
    self.dropout2 = tf.keras.layers.Dropout(rate)
    self.dropout3 = tf.keras.layers.Dropout(rate)

  def call(self, x, enc_output, training,
           look_ahead_mask, padding_mask):
    # enc_output.shape == (batch_size, input_seq_len, d_model)

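    # Block 1: masked self-attention over the decoder input (look-ahead + padding mask)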
    attn1, attn_weights_block1 = self.mha1(x, x, x, look_ahead_mask)  # (batch_size, target_seq_len, d_model)
    attn1 = self.dropout1(attn1, training=training)
    out1 = self.layernorm1(attn1 + x)

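    # Block 2: attention over the encoder output; Q comes from out1, K and V from enc_output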
    attn2, attn_weights_block2 = self.mha2(
        enc_output, enc_output, out1, padding_mask)  # (batch_size, target_seq_len, d_model)
    attn2 = self.dropout2(attn2, training=training)
    out2 = self.layernorm2(attn2 + out1)  # (batch_size, target_seq_len, d_model)

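    # Block 3: point-wise feed-forward network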
    ffn_output = self.ffn(out2)  # (batch_size, target_seq_len, d_model)
    ffn_output = self.dropout3(ffn_output, training=training)
    out3 = self.layernorm3(ffn_output + out2)  # (batch_size, target_seq_len, d_model)

    return out3, attn_weights_block1, attn_weights_block2

sample_decoder_layer = DecoderLayer(512, 8, 2048)

sample_decoder_layer_output, _, _ = sample_decoder_layer(
    tf.random.uniform((64, 50, 512)), sample_encoder_layer_output,
    False, None, None)

sample_decoder_layer_output.shape  # (batch_size, target_seq_len, d_model)
TensorShape([64, 50, 512])
