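Each decoder layer consists of three sublayers: a masked multi-head self-attention block, a multi-head attention block over the encoder output, and a point-wise feed-forward network.
Each of these sublayers has a residual connection around it, followed by layer normalization.
The output of each sublayer is LayerNorm(x + Sublayer(x)).
The normalization is done over the d_model (last) axis.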
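To make LayerNorm(x + Sublayer(x)) concrete, here is a minimal sketch in plain Python, kept dependency-free so it runs without TensorFlow. The names `layer_norm`, `residual_block`, and `toy_sublayer` are hypothetical and only for illustration, and the sketch leaves out the learned scale and offset that `tf.keras.layers.LayerNormalization` also applies.

```python
import math

def layer_norm(vec, epsilon=1e-6):
    """Normalize one d_model-sized vector to zero mean and unit variance."""
    mean = sum(vec) / len(vec)
    var = sum((v - mean) ** 2 for v in vec) / len(vec)
    return [(v - mean) / math.sqrt(var + epsilon) for v in vec]

def residual_block(x, sublayer):
    """Compute LayerNorm(x + Sublayer(x)) for a (seq_len, d_model) nested list."""
    y = sublayer(x)
    # Residual connection: add the sublayer output back onto its input.
    summed = [[a + b for a, b in zip(row_x, row_y)]
              for row_x, row_y in zip(x, y)]
    # Normalization runs over the last (d_model) axis, i.e. once per position.
    return [layer_norm(row) for row in summed]

def toy_sublayer(x):
    """Hypothetical stand-in for attention or the feed-forward network."""
    return [[2.0 * v for v in row] for row in x]

x = [[1.0, 2.0, 3.0, 4.0],
     [10.0, 20.0, 30.0, 40.0]]  # seq_len = 2, d_model = 4
out = residual_block(x, toy_sublayer)
```

Because each position is normalized on its own, the two rows of `out` come out nearly identical even though the second input row is ten times larger.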
The Transformer stacks N of these decoder layers, matching the N encoder layers.
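Because Q receives the output of the decoder's first attention block while K (along with V) receives the encoder output, the attention weights express how important each position of the encoder output is to each position of the decoder input. In other words, the decoder predicts the next token by looking at the encoder output and self-attending to its own output.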
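The shape of those attention weights is easier to see in a small example. The following is a rough single-head sketch in plain Python, without the learned projections or the head splitting that a real MultiHeadAttention layer performs. The function `cross_attention` and the toy inputs are hypothetical, and its arguments are ordered (q, k, v), which is not the order used by the DecoderLayer call.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(q, k, v):
    """Single-head scaled dot-product attention on nested lists.

    q:    (target_seq_len, depth), taken from the decoder
    k, v: (input_seq_len, depth), taken from the encoder
    """
    depth = len(k[0])
    weights = []
    for q_vec in q:
        scores = [sum(a * b for a, b in zip(q_vec, k_vec)) / math.sqrt(depth)
                  for k_vec in k]
        weights.append(softmax(scores))
    # Each output row is a weighted average of the encoder value vectors.
    output = [[sum(w * v_vec[j] for w, v_vec in zip(row, v))
               for j in range(len(v[0]))]
              for row in weights]
    return output, weights

enc_output = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # input_seq_len = 3
dec_block1 = [[4.0, 0.0], [0.0, 4.0]]              # target_seq_len = 2

out, attn = cross_attention(dec_block1, enc_output, enc_output)
```

Here `attn` has one row per decoder position and one column per encoder position, so each row sums to 1 and says how much that decoder position draws from each encoder position.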
class DecoderLayer(tf.keras.layers.Layer):
    def __init__(self, d_model, num_heads, dff, rate=0.1):
        super(DecoderLayer, self).__init__()

        # mha1: masked self-attention; mha2: attention over the encoder output.
        self.mha1 = MultiHeadAttention(d_model, num_heads)
        self.mha2 = MultiHeadAttention(d_model, num_heads)

        self.ffn = point_wise_feed_forward_network(d_model, dff)

        # One LayerNormalization and one Dropout per sublayer.
        self.layernorm1 = tf.keras.layers.LayerNormalization(epsilon=1e-6)
        self.layernorm2 = tf.keras.layers.LayerNormalization(epsilon=1e-6)
        self.layernorm3 = tf.keras.layers.LayerNormalization(epsilon=1e-6)

        self.dropout1 = tf.keras.layers.Dropout(rate)
        self.dropout2 = tf.keras.layers.Dropout(rate)
        self.dropout3 = tf.keras.layers.Dropout(rate)

    def call(self, x, enc_output, training,
             look_ahead_mask, padding_mask):
        # enc_output.shape == (batch_size, input_seq_len, d_model)

        # Sublayer 1: masked self-attention over the decoder input.
        attn1, attn_weights_block1 = self.mha1(x, x, x, look_ahead_mask)  # (batch_size, target_seq_len, d_model)
        attn1 = self.dropout1(attn1, training=training)
        out1 = self.layernorm1(attn1 + x)

        # Sublayer 2: out1 is the query; the encoder output supplies keys and values.
        attn2, attn_weights_block2 = self.mha2(
            enc_output, enc_output, out1, padding_mask)  # (batch_size, target_seq_len, d_model)
        attn2 = self.dropout2(attn2, training=training)
        out2 = self.layernorm2(attn2 + out1)  # (batch_size, target_seq_len, d_model)

        # Sublayer 3: point-wise feed-forward network.
        ffn_output = self.ffn(out2)  # (batch_size, target_seq_len, d_model)
        ffn_output = self.dropout3(ffn_output, training=training)
        out3 = self.layernorm3(ffn_output + out2)  # (batch_size, target_seq_len, d_model)

        return out3, attn_weights_block1, attn_weights_block2
sample_decoder_layer = DecoderLayer(512, 8, 2048)

sample_decoder_layer_output, _, _ = sample_decoder_layer(
    tf.random.uniform((64, 50, 512)), sample_encoder_layer_output,
    False, None, None)

sample_decoder_layer_output.shape  # (batch_size, target_seq_len, d_model)
TensorShape([64, 50, 512])
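The sample call above passes None for both masks. As a rough illustration of what a look-ahead mask could look like, the plain-Python sketch below builds one and applies it to a row of attention scores. It assumes the convention that 1 marks a position to hide and that masked scores are pushed toward negative infinity before the softmax. The names `create_look_ahead_mask` and `masked_softmax` are hypothetical here and not taken from the code above.

```python
import math

def create_look_ahead_mask(size):
    """Return a (size, size) mask; 1.0 marks a future position to hide (assumed convention)."""
    return [[1.0 if col > row else 0.0 for col in range(size)]
            for row in range(size)]

def masked_softmax(scores, mask_row):
    """Softmax over scores after adding a large negative value at masked positions."""
    shifted = [s + m * -1e9 for s, m in zip(scores, mask_row)]
    peak = max(shifted)
    exps = [math.exp(s - peak) for s in shifted]
    total = sum(exps)
    return [e / total for e in exps]

mask = create_look_ahead_mask(3)
# [[0.0, 1.0, 1.0],
#  [0.0, 0.0, 1.0],
#  [0.0, 0.0, 0.0]]

# With equal raw scores, each position can only spread its attention
# over itself and the positions before it.
weights = [masked_softmax([1.0, 1.0, 1.0], row) for row in mask]
```

The first row of `weights` puts all of its attention on position 0, which is what keeps the decoder from seeing tokens it has not yet predicted.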