Yesterday we looked at L1 and L2 regularizers, but regularizers also have an interesting relationship with Batch Normalization. The article L2 Regularization and Batch Norm describes the curious things that happen when BN and regularizers are used at the same time.
First, recall that applying a regularizer multiplies the weights by a decay factor slightly smaller than 1, nudging every weight update toward 0. We will call this factor, (1 − αc), λ, where α is the learning rate and c the regularization coefficient.
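Where this factor comes from: writing the L2 penalty as (c/2)‖w‖² (standard notation, not taken from the referenced article), the SGD update becomes

```latex
w \leftarrow w - \alpha\left(\frac{\partial L}{\partial w} + c\,w\right)
  = (1 - \alpha c)\,w - \alpha\frac{\partial L}{\partial w}
```

so every step first shrinks the weight by the factor (1 − αc) before applying the loss gradient.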
On the other hand, the output of a Conv or Dense layer is usually fed through a BN layer, which computes the mean and variance over the current batch and uses them to normalize the output.
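The normalization in question is the standard BN transform (μ and σ² are the batch mean and variance, ε a small constant, γ and β the learned scale and shift):

```latex
\hat{x} = \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}}, \qquad y = \gamma\,\hat{x} + \beta
```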
Now, scaling the weights by λ also scales the layer's pre-BN output by λ, which in turn scales BN's batch mean by λ and its standard deviation by λ (the variance by λ²). Plugging this back into the BN formula, the λ in the numerator and the λ in the denominator cancel exactly, so after the BN layer the regularizer has no effect on the output at all.
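This cancellation is easy to check numerically. A minimal sketch in plain NumPy (ignoring BN's γ/β, which are unaffected by the weight scaling):

```python
import numpy as np

def batch_norm(x, eps=1e-5):
    # Normalize over the batch dimension, as BN does at training time
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    return (x - mean) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
x = rng.normal(size=(32, 8))  # a batch of pre-activations
lam = 0.9                     # weight-decay scaling factor (< 1)

out = batch_norm(x)
out_scaled = batch_norm(lam * x)  # scaling weights scales the pre-activations

# The two outputs agree up to the tiny epsilon term in the denominator
print(np.max(np.abs(out - out_scaled)))
```

Up to the ε term, the normalized outputs are identical, which is exactly the cancellation described above.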
So if that is the case, what do we still need regularizers for? Even though BN cancels their effect on the activations, they can still constrain the magnitude of the weights themselves, which actually gives regularizers a new role!
With that in mind, let's train each configuration and observe:
Experiment 1: BN only
def bottleneck(net, filters, out_ch, strides, shortcut=True, zero_pad=False):
    padding = 'valid' if zero_pad else 'same'
    shortcut_net = net
    net = tf.keras.layers.Conv2D(filters * 6, 1, use_bias=False, padding='same')(net)
    net = tf.keras.layers.BatchNormalization()(net)
    net = tf.keras.layers.ReLU(max_value=6)(net)
    if zero_pad:
        net = tf.keras.layers.ZeroPadding2D(padding=((0, 1), (0, 1)))(net)
    net = tf.keras.layers.DepthwiseConv2D(3, strides=strides, use_bias=False, padding=padding)(net)
    net = tf.keras.layers.BatchNormalization()(net)
    net = tf.keras.layers.ReLU(max_value=6)(net)
    net = tf.keras.layers.Conv2D(out_ch, 1, use_bias=False, padding='same')(net)
    net = tf.keras.layers.BatchNormalization()(net)
    if shortcut:
        net = tf.keras.layers.Add()([net, shortcut_net])
    return net
def get_mobilenetV2_bn(shape):
    input_node = tf.keras.layers.Input(shape=shape)
    net = tf.keras.layers.Conv2D(32, 3, (2, 2), use_bias=False, padding='same')(input_node)
    net = tf.keras.layers.BatchNormalization()(net)
    net = tf.keras.layers.ReLU(max_value=6)(net)
    net = tf.keras.layers.DepthwiseConv2D(3, use_bias=False, padding='same')(net)
    net = tf.keras.layers.BatchNormalization()(net)
    net = tf.keras.layers.ReLU(max_value=6)(net)
    net = tf.keras.layers.Conv2D(16, 1, use_bias=False, padding='same')(net)
    net = tf.keras.layers.BatchNormalization()(net)
    net = bottleneck(net, 16, 24, (2, 2), shortcut=False, zero_pad=True)  # block_1
    net = bottleneck(net, 24, 24, (1, 1), shortcut=True)  # block_2
    net = bottleneck(net, 24, 32, (2, 2), shortcut=False, zero_pad=True)  # block_3
    net = bottleneck(net, 32, 32, (1, 1), shortcut=True)  # block_4
    net = bottleneck(net, 32, 32, (1, 1), shortcut=True)  # block_5
    net = bottleneck(net, 32, 64, (2, 2), shortcut=False, zero_pad=True)  # block_6
    net = bottleneck(net, 64, 64, (1, 1), shortcut=True)  # block_7
    net = bottleneck(net, 64, 64, (1, 1), shortcut=True)  # block_8
    net = bottleneck(net, 64, 64, (1, 1), shortcut=True)  # block_9
    net = bottleneck(net, 64, 96, (1, 1), shortcut=False)  # block_10
    net = bottleneck(net, 96, 96, (1, 1), shortcut=True)  # block_11
    net = bottleneck(net, 96, 96, (1, 1), shortcut=True)  # block_12
    net = bottleneck(net, 96, 160, (2, 2), shortcut=False, zero_pad=True)  # block_13
    net = bottleneck(net, 160, 160, (1, 1), shortcut=True)  # block_14
    net = bottleneck(net, 160, 160, (1, 1), shortcut=True)  # block_15
    net = bottleneck(net, 160, 320, (1, 1), shortcut=False)  # block_16
    net = tf.keras.layers.Conv2D(1280, 1, use_bias=False, padding='same')(net)
    net = tf.keras.layers.BatchNormalization()(net)
    net = tf.keras.layers.ReLU(max_value=6)(net)
    return input_node, net
input_node, net = get_mobilenetV2_bn((224, 224, 3))
net = tf.keras.layers.GlobalAveragePooling2D()(net)
net = tf.keras.layers.Dense(NUM_OF_CLASS)(net)
model = tf.keras.Model(inputs=[input_node], outputs=[net])
model.compile(
    optimizer=tf.keras.optimizers.SGD(LR),
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=[tf.keras.metrics.SparseCategoricalAccuracy()],
)
history = model.fit(
    ds_train,
    epochs=EPOCHS,
    validation_data=ds_test,
    verbose=True,
    callbacks=[PrintWeightsCallback()])
Epoch 1/10
conv2d_70 layer: [-0.02620085 0.10190549 -0.11347758 -0.120565 0.10449789 0.09324147 -0.02767587 0.03863515 0.12804998 0.10915002]
(omitted)
Epoch 10/10
conv2d_70 layer: [-0.17744583 0.17197324 -0.25094765 -0.4260909 -0.2861711 0.12653658 -0.18487974 0.10149723 -0.08124655 0.30025354]
The model's accuracy improves, and the weight magnitudes mostly stay on the order of 0.x.
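The PrintWeightsCallback used throughout these experiments is not defined in the post. A minimal, hypothetical sketch of what it might look like, assuming it prints the first ten kernel weights of the last Conv2D layer at the end of every epoch:

```python
import tensorflow as tf

class PrintWeightsCallback(tf.keras.callbacks.Callback):
    # Hypothetical sketch: the original callback is not shown in the post.
    def on_epoch_end(self, epoch, logs=None):
        # Grab the last Conv2D layer and print its first ten kernel weights
        conv_layers = [layer for layer in self.model.layers
                       if isinstance(layer, tf.keras.layers.Conv2D)]
        if conv_layers:
            layer = conv_layers[-1]
            kernel = layer.get_weights()[0].flatten()
            print(f'{layer.name} layer: {kernel[:10]}')
```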
Experiment 2: L2 regularizers only
def bottleneck(net, filters, out_ch, strides, regularizer, shortcut=True, zero_pad=False):
    # All conv layers here use use_bias=False, so only the kernel/depthwise
    # regularizers have any effect (a bias_regularizer would be a no-op).
    padding = 'valid' if zero_pad else 'same'
    shortcut_net = net
    net = tf.keras.layers.Conv2D(filters * 6, 1, use_bias=False, padding='same', kernel_regularizer=regularizer)(net)
    net = tf.keras.layers.ReLU(max_value=6)(net)
    if zero_pad:
        net = tf.keras.layers.ZeroPadding2D(padding=((0, 1), (0, 1)))(net)
    net = tf.keras.layers.DepthwiseConv2D(3, strides=strides, use_bias=False, padding=padding, depthwise_regularizer=regularizer)(net)
    net = tf.keras.layers.ReLU(max_value=6)(net)
    net = tf.keras.layers.Conv2D(out_ch, 1, use_bias=False, padding='same', kernel_regularizer=regularizer)(net)
    if shortcut:
        net = tf.keras.layers.Add()([net, shortcut_net])
    return net
def get_mobilenetV2_l2(shape, regularizer):
    input_node = tf.keras.layers.Input(shape=shape)
    net = tf.keras.layers.Conv2D(32, 3, (2, 2), use_bias=False, padding='same', kernel_regularizer=regularizer)(input_node)
    net = tf.keras.layers.ReLU(max_value=6)(net)
    net = tf.keras.layers.DepthwiseConv2D(3, use_bias=False, padding='same', depthwise_regularizer=regularizer)(net)
    net = tf.keras.layers.ReLU(max_value=6)(net)
    net = tf.keras.layers.Conv2D(16, 1, use_bias=False, padding='same', kernel_regularizer=regularizer)(net)
    net = bottleneck(net, 16, 24, (2, 2), regularizer, shortcut=False, zero_pad=True)  # block_1
    net = bottleneck(net, 24, 24, (1, 1), regularizer, shortcut=True)  # block_2
    net = bottleneck(net, 24, 32, (2, 2), regularizer, shortcut=False, zero_pad=True)  # block_3
    net = bottleneck(net, 32, 32, (1, 1), regularizer, shortcut=True)  # block_4
    net = bottleneck(net, 32, 32, (1, 1), regularizer, shortcut=True)  # block_5
    net = bottleneck(net, 32, 64, (2, 2), regularizer, shortcut=False, zero_pad=True)  # block_6
    net = bottleneck(net, 64, 64, (1, 1), regularizer, shortcut=True)  # block_7
    net = bottleneck(net, 64, 64, (1, 1), regularizer, shortcut=True)  # block_8
    net = bottleneck(net, 64, 64, (1, 1), regularizer, shortcut=True)  # block_9
    net = bottleneck(net, 64, 96, (1, 1), regularizer, shortcut=False)  # block_10
    net = bottleneck(net, 96, 96, (1, 1), regularizer, shortcut=True)  # block_11
    net = bottleneck(net, 96, 96, (1, 1), regularizer, shortcut=True)  # block_12
    net = bottleneck(net, 96, 160, (2, 2), regularizer, shortcut=False, zero_pad=True)  # block_13
    net = bottleneck(net, 160, 160, (1, 1), regularizer, shortcut=True)  # block_14
    net = bottleneck(net, 160, 160, (1, 1), regularizer, shortcut=True)  # block_15
    net = bottleneck(net, 160, 320, (1, 1), regularizer, shortcut=False)  # block_16
    net = tf.keras.layers.Conv2D(1280, 1, use_bias=False, padding='same', kernel_regularizer=regularizer)(net)
    net = tf.keras.layers.ReLU(max_value=6)(net)
    return input_node, net
REGULARIZER = tf.keras.regularizers.l2(0.001)
input_node, net = get_mobilenetV2_l2((224, 224, 3), REGULARIZER)
net = tf.keras.layers.GlobalAveragePooling2D()(net)
net = tf.keras.layers.Dense(NUM_OF_CLASS, kernel_regularizer=REGULARIZER, bias_regularizer=REGULARIZER)(net)
model = tf.keras.Model(inputs=[input_node], outputs=[net])
model.compile(
    optimizer=tf.keras.optimizers.SGD(LR),
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=[tf.keras.metrics.SparseCategoricalAccuracy()],
)
history = model.fit(
    ds_train,
    epochs=EPOCHS,
    validation_data=ds_test,
    verbose=True,
    callbacks=[PrintWeightsCallback()])
Output:
Epoch 1/10
conv2d_140 layer: [ 0.13498078 -0.12406565 0.05617869 0.1112484 0.0101945 -0.09676153 0.05650736 0.03752249 -0.01030101 0.08091953]
(omitted)
Epoch 10/10
conv2d_140 layer: [ 0.00809666 -0.00744193 0.00336981 0.00667309 0.0006115 -0.00580412 0.00338953 0.00225074 -0.00061789 0.00485385]
Accuracy does not improve, but the weights clearly feel the regularizer: by the end their magnitudes sit around 0.00x.
Experiment 3: BN and L2 regularizers together
def bottleneck(net, filters, out_ch, strides, regularizer, shortcut=True, zero_pad=False):
    # As in experiment 2, use_bias=False means only the kernel/depthwise
    # regularizers apply (a bias_regularizer would be a no-op).
    padding = 'valid' if zero_pad else 'same'
    shortcut_net = net
    net = tf.keras.layers.Conv2D(filters * 6, 1, use_bias=False, padding='same', kernel_regularizer=regularizer)(net)
    net = tf.keras.layers.BatchNormalization()(net)
    net = tf.keras.layers.ReLU(max_value=6)(net)
    if zero_pad:
        net = tf.keras.layers.ZeroPadding2D(padding=((0, 1), (0, 1)))(net)
    net = tf.keras.layers.DepthwiseConv2D(3, strides=strides, use_bias=False, padding=padding, depthwise_regularizer=regularizer)(net)
    net = tf.keras.layers.BatchNormalization()(net)
    net = tf.keras.layers.ReLU(max_value=6)(net)
    net = tf.keras.layers.Conv2D(out_ch, 1, use_bias=False, padding='same', kernel_regularizer=regularizer)(net)
    net = tf.keras.layers.BatchNormalization()(net)
    if shortcut:
        net = tf.keras.layers.Add()([net, shortcut_net])
    return net
def get_mobilenetV2_bn_l2(shape, regularizer):
    input_node = tf.keras.layers.Input(shape=shape)
    net = tf.keras.layers.Conv2D(32, 3, (2, 2), use_bias=False, padding='same', kernel_regularizer=regularizer)(input_node)
    net = tf.keras.layers.BatchNormalization()(net)
    net = tf.keras.layers.ReLU(max_value=6)(net)
    net = tf.keras.layers.DepthwiseConv2D(3, use_bias=False, padding='same', depthwise_regularizer=regularizer)(net)
    net = tf.keras.layers.BatchNormalization()(net)
    net = tf.keras.layers.ReLU(max_value=6)(net)
    net = tf.keras.layers.Conv2D(16, 1, use_bias=False, padding='same', kernel_regularizer=regularizer)(net)
    net = tf.keras.layers.BatchNormalization()(net)
    net = bottleneck(net, 16, 24, (2, 2), regularizer, shortcut=False, zero_pad=True)  # block_1
    net = bottleneck(net, 24, 24, (1, 1), regularizer, shortcut=True)  # block_2
    net = bottleneck(net, 24, 32, (2, 2), regularizer, shortcut=False, zero_pad=True)  # block_3
    net = bottleneck(net, 32, 32, (1, 1), regularizer, shortcut=True)  # block_4
    net = bottleneck(net, 32, 32, (1, 1), regularizer, shortcut=True)  # block_5
    net = bottleneck(net, 32, 64, (2, 2), regularizer, shortcut=False, zero_pad=True)  # block_6
    net = bottleneck(net, 64, 64, (1, 1), regularizer, shortcut=True)  # block_7
    net = bottleneck(net, 64, 64, (1, 1), regularizer, shortcut=True)  # block_8
    net = bottleneck(net, 64, 64, (1, 1), regularizer, shortcut=True)  # block_9
    net = bottleneck(net, 64, 96, (1, 1), regularizer, shortcut=False)  # block_10
    net = bottleneck(net, 96, 96, (1, 1), regularizer, shortcut=True)  # block_11
    net = bottleneck(net, 96, 96, (1, 1), regularizer, shortcut=True)  # block_12
    net = bottleneck(net, 96, 160, (2, 2), regularizer, shortcut=False, zero_pad=True)  # block_13
    net = bottleneck(net, 160, 160, (1, 1), regularizer, shortcut=True)  # block_14
    net = bottleneck(net, 160, 160, (1, 1), regularizer, shortcut=True)  # block_15
    net = bottleneck(net, 160, 320, (1, 1), regularizer, shortcut=False)  # block_16
    net = tf.keras.layers.Conv2D(1280, 1, use_bias=False, padding='same', kernel_regularizer=regularizer)(net)
    net = tf.keras.layers.BatchNormalization()(net)
    net = tf.keras.layers.ReLU(max_value=6)(net)
    return input_node, net
REGULARIZER = tf.keras.regularizers.l2(0.001)
input_node, net = get_mobilenetV2_bn_l2((224, 224, 3), REGULARIZER)
net = tf.keras.layers.GlobalAveragePooling2D()(net)
net = tf.keras.layers.Dense(NUM_OF_CLASS, kernel_regularizer=REGULARIZER, bias_regularizer=REGULARIZER)(net)
model = tf.keras.Model(inputs=[input_node], outputs=[net])
model.compile(
    optimizer=tf.keras.optimizers.SGD(LR),
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=[tf.keras.metrics.SparseCategoricalAccuracy()],
)
history = model.fit(
    ds_train,
    epochs=EPOCHS,
    validation_data=ds_test,
    verbose=True,
    callbacks=[PrintWeightsCallback()])
Output:
Epoch 1/10
conv2d layer: [ 0.08555499 0.12598534 -0.03244241 0.11502795 0.05982415 0.11824064 0.13084657 0.08825392 0.02584527 -0.0615561 ]
(omitted)
Epoch 10/10
conv2d layer: [-0.12869535 -0.09060557 0.01533028 -0.05165869 -0.00263854 -0.06649228 -0.02129784 0.09801372 0.12735689 0.0331211 ]
Accuracy improves, and the weights do look more suppressed than in experiment 1: the largest magnitude reaches about 0.12, but most are around 0.0x.
As I mentioned yesterday, in practice I rarely apply regularizers to my training tasks. Experiment 3 shows that regularizers do keep the weights small, but the weight values in experiment 1, with BN alone, are still within an acceptable range.
ref:
https://blog.janestreet.com/l2-regularization-and-batch-norm/