Weight Standardization

This is a PyTorch implementation of Weight Standardization from the paper Micro-Batch Training with Batch-Channel Normalization and Weight Standardization. We also have an annotated implementation of Batch-Channel Normalization.

Batch normalization gives a smooth loss landscape and avoids elimination singularities. Elimination singularities are nodes of the network that become useless (e.g. a ReLU that gives 0 all the time).

However, batch normalization doesn't work well when the batch size is too small, which happens when training large networks because of device memory limitations. The paper introduces Weight Standardization with Batch-Channel Normalization as a better alternative.

Weight Standardization:

1. Normalizes the gradients
2. Smooths the landscape (reduced Lipschitz constant)
3. Avoids elimination singularities

The Lipschitz constant is the maximum slope a function has between any two points. That is, the Lipschitz constant $L$ is the smallest value that satisfies $\forall a, b \in A : ∥ f (a) - f (b)∥ \leq L ∥ a - b ∥$, where $f : A \to R^{m}$ and $A \subseteq R^{n}$.
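To make the definition concrete, here is a minimal sketch that estimates a lower bound on the Lipschitz constant of a scalar function by measuring slopes between neighbouring grid points. The helper name `lipschitz_lower_bound` and the test function $\sin(3x)$ are illustrative assumptions and do not come from the paper.

```python
import math


def lipschitz_lower_bound(f, lo: float, hi: float, n: int = 10_000) -> float:
    """Largest slope |f(a) - f(b)| / |a - b| over adjacent grid points in [lo, hi].

    This is only a lower bound on the true Lipschitz constant, because it
    checks a finite set of point pairs rather than all of them.
    """
    step = (hi - lo) / n
    best = 0.0
    for i in range(n):
        a = lo + i * step
        b = a + step
        best = max(best, abs(f(a) - f(b)) / abs(a - b))
    return best


# The derivative of sin(3x) is 3 cos(3x), so the true Lipschitz constant is 3
estimate = lipschitz_lower_bound(lambda x: math.sin(3 * x), -2.0, 2.0)
print(estimate)
```

The estimate approaches 3 from below as the grid gets finer. A smaller Lipschitz constant for the loss means its value cannot change as abruptly between nearby points, which is the sense in which the landscape is smoother.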

Elimination singularities are avoided because weight standardization keeps the statistics of the outputs similar to those of the inputs. So as long as the inputs are normally distributed, the outputs remain close to normal. This prevents the outputs of nodes from always falling outside the active range of the activation function (e.g. always negative input for a ReLU).

Refer to the paper for proofs.

Here is the training code for a VGG network that uses weight standardization to classify CIFAR-10 data. It uses a 2D-Convolution Layer with Weight Standardization.
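As a rough illustration of what such a layer could look like, the sketch below subclasses `nn.Conv2d` and standardizes the kernel on every forward pass. The names `WSConv2d` and `standardize` and the default `eps=1e-5` are assumptions made for this example; this is not necessarily how the linked layer is written.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def standardize(weight: torch.Tensor, eps: float) -> torch.Tensor:
    # Flatten each output channel's weights into one row, then standardize per row
    c_out = weight.shape[0]
    flat = weight.view(c_out, -1)
    var, mean = torch.var_mean(flat, dim=1, keepdim=True)
    return ((flat - mean) / torch.sqrt(var + eps)).view_as(weight)


class WSConv2d(nn.Conv2d):
    """Hypothetical Conv2d whose kernel is standardized on every forward pass."""

    def __init__(self, *args, eps: float = 1e-5, **kwargs):
        super().__init__(*args, **kwargs)
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # The stored parameter is left untouched; only the kernel used for this
        # convolution is standardized, so gradients flow through the standardization
        weight = standardize(self.weight, self.eps)
        return F.conv2d(x, weight, self.bias, self.stride,
                        self.padding, self.dilation, self.groups)


# Assumed toy input: a batch of two 3-channel 32x32 images
layer = WSConv2d(3, 8, kernel_size=3, padding=1)
out = layer(torch.randn(2, 3, 32, 32))
print(out.shape)
```

Standardizing inside `forward` rather than overwriting the parameter means the optimizer still updates the raw weights, while the network only ever sees their standardized version.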

import torch

Weight Standardization

$\hat{W}_{i, j} = \frac{W _{i, j} - μ _{W_{i, \cdot}}}{σ _{W_{i, \cdot}}}$

where,

$\begin{aligned} W &\in R^{O \times I} \\ μ_{W_{i, \cdot}} &= \frac{1}{I} \sum_{j=1}^{I} W_{i, j} \\ σ_{W_{i, \cdot}} &= \sqrt{\frac{1}{I} \sum_{j=1}^{I} W_{i, j}^{2} - μ_{W_{i, \cdot}}^{2} + ϵ} \end{aligned}$

For a 2D-convolution layer, $O$ is the number of output channels ($O = C_{out}$) and $I$ is the number of input channels times the kernel size ($I = C_{in} \times k_{H} \times k_{W}$).

def weight_standardization(weight: torch.Tensor, eps: float):

Get $C_{out}$, $C_{in}$ and kernel shape

    c_out, c_in, *kernel_shape = weight.shape

Reshape $W$ to $O \times I$

    weight = weight.view(c_out, -1)

Calculate

$\begin{aligned} μ_{W_{i, \cdot}} &= \frac{1}{I} \sum_{j=1}^{I} W_{i, j} \\ σ_{W_{i, \cdot}}^{2} &= \frac{1}{I} \sum_{j=1}^{I} W_{i, j}^{2} - μ_{W_{i, \cdot}}^{2} \end{aligned}$

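    # Note: torch.var_mean applies Bessel's correction by default, so this
    # divides by I - 1 rather than the I used in the formula
    var, mean = torch.var_mean(weight, dim=1, keepdim=True)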

Normalize $\hat{W}_{i, j} = \frac{W _{i, j} - μ _{W_{i, \cdot}}}{σ _{W_{i, \cdot}}}$

    weight = (weight - mean) / (torch.sqrt(var + eps))

Change back to original shape and return

    return weight.view(c_out, c_in, *kernel_shape)
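A quick way to sanity-check the function is to apply it to a random kernel and inspect the per-output-channel statistics. The sketch below repeats the function so that it runs on its own; the tensor shape, the shift and scale of the random weights, and `eps=1e-5` are arbitrary choices made for illustration.

```python
import torch


def weight_standardization(weight: torch.Tensor, eps: float):
    # Same steps as the function above, repeated so this snippet runs on its own
    c_out, c_in, *kernel_shape = weight.shape
    weight = weight.view(c_out, -1)
    var, mean = torch.var_mean(weight, dim=1, keepdim=True)
    weight = (weight - mean) / (torch.sqrt(var + eps))
    return weight.view(c_out, c_in, *kernel_shape)


torch.manual_seed(0)
# Assumed toy kernel: 4 output channels, 3 input channels, 3x3 kernel,
# deliberately shifted and scaled away from zero mean and unit variance
w = 5.0 + 2.0 * torch.randn(4, 3, 3, 3)
w_hat = weight_standardization(w, eps=1e-5)

# One row per output channel, with I = 3 * 3 * 3 = 27 entries each
rows = w_hat.view(4, -1)
row_mean = rows.mean(dim=1)
row_var = rows.var(dim=1)
print(row_mean, row_var)
```

Each row should come out with a mean close to 0 and a variance close to 1, while the returned tensor keeps the original kernel shape.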