Group Normalization

This is a PyTorch implementation of the Group Normalization paper.

Batch Normalization works well for large enough batch sizes but not for small batch sizes, because it normalizes over the batch dimension. Training large models with large batch sizes is often not possible due to the memory capacity of the devices, which forces the per-device batch size to be small.

This paper introduces Group Normalization, which normalizes a set of features together as a group. This is based on the observation that classical features such as SIFT and HOG are group-wise features. The paper proposes dividing feature channels into groups and then separately normalizing all channels within each group.
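PyTorch also ships a built-in nn.GroupNorm layer. As a minimal sketch of the idea (the channel and group counts below are illustrative, not taken from the paper), it can be used where a batch-size-dependent nn.BatchNorm2d would otherwise go:

import torch
from torch import nn

# Illustrative channel and group counts (assumed for this sketch)
conv = nn.Conv2d(3, 64, kernel_size=3, padding=1)
norm = nn.GroupNorm(num_groups=8, num_channels=64)  # statistics do not depend on the batch size

x = torch.randn(2, 3, 32, 32)  # even a batch of 2 works fine
y = norm(conv(x))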

Formulation

All normalization layers can be defined by the following computation:

$$\hat{x}_i = \frac{1}{\sigma_i} \left( x_i - \mu_i \right)$$

where $x$ is the tensor representing the batch and $i$ is the index of a single value. For instance, for 2D images, $i = (i_N, i_C, i_H, i_W)$ is a 4-d vector indexing the image within the batch, the feature channel, and the vertical and horizontal coordinates. $\mu_i$ and $\sigma_i$ are the mean and standard deviation,

$$\mu_i = \frac{1}{m} \sum_{k \in \mathcal{S}_i} x_k \qquad \sigma_i = \sqrt{\frac{1}{m} \sum_{k \in \mathcal{S}_i} (x_k - \mu_i)^2 + \epsilon}$$

$\mathcal{S}_i$ is the set of indexes across which the mean and standard deviation are calculated for index $i$. $m$ is the size of the set $\mathcal{S}_i$ which is the same for all $i$.
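As a tiny worked example of the formula above (the numbers are made up), normalize each row of a small matrix, taking $\mathcal{S}_i$ to be all elements in the same row, so $m = 4$:

import torch

x = torch.tensor([[1., 2., 3., 4.],
                  [2., 4., 6., 8.]])
mu = x.mean(dim=-1, keepdim=True)                 # per-row mean
var = ((x - mu) ** 2).mean(dim=-1, keepdim=True)  # per-row variance, 1/m sum (x_k - mu)^2
sigma = torch.sqrt(var + 1e-5)
x_hat = (x - mu) / sigma  # each row now has roughly zero mean and unit variance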

The definition of $\mathcal{S}_i$ is different for Batch normalization, Layer normalization, and Instance normalization.

Batch Normalization

$$\mathcal{S}_i = \{k \mid k_C = i_C\}$$

The values that share the same feature channel are normalized together.

Layer Normalization

$$\mathcal{S}_i = \{k \mid k_N = i_N\}$$

The values from the same sample in the batch are normalized together.

Instance Normalization

$$\mathcal{S}_i = \{k \mid k_N = i_N, k_C = i_C\}$$

The values from the same sample and the same feature channel are normalized together.
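For a 4-d $[N, C, H, W]$ tensor, these three index sets amount to reducing over different dimensions. Here is a minimal sketch of the statistics each one computes (the tensor and variable names are only for illustration):

import torch

x = torch.randn(2, 6, 4, 4)  # [N, C, H, W]

# Batch norm: reduce over batch and spatial dimensions -> one mean per channel
bn_mean = x.mean(dim=(0, 2, 3))  # shape [C]
# Layer norm: reduce over channel and spatial dimensions -> one mean per sample
ln_mean = x.mean(dim=(1, 2, 3))  # shape [N]
# Instance norm: reduce over spatial dimensions only -> one mean per (sample, channel) pair
in_mean = x.mean(dim=(2, 3))  # shape [N, C]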

Group Normalization

$$\mathcal{S}_i = \left\{ k \;\middle|\; k_N = i_N, \left\lfloor \frac{k_C}{C/G} \right\rfloor = \left\lfloor \frac{i_C}{C/G} \right\rfloor \right\}$$

where $G$ is the number of groups and $C$ is the number of channels.

Group normalization normalizes values of the same sample and the same group of channels together.
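Two limiting cases follow from this definition: with $G = 1$ (a single group spanning all channels) group normalization computes the same statistics as layer normalization, and with $G = C$ (one channel per group) it matches instance normalization. A small sketch checking this with PyTorch's built-in layers (the tensor shape is arbitrary):

import torch
from torch import nn

x = torch.randn(2, 6, 4, 4)

# G = 1: one group spanning all channels behaves like layer normalization
gn_1 = nn.GroupNorm(1, 6, affine=False)
ln = nn.LayerNorm([6, 4, 4], elementwise_affine=False)
assert torch.allclose(gn_1(x), ln(x), atol=1e-5)

# G = C: one channel per group behaves like instance normalization
gn_c = nn.GroupNorm(6, 6, affine=False)
inst = nn.InstanceNorm2d(6)
assert torch.allclose(gn_c(x), inst(x), atol=1e-5)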

Here’s a CIFAR 10 classification model that uses group normalization.


import torch
from torch import nn

from labml_helpers.module import Module

Group Normalization Layer

class GroupNorm(Module):
  • groups is the number of groups the features are divided into
  • channels is the number of features in the input
  • eps is $\epsilon$, used in $\sqrt{Var[x_{(i_N, i_G)}] + \epsilon}$ for numerical stability
  • affine is whether to scale and shift the normalized value
    def __init__(self, groups: int, channels: int, *,
                 eps: float = 1e-5, affine: bool = True):
        super().__init__()

        assert channels % groups == 0, "Number of channels should be evenly divisible by the number of groups"
        self.groups = groups
        self.channels = channels

        self.eps = eps
        self.affine = affine

Create parameters for $\gamma$ and $\beta$ for scale and shift

        if self.affine:
            self.scale = nn.Parameter(torch.ones(channels))
            self.shift = nn.Parameter(torch.zeros(channels))

x is a tensor of shape [batch_size, channels, *]. * denotes any number of (possibly 0) dimensions. For example, in an image (2D) convolution this will be [batch_size, channels, height, width]

    def forward(self, x: torch.Tensor):

Keep the original shape

        x_shape = x.shape

Get the batch size

        batch_size = x_shape[0]

Sanity check to make sure the number of features matches the number of channels the layer was created with

        assert self.channels == x.shape[1]

Reshape into [batch_size, groups, n]

        x = x.view(batch_size, self.groups, -1)

Calculate the mean across the last dimension; i.e. the means for each sample and channel group $\mathbb{E}[x_{(i_N, i_G)}]$

        mean = x.mean(dim=[-1], keepdim=True)

Calculate the mean of the squares across the last dimension; i.e. the means $\mathbb{E}[x^2_{(i_N, i_G)}]$ for each sample and channel group

        mean_x2 = (x ** 2).mean(dim=[-1], keepdim=True)

Variance for each sample and feature group $Var[x_{(i_N, i_G)}] = \mathbb{E}[x^2_{(i_N, i_G)}] - \mathbb{E}[x_{(i_N, i_G)}]^2$

        var = mean_x2 - mean ** 2

Normalize

        x_norm = (x - mean) / torch.sqrt(var + self.eps)

Scale and shift channel-wise

        if self.affine:
            x_norm = x_norm.view(batch_size, self.channels, -1)
            x_norm = self.scale.view(1, -1, 1) * x_norm + self.shift.view(1, -1, 1)

Reshape to original and return

        return x_norm.view(x_shape)

Simple test

def _test():
    from labml.logger import inspect

    x = torch.zeros([2, 6, 2, 4])
    inspect(x.shape)
    gn = GroupNorm(2, 6)

    x = gn(x)
    inspect(x.shape)
if __name__ == '__main__':
    _test()
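For a stronger check than the shape test above, the layer can also be compared numerically against PyTorch's built-in nn.GroupNorm. This snippet is separate from the original file and assumes the GroupNorm class defined above is in scope; with the default initialization (scale of ones, shift of zeros) the two layers should agree for both image-shaped and sequence-shaped inputs:

import torch
from torch import nn

gn = GroupNorm(2, 6)      # the implementation above
ref = nn.GroupNorm(2, 6)  # PyTorch's built-in layer

for shape in ([3, 6, 8, 8], [3, 6, 16]):  # [N, C, H, W] and [N, C, L]
    x = torch.randn(shape)
    assert torch.allclose(gn(x), ref(x), atol=1e-5)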