This is a PyTorch implementation of Batch-Channel Normalization from the paper Micro-Batch Training with Batch-Channel Normalization and Weight Standardization. We also have an annotated implementation of Weight Standardization.
Batch-Channel Normalization performs batch normalization followed by a channel normalization (similar to Group Normalization). When the batch size is small, a running mean and variance are used for batch normalization.
Here is the training code for training a VGG network that uses weight standardization to classify CIFAR-10 data.
import torch
from torch import nn

from labml_helpers.module import Module
from labml_nn.normalization.batch_norm import BatchNorm
This first performs a batch normalization: either normal batch norm, or a batch norm with estimated mean and variance (exponential moving averages of the mean and variance over multiple batches). Then a channel normalization is performed.
class BatchChannelNorm(Module):
channels is the number of features in the input
groups is the number of groups the features are divided into
eps is $\epsilon$, used in $\sqrt{Var[x] + \epsilon}$ for numerical stability
momentum is the momentum in taking the exponential moving average
estimate is whether to use running mean and variance for batch norm
    def __init__(self, channels: int, groups: int,
                 eps: float = 1e-5, momentum: float = 0.1, estimate: bool = True):
        super().__init__()
Use estimated batch norm or normal batch norm.
        if estimate:
            self.batch_norm = EstimatedBatchNorm(channels,
                                                 eps=eps, momentum=momentum)
        else:
            self.batch_norm = BatchNorm(channels,
                                        eps=eps, momentum=momentum)
Channel normalization
        self.channel_norm = ChannelNorm(channels, groups, eps)
    def forward(self, x):
        x = self.batch_norm(x)
        return self.channel_norm(x)
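As a rough illustration of the two-step composition, the sketch below chains PyTorch's built-in nn.BatchNorm2d and nn.GroupNorm. These built-ins are stand-ins chosen for this example, not the modules defined on this page: nn.BatchNorm2d normalizes with the current batch statistics while training, and nn.GroupNorm applies its scale and shift per channel rather than per group. The class name BuiltinBatchChannelNorm is hypothetical.

```python
import torch
from torch import nn


class BuiltinBatchChannelNorm(nn.Module):
    """Hypothetical stand-in: batch norm followed by a group-style channel norm."""

    def __init__(self, channels: int, groups: int,
                 eps: float = 1e-5, momentum: float = 0.1):
        super().__init__()
        # Built-in batch norm (assumption: used here instead of EstimatedBatchNorm)
        self.batch_norm = nn.BatchNorm2d(channels, eps=eps, momentum=momentum)
        # Built-in group norm; its affine parameters are per channel, not per group
        self.channel_norm = nn.GroupNorm(groups, channels, eps=eps)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Normalize across the batch first, then within each channel group
        x = self.batch_norm(x)
        return self.channel_norm(x)


torch.manual_seed(0)
layer = BuiltinBatchChannelNorm(channels=8, groups=2)
x = torch.randn(4, 8, 5, 5)
y = layer(x)

# Mean of every (sample, group) slice of the output
group_means = y.view(4, 2, -1).mean(dim=-1)
```

Because the channel normalization runs last, every (sample, group) slice of the output should have a mean close to zero regardless of what the batch normalization did before it.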
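The input $X \in \mathbb{R}^{B \times C \times H \times W}$ is a batch of image representations, where $B$ is the batch size, $C$ is the number of channels, $H$ is the height and $W$ is the width. The scale and shift parameters are $\gamma \in \mathbb{R}^{C}$ and $\beta \in \mathbb{R}^{C}$.
$$\dot{X}_{\cdot, C, \cdot, \cdot} = \gamma_C \frac{X_{\cdot, C, \cdot, \cdot} - \hat{\mu}_C}{\sqrt{\hat{\sigma}^2_C + \epsilon}} + \beta_C$$
where,
$$\hat{\mu}_C \longleftarrow (1 - r) \hat{\mu}_C + r \frac{1}{B H W} \sum_{b,h,w} X_{b,C,h,w}$$
$$\hat{\sigma}^2_C \longleftarrow (1 - r) \hat{\sigma}^2_C + r \left( \frac{1}{B H W} \sum_{b,h,w} X^2_{b,C,h,w} - \Big( \frac{1}{B H W} \sum_{b,h,w} X_{b,C,h,w} \Big)^2 \right)$$
are the running mean and variance. $r$ is the momentum for calculating the exponential moving average.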
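The update rule can be sketched as two standalone functions. This is a minimal sketch, not the module defined on this page: the function names update_running_stats and estimated_batch_norm are hypothetical, the input is assumed to have shape [B, C, H, W], and the same batch is fed repeatedly only so that the running statistics visibly converge.

```python
import torch


def update_running_stats(x, exp_mean, exp_var, momentum=0.1):
    """One exponential-moving-average update for x of shape [B, C, H, W]."""
    with torch.no_grad():
        # Per-channel batch mean and mean of squares
        mean = x.mean(dim=[0, 2, 3])
        mean_x2 = (x ** 2).mean(dim=[0, 2, 3])
        # Biased per-channel variance: E[x^2] - E[x]^2
        var = mean_x2 - mean ** 2
        new_mean = (1 - momentum) * exp_mean + momentum * mean
        new_var = (1 - momentum) * exp_var + momentum * var
    return new_mean, new_var


def estimated_batch_norm(x, exp_mean, exp_var, gamma, beta, eps=1e-5):
    """Normalize with the running statistics, then scale and shift per channel."""
    x_hat = (x - exp_mean.view(1, -1, 1, 1)) / torch.sqrt(exp_var + eps).view(1, -1, 1, 1)
    return gamma.view(1, -1, 1, 1) * x_hat + beta.view(1, -1, 1, 1)


torch.manual_seed(0)
x = 3.0 + 2.0 * torch.randn(16, 4, 8, 8)
exp_mean, exp_var = torch.zeros(4), torch.ones(4)

# Artificial setup: repeat the same batch so the running statistics converge to it
for _ in range(200):
    exp_mean, exp_var = update_running_stats(x, exp_mean, exp_var)

y = estimated_batch_norm(x, exp_mean, exp_var, torch.ones(4), torch.zeros(4))
```

Once the running statistics have converged to the statistics of the batch, the normalized output has per-channel mean close to 0 and variance close to 1. With a smaller number of updates the output is only approximately normalized, which is the trade-off accepted for stable statistics at small batch sizes.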
class EstimatedBatchNorm(Module):
channels is the number of features in the input
eps is $\epsilon$, used in $\sqrt{Var[x] + \epsilon}$ for numerical stability
momentum is the momentum in taking the exponential moving average
affine is whether to scale and shift the normalized value
    def __init__(self, channels: int,
                 eps: float = 1e-5, momentum: float = 0.1, affine: bool = True):
        super().__init__()

        self.eps = eps
        self.momentum = momentum
        self.affine = affine
        self.channels = channels
Channel-wise transformation parameters
        if self.affine:
            self.scale = nn.Parameter(torch.ones(channels))
            self.shift = nn.Parameter(torch.zeros(channels))
Tensors for the running mean $\hat{\mu}_C$ and running variance $\hat{\sigma}^2_C$
        self.register_buffer('exp_mean', torch.zeros(channels))
        self.register_buffer('exp_var', torch.ones(channels))
x is a tensor of shape [batch_size, channels, *]. * denotes any number of (possibly 0) dimensions. For example, in an image (2D) convolution this will be [batch_size, channels, height, width]
    def forward(self, x: torch.Tensor):
Keep the original shape
        x_shape = x.shape
Get the batch size
        batch_size = x_shape[0]
Sanity check to make sure the number of features is correct
        assert self.channels == x.shape[1]
Reshape into [batch_size, channels, n]
        x = x.view(batch_size, self.channels, -1)
Update the running mean $\hat{\mu}_C$ and running variance $\hat{\sigma}^2_C$ in training mode only
        if self.training:
No backpropagation through the running mean and variance
            with torch.no_grad():
Calculate the mean across the first and last dimensions; i.e. the mean of each feature over the batch and all positions
                mean = x.mean(dim=[0, 2])
Calculate the mean of squares across the first and last dimensions
                mean_x2 = (x ** 2).mean(dim=[0, 2])
Variance for each feature; the mean of squares minus the squared mean
                var = mean_x2 - mean ** 2
Update the exponential moving averages
                self.exp_mean = (1 - self.momentum) * self.exp_mean + self.momentum * mean
                self.exp_var = (1 - self.momentum) * self.exp_var + self.momentum * var
Normalize
        x_norm = (x - self.exp_mean.view(1, -1, 1)) / torch.sqrt(self.exp_var + self.eps).view(1, -1, 1)
Scale and shift
        if self.affine:
            x_norm = self.scale.view(1, -1, 1) * x_norm + self.shift.view(1, -1, 1)
Reshape to original and return
        return x_norm.view(x_shape)
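The forward pass relies on two details: flattening every trailing dimension into one, and computing the variance as the mean of squares minus the squared mean. The following sketch checks both against PyTorch's own biased variance for inputs with 0, 1 and 2 trailing dimensions. The helper name per_channel_stats is hypothetical, and reshape is used instead of view only so that the helper also accepts non-contiguous tensors.

```python
import torch


def per_channel_stats(x: torch.Tensor):
    """Mean and biased variance per channel for x of shape [batch_size, channels, *]."""
    # Flatten all trailing dimensions: [batch_size, channels, n]
    flat = x.reshape(x.shape[0], x.shape[1], -1)
    mean = flat.mean(dim=[0, 2])
    # Var[x] = E[x^2] - E[x]^2
    var = (flat ** 2).mean(dim=[0, 2]) - mean ** 2
    return mean, var


torch.manual_seed(0)
for shape in [(32, 6), (32, 6, 10), (32, 6, 4, 4)]:
    x = torch.randn(*shape)
    mean, var = per_channel_stats(x)
    # Reference: move channels first, flatten everything else
    ref = x.transpose(0, 1).reshape(shape[1], -1)
    assert torch.allclose(mean, ref.mean(dim=1), atol=1e-5)
    assert torch.allclose(var, ref.var(dim=1, unbiased=False), atol=1e-5)
```

The identity gives the biased (population) variance, so the reference call passes unbiased=False.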
This is similar to Group Normalization, but the affine transform is applied group-wise.
class ChannelNorm(Module):
groups is the number of groups the features are divided into
channels is the number of features in the input
eps is $\epsilon$, used in $\sqrt{Var[x] + \epsilon}$ for numerical stability
affine is whether to scale and shift the normalized value
    def __init__(self, channels, groups,
                 eps: float = 1e-5, affine: bool = True):
        super().__init__()
        self.channels = channels
        self.groups = groups
        self.eps = eps
        self.affine = affine
Parameters for the affine transformation.
Note that these parameters are per group, unlike in group norm, where the scale and shift are applied channel-wise.
        if self.affine:
            self.scale = nn.Parameter(torch.ones(groups))
            self.shift = nn.Parameter(torch.zeros(groups))
x is a tensor of shape [batch_size, channels, *]. * denotes any number of (possibly 0) dimensions. For example, in an image (2D) convolution this will be [batch_size, channels, height, width]
    def forward(self, x: torch.Tensor):
Keep the original shape
        x_shape = x.shape
Get the batch size
        batch_size = x_shape[0]
Sanity check to make sure the number of features is the same
        assert self.channels == x.shape[1]
Reshape into [batch_size, groups, n]
        x = x.view(batch_size, self.groups, -1)
Calculate the mean across the last dimension; i.e. the mean for each sample and channel group, $\mathbb{E}[x]$
        mean = x.mean(dim=[-1], keepdim=True)
Calculate the mean of squares across the last dimension; i.e. the mean of squares for each sample and channel group, $\mathbb{E}[x^2]$
        mean_x2 = (x ** 2).mean(dim=[-1], keepdim=True)
Variance for each sample and feature group; $Var[x] = \mathbb{E}[x^2] - \mathbb{E}[x]^2$
        var = mean_x2 - mean ** 2
Normalize; $\hat{x} = \frac{x - \mathbb{E}[x]}{\sqrt{Var[x] + \epsilon}}$
        x_norm = (x - mean) / torch.sqrt(var + self.eps)
Scale and shift group-wise
        if self.affine:
            x_norm = self.scale.view(1, -1, 1) * x_norm + self.shift.view(1, -1, 1)
Reshape to original and return
        return x_norm.view(x_shape)
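Without the affine step, the normalization above should coincide with group normalization. The sketch below re-implements just the normalization as a standalone function and compares it with torch.nn.functional.group_norm called without weight and bias. The function name channel_norm is hypothetical, and the small tolerance allows for floating-point differences between the two variance computations.

```python
import torch
import torch.nn.functional as F


def channel_norm(x: torch.Tensor, groups: int, eps: float = 1e-5) -> torch.Tensor:
    """Group-wise normalization without the affine transform."""
    x_shape = x.shape
    # Reshape into [batch_size, groups, n]
    x = x.reshape(x_shape[0], groups, -1)
    # Statistics per sample and channel group
    mean = x.mean(dim=-1, keepdim=True)
    var = (x ** 2).mean(dim=-1, keepdim=True) - mean ** 2
    return ((x - mean) / torch.sqrt(var + eps)).view(x_shape)


torch.manual_seed(0)
x = torch.randn(4, 8, 5, 5)
ours = channel_norm(x, groups=2)

# Built-in group normalization with no scale or shift
reference = F.group_norm(x, num_groups=2, eps=1e-5)
max_diff = (ours - reference).abs().max().item()
```

The difference between the two approaches only appears once the affine parameters are learned: a per-group scale and shift has groups values each, while the per-channel parameters of group norm have channels values each.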