#

Position-wise Feed-Forward Network (FFN)

This is a PyTorch implementation of the position-wise feed-forward network used in the Transformer.

The FFN consists of two fully connected layers. The number of dimensions in the hidden layer, $d_{ff}$, is generally set to around four times that of the token embedding, $d_{model}$. For this reason it is sometimes also called the expand-and-contract network.
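The expand-and-contract shape can be checked with a minimal sketch. The sizes `d_model = 512` and `d_ff = 2048`, the input shape, and the variable names are illustrative assumptions, not values fixed by this implementation.

```python
import torch
from torch import nn

# Illustrative sizes (assumed): d_ff is four times d_model
d_model, d_ff = 512, 2048

expand = nn.Linear(d_model, d_ff)    # first layer: d_model -> d_ff
contract = nn.Linear(d_ff, d_model)  # second layer: d_ff -> d_model

x = torch.randn(2, 10, d_model)      # assumed layout: [batch, seq_len, d_model]
hidden = expand(x)                   # last dimension grows to d_ff
out = contract(hidden)               # last dimension returns to d_model

# Weights and biases of both layers:
# 512 * 2048 + 2048 + 2048 * 512 + 512
n_params = sum(p.numel()
               for layer in (expand, contract)
               for p in layer.parameters())
print(hidden.shape, out.shape, n_params)
```

With these assumed sizes the two layers hold 2,099,712 parameters, almost all of them in the two weight matrices.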

An activation function is applied at the hidden layer, usually ReLU (Rectified Linear Unit), $\max(0, x)$.

That is, the FFN function is $\mathrm{FFN}(x, W_{1}, W_{2}, b_{1}, b_{2}) = \max(0, x W_{1} + b_{1}) W_{2} + b_{2}$, where $W_{1}$, $W_{2}$, $b_{1}$ and $b_{2}$ are learnable parameters.
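The formula can be written directly as a function. The following is a sketch with randomly initialized parameters; the helper name `ffn` and the small sizes are assumptions made for illustration. It also checks what "position-wise" means here: applying the function to a whole sequence gives the same result as applying it to each position separately.

```python
import torch


def ffn(x, w1, b1, w2, b2):
    """Compute max(0, x W1 + b1) W2 + b2 over the last dimension of x."""
    return torch.relu(x @ w1 + b1) @ w2 + b2


torch.manual_seed(0)
d_model, d_ff = 8, 32  # small illustrative sizes (assumed)
w1, b1 = torch.randn(d_model, d_ff), torch.randn(d_ff)
w2, b2 = torch.randn(d_ff, d_model), torch.randn(d_model)

x = torch.randn(5, d_model)  # a sequence of 5 positions

# Apply the FFN to the whole sequence at once
full = ffn(x, w1, b1, w2, b2)

# Apply the same FFN to each position on its own and stack the results
per_position = torch.stack([ffn(x[i], w1, b1, w2, b2)
                            for i in range(x.shape[0])])

print(torch.allclose(full, per_position, rtol=1e-4, atol=1e-4))
```

Because no position looks at any other, the same weights are shared across the sequence and the sequence dimension behaves like a batch dimension.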

Sometimes the GELU (Gaussian Error Linear Unit) activation is used instead of ReLU: $x \Phi(x)$, where $\Phi(x) = P(X \leq x), X \sim \mathcal{N}(0, 1)$.
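As a quick check of the GELU definition, the standard normal CDF can be expressed through the error function, $\Phi(x) = \frac{1}{2}\left(1 + \mathrm{erf}(x / \sqrt{2})\right)$. The sketch below compares that form with PyTorch's built-in `gelu`; the helper name `gelu_exact` is an assumption for illustration.

```python
import math

import torch
import torch.nn.functional as F


def gelu_exact(x: torch.Tensor) -> torch.Tensor:
    # Phi(x) = 0.5 * (1 + erf(x / sqrt(2))) is the standard normal CDF
    phi = 0.5 * (1.0 + torch.erf(x / math.sqrt(2.0)))
    return x * phi


x = torch.linspace(-3.0, 3.0, steps=13)
manual = gelu_exact(x)
builtin = F.gelu(x)  # the default (non-approximate) mode is erf-based

print(torch.allclose(manual, builtin, atol=1e-6))
```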

Gated Linear Units

This is a generic implementation that supports different variants, including Gated Linear Units (GLU). We have also implemented experiments on these variants.

import torch
from torch import nn

#

FFN module

class FeedForward(nn.Module):

#

d_model is the number of features in a token embedding
d_ff is the number of features in the hidden layer of the FFN
dropout is the dropout probability for the hidden layer
activation is the activation function applied at the hidden layer
is_gated specifies whether the hidden layer is gated
bias1 specifies whether the first fully connected layer should have a learnable bias
bias2 specifies whether the second fully connected layer should have a learnable bias
bias_gate specifies whether the fully connected layer for the gate should have a learnable bias

    def __init__(self, d_model: int, d_ff: int,
                 dropout: float = 0.1,
                 activation=nn.ReLU(),
                 is_gated: bool = False,
                 bias1: bool = True,
                 bias2: bool = True,
                 bias_gate: bool = True):

#

        super().__init__()

#

Layer one parameterized by weight $W_{1}$ and bias $b_{1}$

        self.layer1 = nn.Linear(d_model, d_ff, bias=bias1)

#

Layer two parameterized by weight $W_{2}$ and bias $b_{2}$

        self.layer2 = nn.Linear(d_ff, d_model, bias=bias2)

#

Hidden layer dropout

        self.dropout = nn.Dropout(dropout)

#

Activation function $f$

        self.activation = activation

#

Whether there is a gate

        self.is_gated = is_gated
        if is_gated:

#

If there is a gate, this is the linear layer that transforms the inputs to be multiplied by the gate, parameterized by weight $V$ and bias $c$

            self.linear_v = nn.Linear(d_model, d_ff, bias=bias_gate)

#

    def forward(self, x: torch.Tensor):

#

$f (x W_{1} + b_{1})$

        g = self.activation(self.layer1(x))

#

If gated, $f (x W_{1} + b_{1}) \otimes (x V + c)$

        if self.is_gated:
            x = g * self.linear_v(x)

#

Otherwise

        else:
            x = g

#

Apply dropout

        x = self.dropout(x)

#

$(f (x W_{1} + b_{1}) \otimes (x V + c)) W_{2} + b_{2}$ or $f (x W_{1} + b_{1}) W_{2} + b_{2}$ depending on whether it is gated

        return self.layer2(x)
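A short usage sketch follows. So that the snippet runs on its own, it repeats the module in condensed form; the tensor shape, the sizes, and the choice of a sigmoid activation for the gated variant are assumptions made for illustration, not settings prescribed by this implementation.

```python
import torch
from torch import nn


class FeedForward(nn.Module):
    # Condensed copy of the module walked through above, kept only so
    # that this snippet is self-contained.
    def __init__(self, d_model, d_ff, dropout=0.1, activation=nn.ReLU(),
                 is_gated=False, bias1=True, bias2=True, bias_gate=True):
        super().__init__()
        self.layer1 = nn.Linear(d_model, d_ff, bias=bias1)
        self.layer2 = nn.Linear(d_ff, d_model, bias=bias2)
        self.dropout = nn.Dropout(dropout)
        self.activation = activation
        self.is_gated = is_gated
        if is_gated:
            self.linear_v = nn.Linear(d_model, d_ff, bias=bias_gate)

    def forward(self, x):
        g = self.activation(self.layer1(x))
        x = g * self.linear_v(x) if self.is_gated else g
        return self.layer2(self.dropout(x))


x = torch.randn(10, 4, 64)  # assumed layout: [seq_len, batch_size, d_model]

# Plain FFN with the default ReLU activation
plain = FeedForward(d_model=64, d_ff=256)

# Gated variant: a sigmoid activation multiplied by a linear projection
gated = FeedForward(d_model=64, d_ff=256,
                    activation=nn.Sigmoid(), is_gated=True)

# eval() disables dropout so the outputs are deterministic
plain.eval()
gated.eval()
with torch.no_grad():
    y_plain = plain(x)
    y_gated = gated(x)

print(y_plain.shape, y_gated.shape)
```

Both variants map the last dimension from `d_model` back to `d_model`, so they can be swapped without changing the surrounding layers. The gated variant adds one extra `d_model` to `d_ff` projection, and with it the corresponding extra parameters.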