An Attention Free Transformer

This is a PyTorch implementation of the paper An Attention Free Transformer.

This paper replaces the self-attention layer with a new efficient operation that has a memory complexity of $\mathcal{O}(Td)$, where $T$ is the sequence length and $d$ is the dimensionality of the embeddings.

The paper introduces AFT along with AFT-local and AFT-conv. Here we have implemented AFT-local, which attends only to nearby tokens in an autoregressive model.

Attention Free Transformer

AFT (similar to MHA) first transforms the embeddings $X$ into query $Q = XW^Q$, key $K = XW^K$ and value $V = XW^V$ tensors with learned weights. The output for each position $t \in [1, T]$ is calculated with the following operation.

$$Y_t = \sigma(Q_t) \odot \frac{\sum_{t'=1}^{T} \exp\big(K_{t'} + w_{t,t'}\big) \odot V_{t'}}{\sum_{t'=1}^{T} \exp\big(K_{t'} + w_{t,t'}\big)}$$

where $\odot$ is the element-wise product, $\sigma$ is a non-linearity (sigmoid) and $w \in \mathbb{R}^{T \times T}$ is a learned matrix of pair-wise position biases.

This means that we take a weighted average of the values and multiply it by the sigmoid-gated query. This eliminates the need to calculate the $T \times T$ attention matrix that MHA requires, and therefore reduces the memory requirement.
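To make the operation concrete, here is a minimal, unbatched sketch of the basic AFT computation for a single sequence. naive_aft is a hypothetical helper written for illustration only; unlike the implementation below, it skips the numerical stabilization.

import torch

def naive_aft(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor, w: torch.Tensor):
    # q, k, v: [seq_len, d_model]; w: [seq_len, seq_len] pair-wise position biases
    # exp(K_{t'} + w_{t,t'}) for every (t, t') pair: [seq_len, seq_len, d_model]
    weights = torch.exp(k[None, :, :] + w[:, :, None])
    # Weighted average of values over t', gated by sigma(Q_t)
    num = (weights * v[None, :, :]).sum(dim=1)
    den = weights.sum(dim=1)
    return torch.sigmoid(q) * num / den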

AFT Local

AFT Local only applies the learned pair-wise position biases locally:

$$w'_{t,t'} = \begin{cases} w_{t,t'}, & \text{for } \lvert t - t' \rvert < s \\ 0, & \text{otherwise} \end{cases}$$

where $s \le T$ is the local window size.

Although $w'_{t,t'}$ is $0$ outside the local window, the AFT operation still uses key-value pairs from other positions. This is different from local transformers, where embeddings outside the local window are not visible at all.
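As a small illustration (sizes made up), the local biases $w'$ can be obtained by zeroing $w$ outside the window $\lvert t - t' \rvert < s$; the implementation below does the same thing with a precomputed mask.

import torch

T, s = 6, 2
w = torch.randn(T, T)                                  # learned pair-wise position biases
t = torch.arange(T)
window = (t[:, None] - t[None, :]).abs() < s           # True where |t - t'| < s
w_local = torch.where(window, w, torch.zeros_like(w))  # w'_{t,t'}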

Here is the training code for an AFT Local model.


from typing import Optional

import torch
from torch import nn

from labml_helpers.module import Module

AFT Local Operation

$$Y_t = \sigma(Q_t) \odot \frac{\sum_{t'=1}^{T} \exp\big(K_{t'} + w'_{t,t'}\big) \odot V_{t'}}{\sum_{t'=1}^{T} \exp\big(K_{t'} + w'_{t,t'}\big)}$$

where,

$$w'_{t,t'} = \begin{cases} w_{t,t'}, & \text{for } \lvert t - t' \rvert < s \\ 0, & \text{otherwise} \end{cases}$$

class AFTLocal(Module):
  • d_model is the number of features in the query, key and value vectors.
  • seq_len is the sequence length $T$.
  • local_window_size is the local window size $s$.
  • bias is whether to include a bias parameter in the transformations for $Q$, $K$ and $V$.
    def __init__(self, d_model: int, seq_len: int, local_window_size: int, bias: bool = True):
        super().__init__()

Local window size $s$

        self.local_window_size = local_window_size

These transform the query, key and value vectors.

        self.query = nn.Linear(d_model, d_model, bias=bias)
        self.key = nn.Linear(d_model, d_model, bias=bias)
        self.value = nn.Linear(d_model, d_model, bias=bias)

Pair-wise positional biases $w \in \mathbb{R}^{T \times T}$

        self.pos_bias = nn.Parameter(torch.zeros(seq_len, seq_len), requires_grad=True)

Mask for $w_{t,t'}$

        self.local_mask = nn.Parameter(self.create_local_mask(seq_len, local_window_size), requires_grad=False)

Activation $\sigma$

        self.activation = nn.Sigmoid()

Output layer

        self.output = nn.Linear(d_model, d_model)

Create local mask

This creates a mask for

$$m_{t,t'} = \begin{cases} 1, & \text{for } \lvert t - t' \rvert < s \\ 0, & \text{otherwise} \end{cases}$$

    @staticmethod
    def create_local_mask(seq_len, local_window_size):

Initialize to ones

        local_mask = torch.ones(seq_len, seq_len, dtype=torch.bool)

Make $t' - t \ge s$ zero

        local_mask = torch.tril(local_mask, local_window_size - 1)

Make $t - t' \ge s$ zero

        local_mask = torch.triu(local_mask, -(local_window_size - 1))
        return local_mask
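As a quick sanity check, AFTLocal.create_local_mask(5, 2) keeps a band of width one on each side of the diagonal, i.e. positions with $\lvert t - t' \rvert < 2$:

AFTLocal.create_local_mask(5, 2)
# tensor([[ True,  True, False, False, False],
#         [ True,  True,  True, False, False],
#         [False,  True,  True,  True, False],
#         [False, False,  True,  True,  True],
#         [False, False, False,  True,  True]])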

query, key and value are the tensors that store collections of token embeddings for query, key and value. They have shape [seq_len, batch_size, d_model].

mask has shape [seq_len, seq_len, batch_size] and mask[i, j, b] indicates whether for batch b, query at position i has access to key-value at position j.
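For example, a standard autoregressive (causal) mask shared across the whole batch could be built as follows; this is only an illustration of the expected layout, not part of the module.

import torch

seq_len = 8
# [seq_len, seq_len, 1]: query at position i may attend to key-value positions j <= i,
# with a batch dimension of 1 so it is broadcast across the batch
causal_mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool)).unsqueeze(-1)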

    def forward(self, *,
                query: torch.Tensor,
                key: torch.Tensor,
                value: torch.Tensor,
                mask: Optional[torch.Tensor] = None):

query, key and value have shape [seq_len, batch_size, d_model]

        seq_len, _, _ = query.shape

        if mask is not None:

mask has shape [seq_len_q, seq_len_k, batch_size], where the first dimension is the query dimension. If the query dimension is equal to $1$ it will be broadcast.

            assert mask.shape[0] == 1 or mask.shape[0] == query.shape[0]
            assert mask.shape[1] == key.shape[0]
            assert mask.shape[2] == 1 or mask.shape[2] == query.shape[1]

Transform query, key and value embeddings

        query = self.query(query)
        key = self.key(key)
        value = self.value(value)

Get $w'_{t,t'}$ using the local mask, and set biases that are excluded by mask to $-\infty$

        pos_bias = self.pos_bias[:seq_len, :seq_len] * self.local_mask[:seq_len, :seq_len]
        pos_bias = pos_bias.unsqueeze(-1)
        if mask is not None:
            pos_bias = pos_bias.masked_fill(~mask, float('-inf'))

We compute $\exp(w_{t,t'})$, $\exp(K_{t'}) \odot V_{t'}$ and $\exp(K_{t'})$ separately and do a matrix multiplication. We use einsum for clarity.

We subtract $\max_{t'}(K_{t'})$ and $\max_{t'}(w_{t,t'})$ before calculating the exponents to stabilize the softmax calculation.

If $x_i$ is large, $\exp(x_i)$ becomes huge and the computation of $\frac{\sum\exp(x_i)y_i}{\sum\exp(x_i)}$ becomes unstable. Subtracting the same constant from the exponents of both the numerator and the denominator cancels out, and helps stabilize the computation. So we subtract $\max(x_i)$ before calculating the exponents.
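A tiny numeric illustration of the shift (values chosen arbitrarily):

import torch

x = torch.tensor([100.0, 101.0])
torch.exp(x)            # tensor([inf, inf]) -- overflows in 32-bit floats
torch.exp(x - x.max())  # tensor([0.3679, 1.0000]) -- same ratios, but stable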

        max_key = key.max(dim=0, keepdim=True)[0]
        max_pos_bias = pos_bias.max(dim=1, keepdim=True)[0]

$\exp \big(K_{t'} - \max_{t'}(K_{t'})\big)$

        exp_key = torch.exp(key - max_key)

$\exp \big(w_{t,t'} - \max_{t'}(w_{t,t'})\big)$

        exp_pos_bias = torch.exp(pos_bias - max_pos_bias)

The numerator part $\sum_{t'=1}^T \exp(w_{t,t'}) \odot \exp(K_{t'}) \odot V_{t'}$

        num = torch.einsum('ijb,jbd->ibd', exp_pos_bias, exp_key * value)

The denominator part $\sum_{t'=1}^T \exp(w_{t,t'}) \odot \exp(K_{t'})$

        den = torch.einsum('ijb,jbd->ibd', exp_pos_bias, exp_key)
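The pattern 'ijb,jbd->ibd' contracts over the key position j: for each query position i and batch element b it sums exp_pos_bias[i, j, b] * x[j, b, :] over j. A standalone shape check with arbitrary sizes:

import torch

T, B, D = 10, 4, 16
weights = torch.rand(T, T, B)   # like exp_pos_bias
x = torch.rand(T, B, D)         # like exp_key * value (or exp_key)
out = torch.einsum('ijb,jbd->ibd', weights, x)
assert out.shape == (T, B, D)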

Output

        y = self.activation(query) * num / den

Output layer

        return self.output(y)

Test local mask

def _test_local_mask():
    from labml.logger import inspect
    inspect(AFTLocal.create_local_mask(10, 4))


if __name__ == '__main__':
    _test_local_mask()
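Finally, a minimal usage sketch of the module defined above; the sizes and the causal mask are made up for illustration.

import torch

d_model, seq_len, batch_size = 64, 32, 4
aft = AFTLocal(d_model=d_model, seq_len=seq_len, local_window_size=8)

x = torch.randn(seq_len, batch_size, d_model)
# Causal mask of shape [seq_len, seq_len, 1], broadcast across the batch
mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool)).unsqueeze(-1)
y = aft(query=x, key=x, value=x, mask=mask)
assert y.shape == (seq_len, batch_size, d_model)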