Fast weights transformer

The paper Linear Transformers Are Secretly Fast Weight Memory Systems finds similarities between linear self-attention and fast weight systems and modifies the self-attention update rule based on that. It also introduces a simpler, yet effective kernel function. This is an implementation of the paper in PyTorch.

The authors have provided an official implementation of the paper, including the other variants they compare with in the paper.

Fast weights

Consider a sequence of inputs $\big\{x^{(i)}\big\}^L_{i=1}$ of length $L$, where each step is a vector of size $d_{in}$; i.e. $x \in \mathbb{R}^{d_{in}}$. The fast weight model generates a weight matrix at each step to produce output $\big\{y^{(i)}\big\}^L_{i=1}$, $y \in \mathbb{R}^{d_{out}}$:

\begin{align}
a^{(i)}, b^{(i)} &= \color{orange}{W_a} x^{(i)}, \color{orange}{W_b} x^{(i)} \\
\color{cyan}{W^{(i)}} &= \sigma \Big( \color{cyan}{W^{(i-1)}} + a^{(i)} \otimes b^{(i)} \Big) \\
y^{(i)} &= \color{cyan}{W^{(i)}} x^{(i)}
\end{align}

$\otimes$ is the outer product ($a \otimes b = a b^\top$), where elements of the two vectors are multiplied with each other to give a matrix. $\sigma$ is an activation function. $\color{orange}{W_a}$ and $\color{orange}{W_b}$ are trainable weights (parameters). $\color{cyan}{W^{(i)}}$ are the fast weights that are generated at each step.
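
As a minimal sketch (not from the paper's code; the class and attribute names here are made up, and tanh stands in for the activation $\sigma$), the fast weight recurrence can be written in plain PyTorch:

import torch
from torch import nn
class FastWeightsSketch(nn.Module):
    def __init__(self, d_in: int, d_out: int):
        super().__init__()
        # Slow (trainable) weights that generate the fast weight update at each step
        self.w_a = nn.Linear(d_in, d_out, bias=False)
        self.w_b = nn.Linear(d_in, d_in, bias=False)
        self.d_in, self.d_out = d_in, d_out
    def forward(self, x: torch.Tensor):
        # x has shape [seq_len, d_in]
        w = x.new_zeros(self.d_out, self.d_in)  # fast weights W^(0) = 0
        ys = []
        for x_i in x:
            a, b = self.w_a(x_i), self.w_b(x_i)
            # W^(i) = sigma(W^(i-1) + a^(i) ⊗ b^(i)), with tanh as the activation here
            w = torch.tanh(w + torch.outer(a, b))
            # y^(i) = W^(i) x^(i)
            ys.append(w @ x_i)
        return torch.stack(ys)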

Linear self-attention

Original transformer self-attention is (omitting $\frac{1}{\sqrt{d_k}}$ for clarity)

\begin{align}
y^{(i)} &= \Big[v^{(1)}, v^{(2)}, ..., v^{(i)}\Big] \text{softmax}
 \bigg(
    \Big[k^{(1)}, k^{(2)}, ..., k^{(i)}\Big]^\top
    q^{(i)}
 \bigg) \\
 &= \sum^i_{j=1} \frac
 { v^{(j)} \kappa(k^{(j)}, q^{(i)}) }
 { \sum^i_{j'=1} \kappa(k^{(j')}, q^{(i)}) }
\end{align}

where $\kappa(k, q) = \text{exp}(k \cdot q)$

The idea behind linearizing self-attention is to replace the softmax kernel $\kappa$ with a different kernel $\kappa '$ so that we can calculate the denominator of the self-attention function faster:

$$\kappa '(k, q) = \color{lightgreen}{\phi(k)}^\top \color{lightgreen}{\phi(q)}$$

This gives

\begin{align}
y^{(i)} &= \frac
 {\Big( \sum^i_{j=1} v^{(j)} \otimes \color{lightgreen}{\phi(k^{(j)})} \Big)
    \color{lightgreen}{\phi(q^{(i)})} }
 { \Big( \sum^i_{j=1}
   \color{lightgreen}{\phi(k^{(j)})} \Big) \cdot
    \color{lightgreen}{\phi(q^{(i)})} }
\end{align}

With $\color{cyan}{W^{(i)}} = \sum^i_{j=1} v^{(j)} \otimes \phi(k^{(j)})$ and $z^{(i)} = \sum^i_{j=1} \color{lightgreen}{\phi(k^{(j)})}$, we can calculate them efficiently:

\begin{align}
\color{cyan}{W^{(i)}} &= \color{cyan}{W^{(i-1)}} + v^{(i)} \otimes \color{lightgreen}{\phi(k^{(i)})} \\
z^{(i)} &= z^{(i-1)} + \color{lightgreen}{\phi(k^{(i)})} \\
y^{(i)} &= \frac{1}{z^{(i)} \cdot \color{lightgreen}{\phi(q^{(i)})}} \color{cyan}{W^{(i)}} \color{lightgreen}{\phi(q^{(i)})}
\end{align}

This is quite similar to fast weights.
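
As a rough sketch of this recurrence (single head, no batching; the ELU-plus-one $\phi$ is one common choice from the linear-attention literature, not the projection introduced by this paper):

import torch
def elu_plus_one(x: torch.Tensor):
    # A simple positive feature map phi; an assumption for this sketch
    return torch.nn.functional.elu(x) + 1.
def linear_attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor):
    # q, k: [seq_len, d_key], v: [seq_len, d_value]
    seq_len, d_key = k.shape
    w = v.new_zeros(v.shape[1], d_key)  # W^(0)
    z = k.new_zeros(d_key)              # z^(0)
    ys = []
    for i in range(seq_len):
        phi_q, phi_k = elu_plus_one(q[i]), elu_plus_one(k[i])
        w = w + torch.outer(v[i], phi_k)      # W^(i) = W^(i-1) + v^(i) ⊗ phi(k^(i))
        z = z + phi_k                         # z^(i) = z^(i-1) + phi(k^(i))
        ys.append((w @ phi_q) / (z @ phi_q))  # y^(i)
    return torch.stack(ys)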

The paper introduces a new linear-attention projection function $\color{lightgreen}{\phi}$, a new update rule for $\color{cyan}{W^{(i)}} = f(\color{cyan}{W^{(i-1)}})$, and changes the normalization $\frac{1}{z^{(i)} \cdot \color{lightgreen}{\phi(q^{(i)})}}$.

Here are the training code and a notebook for training a fast weights transformer on the Tiny Shakespeare dataset.


95import torch
96from torch import nn
97
98from labml_helpers.module import Module
99from labml_nn.transformers.feed_forward import FeedForward
100from labml_nn.transformers.mha import PrepareForMultiHeadAttention
101from labml_nn.utils import clone_module_list

Deterministic Parameter Free Project (DPFP)

This is the new projection function $\color{lightgreen}{\phi}$ introduced in the paper. DPFP projects $k$ of dimensionality $d_{key}$ to dimensionality $d_{dot} = 2 d_{key} \nu$, where $\nu \in \{1, 2, ..., 2 d_{key} - 1 \}$ is a hyper-parameter.

$$\color{lightgreen}{\phi_{2 d_{key} (i - 1) + j}(k)}
 = \text{ReLU}\Big(\big[k, -k\big]\Big)_{j}
   \text{ReLU}\Big(\big[k, -k\big]\Big)_{i + j}$$

where $\big[k, -k\big]$ is the concatenation of $k$ and $-k$ to give a vector of size $2 d_{key}$, $i \in \{1, 2, ..., \nu \}$, and $j \in \{1, 2, ..., 2 d_{key}\}$. $x_i$ is the $i$-th element of vector $x$ and is rolled around if $i$ is larger than the number of elements in $x$.

Basically, it creates a new vector by multiplying $x = \text{ReLU}\Big(\big[k, -k\big]\Big)$ element-wise with copies of itself rolled by $i$.

This produces projections that are sparse (only a few elements of $\color{lightgreen}{\phi}$ are non-zero) and orthogonal ($\color{lightgreen}{\phi(k^{(i)})} \cdot \color{lightgreen}{\phi(k^{(j)})} \approx 0$ for most $i, j$, unless $k^{(i)}$ and $k^{(j)}$ are very similar).

Normalization

The paper introduces a simple normalization for $\color{lightgreen}{\phi}$,

$$\color{lightgreen}{\phi '(k)} = \frac{\color{lightgreen}{\phi(k)}}{\sum^{d_{dot}}_{j=1} \color{lightgreen}{\phi(k)_j}}$$

Check the paper for derivation.

104class DPFP(Module):
  • nu is the hyper-parameter $\nu$.
  • eps is the small value used to make sure there is no division-by-zero when normalizing.
138    def __init__(self, nu: int = 1, eps: float = 1e-6):
143        super().__init__()
144        self.nu = nu
145        self.relu = nn.ReLU()
146        self.eps = eps
148    def __call__(self, k: torch.Tensor):

Get $\color{lightgreen}{\phi(k)}$

150        k = self.dpfp(k)

Normalize by $\sum^{d_{dot}}_{j=1} \color{lightgreen}{\phi(k)_j}$

152        return k / (torch.sum(k, dim=-1, keepdim=True) + self.eps)

154    def dpfp(self, k: torch.Tensor):

$x = \text{ReLU}\Big(\big[k, -k\big]\Big)$

159        x = self.relu(torch.cat([k, -k], dim=-1))

Shift and roll by $i \in \{1, 2, …, \nu \}$, to get rolled copies $x'_{i,j} = x_{j - i}$ (indices wrap around)

162        x_rolled = [x.roll(shifts=i, dims=-1) for i in range(1, self.nu + 1)]

Concatenate the rolled copies to get $x''_{2 d_{key} (i - 1) + j} = x'_{i,j}$

165        x_rolled = torch.cat(x_rolled, dim=-1)

Concatenate copies of $x$

167        x_repeat = torch.cat([x] * self.nu, dim=-1)

Multiply them element-wise to get $\color{lightgreen}{\phi(k)}$, where each element is a product of two elements of $\text{ReLU}\Big(\big[k, -k\big]\Big)$ offset by $i$

173        return x_repeat * x_rolled
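
A usage sketch with arbitrary shapes: the last dimension goes from $d_{key}$ to $2 d_{key} \nu$, the result sums to one along that dimension, and most of its elements are exactly zero:

phi = DPFP(nu=2)
k = torch.randn(10, 4, 8, 16)       # [seq_len, batch_size, heads, d_key]
phi_k = phi(k)
print(phi_k.shape)                  # torch.Size([10, 4, 8, 64]), i.e. 2 * d_key * nu
print(phi_k.sum(dim=-1)[0, 0, 0])   # ~1.0, since phi' is normalized
print((phi_k == 0).float().mean())  # most elements are exactly zero (sparse)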

Fast Weights Attention

The paper introduces a new update rule for calculating $\color{cyan}{W^{(i)}}$. The model first retrieves the current value $\bar{v}^{(i)}$ paired with the key $k^{(i)}$. Then it stores a combination $v^{(i)}_{new}$ of the retrieved value $\bar{v}^{(i)}$ and the input $v^{(i)}$.

\begin{align}
k^{(i)}, v^{(i)}, q^{(i)} &=
 \color{orange}{W_k} x^{(i)}, \color{orange}{W_v} x^{(i)}, \color{orange}{W_q} x^{(i)} \\
\bar{v}^{(i)} &= \color{cyan}{W^{(i-1)}} \color{lightgreen}{\phi'(k^{(i)})} \\
\beta^{(i)} &= \sigma \Big(\color{orange}{W_\beta} x^{(i)} \Big) \\
v^{(i)}_{new} &= \beta^{(i)} v^{(i)} + \Big(1 - \beta^{(i)} \Big) \bar{v}^{(i)} \\
\color{cyan}{W^{(i)}} &= \color{cyan}{W^{(i-1)}} +
 \Big( v^{(i)}_{new} - \bar{v}^{(i)} \Big) \otimes \color{lightgreen}{\phi'(k^{(i)})} \\
 &= \color{cyan}{W^{(i-1)}} +
 \beta^{(i)} \Big( v^{(i)} - \bar{v}^{(i)} \Big) \otimes \color{lightgreen}{\phi'(k^{(i)})} \\
y^{(i)} &= \color{cyan}{W^{(i)}} \color{lightgreen}{\phi'(q^{(i)})}
\end{align}

where $\color{orange}{W_\beta}$ is a trainable parameter and $\sigma$ is the sigmoid function.

Note that we don't need the normalization term $z$ because $\color{lightgreen}{\phi'}$ is normalized.
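
One step of this rule for a single head and a single example, as an illustrative sketch (the function name is hypothetical; the class below does the same thing for all heads in a batch):

def fast_weights_step(w: torch.Tensor, phi_k: torch.Tensor, v: torch.Tensor,
                      phi_q: torch.Tensor, beta: torch.Tensor):
    # w: [d_value, d_dot], phi_k and phi_q: [d_dot], v: [d_value], beta: scalar in (0, 1)
    v_bar = w @ phi_k                               # retrieve the value paired with the key
    w = w + torch.outer(beta * (v - v_bar), phi_k)  # W^(i) = W^(i-1) + beta (v - v_bar) ⊗ phi'(k)
    y = w @ phi_q                                   # y^(i) = W^(i) phi'(q)
    return w, y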

176class FastWeightsAttention(Module):
204    def __init__(self, heads: int, d_model: int, dropout_prob: float, phi: DPFP):
205        super().__init__()

Number of features per head $d_k$

208        self.d_k = d_model // heads

Number of heads

210        self.heads = heads

These transform the query, key and value for multi-headed attention.

213        self.query = PrepareForMultiHeadAttention(d_model, heads, self.d_k, bias=False)
214        self.key = PrepareForMultiHeadAttention(d_model, heads, self.d_k, bias=False)
215        self.value = PrepareForMultiHeadAttention(d_model, heads, self.d_k, bias=False)

Interpolation weight function $\sigma \Big(\color{orange}{W_\beta} x^{(i)} \Big)$ for each head

218        self.interpolation_weight = nn.Sequential(
219            PrepareForMultiHeadAttention(d_model, heads, 1, bias=False),
220            nn.Sigmoid()
221        )

$\color{lightgreen}{\phi'}$

224        self.phi = phi

Output layer

227        self.output = nn.Linear(d_model, d_model)

Dropout

229        self.dropout = nn.Dropout(dropout_prob)
231    def __call__(self, x: torch.Tensor):

Get the number of steps $L$

233        seq_len = x.shape[0]

$\color{lightgreen}{\phi'(q^{(i)})}$ for all steps and heads

235        query = self.phi(self.query(x))

$\color{lightgreen}{\phi'(k^{(i)})}$ for all steps and heads

237        key = self.phi(self.key(x))

$v^{(i)}$ for all steps and heads

239        value = self.value(x)

$\beta^{(i)}$ for all steps and heads

241        beta = self.interpolation_weight(x)

Initial fast weights $\color{cyan}{W^{(0)}} = 0$ for each head

244        weights = key.new_zeros((key.shape[1], key.shape[2], value.shape[3], key.shape[3]))

List to store outputs $y^{(i)}$

246        outputs = []

Iterate through steps

249        for i in range(seq_len):

Retrieve the current value $\bar{v}^{(i)} = \color{cyan}{W^{(i-1)}} \color{lightgreen}{\phi'(k^{(i)})}$

251            value_existing = torch.einsum('bhvk,bhk->bhv', weights, key[i])

Update the fast weights $\color{cyan}{W^{(i)}} = \color{cyan}{W^{(i-1)}} + \beta^{(i)} \Big( v^{(i)} - \bar{v}^{(i)} \Big) \otimes \color{lightgreen}{\phi'(k^{(i)})}$

256            weights = weights + torch.einsum('bhv,bhk->bhvk', beta[i] * (value[i] - value_existing), key[i])

Calculate the output $y^{(i)} = \color{cyan}{W^{(i)}} \color{lightgreen}{\phi'(q^{(i)})}$

259            y = torch.einsum('bhvk,bhk->bhv', weights, query[i])

Merge multiple heads and append to outputs

262            outputs.append(y.reshape(y.shape[0], -1))

Stack outputs at each step into a single tensor

265        x = torch.stack(outputs)

Output layer

268        return self.output(x)
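
A quick usage sketch with arbitrary hyper-parameters; the input shape is [seq_len, batch_size, d_model], matching the code above:

attn = FastWeightsAttention(heads=4, d_model=128, dropout_prob=0.1, phi=DPFP(nu=1))
x = torch.randn(32, 8, 128)  # [seq_len, batch_size, d_model]
print(attn(x).shape)         # torch.Size([32, 8, 128])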

This is a general transformer layer that combines self-attention and a feed-forward network.

271class FastWeightsAttentionTransformerLayer(Module):
275    def __init__(self, *,
276                 d_model: int,
277                 attn: FastWeightsAttention,
278                 feed_forward: FeedForward,
279                 dropout_prob: float):
280        super().__init__()

Transformer size $d_{model}$

282        self.size = d_model

Fast weights attention module

284        self.attn = attn

Feed-forward network

286        self.feed_forward = feed_forward

Dropout layer

288        self.dropout = nn.Dropout(dropout_prob)

Normalization layers

291        self.norm_self_attn = nn.LayerNorm([d_model])
292        self.norm_ff = nn.LayerNorm([d_model])
294    def __call__(self, x: torch.Tensor):

Calculate fast weights self attention

296        attn = self.attn(x)

Add the self attention results

298        x = x + self.dropout(attn)

Normalize for feed-forward

301        z = self.norm_ff(x)

Pass through the feed-forward network

303        ff = self.feed_forward(z)

Add the feed-forward results back

305        x = x + self.dropout(ff)
308        return x

This is a general transformer module with multiple transformer layers.

311class FastWeightsAttentionTransformer(Module):
315    def __init__(self, layer: FastWeightsAttentionTransformerLayer, n_layers: int):
316        super().__init__()

Make copies of the transformer layer

318        self.layers = clone_module_list(layer, n_layers)

Final normalization layer

320        self.norm = nn.LayerNorm([layer.size])
322    def __call__(self, x: torch.Tensor):
323        for i, layer in enumerate(self.layers):

Get layer output

325            x = layer(x)

Normalize the output

328        return self.norm(x)
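
Putting the pieces together, a sketch of building the full model; it assumes FeedForward from labml_nn takes d_model, d_ff and dropout, and omits the embedding and readout layers used for language modelling:

d_model = 128
attn = FastWeightsAttention(heads=4, d_model=d_model, dropout_prob=0.1, phi=DPFP(nu=1))
ffn = FeedForward(d_model, d_ff=256, dropout=0.1)  # assumed signature
layer = FastWeightsAttentionTransformerLayer(d_model=d_model, attn=attn,
                                             feed_forward=ffn, dropout_prob=0.1)
model = FastWeightsAttentionTransformer(layer, n_layers=6)
x = torch.randn(32, 8, d_model)  # [seq_len, batch_size, d_model]
print(model(x).shape)            # torch.Size([32, 8, 128])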