This is a PyTorch implementation of the GATv2 operator from the paper "How Attentive are Graph Attention Networks?".

GATv2, like GAT, works on graph data. A graph consists of nodes and edges connecting the nodes. For example, in the Cora dataset the nodes are research papers and the edges are citations that connect the papers.

The GATv2 operator fixes the static attention problem of the standard GAT. Static attention is when the attention to the key nodes has the same rank (order) for any query node. GAT computes attention from query node $i$ to key node $j$ as,

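$e_{ij} = \text{LeakyReLU}\big(\mathbf{a}^\top [\mathbf{W} h_i \Vert \mathbf{W} h_j]\big) = \text{LeakyReLU}\big(\mathbf{a}_1^\top \mathbf{W} h_i + \mathbf{a}_2^\top \mathbf{W} h_j\big)$

Note that for any query node $i$, the attention rank ($\text{argsort}$) of the keys depends only on $\mathbf{a}_2^\top \mathbf{W} h_j$, because the term $\mathbf{a}_1^\top \mathbf{W} h_i$ is the same for every key and $\text{LeakyReLU}$ is monotonic. Therefore the attention rank of keys remains the same (*static*) for all queries.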
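The static ranking can be checked numerically. The following is a minimal sketch with random, made-up weights (`W`, `a_1`, `a_2` and the sizes are assumptions for illustration, not values from the paper): it computes the GAT scores for all query/key pairs and compares the key ordering across queries.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
n_nodes, in_features, out_features = 5, 4, 3

# Hypothetical node features and GAT parameters
h = torch.randn(n_nodes, in_features)
W = torch.randn(in_features, out_features)
a_1 = torch.randn(out_features)  # half of `a` that multiplies W h_i (query)
a_2 = torch.randn(out_features)  # half of `a` that multiplies W h_j (key)

s_query = (h @ W) @ a_1  # a_1^T W h_i for every i
s_key = (h @ W) @ a_2    # a_2^T W h_j for every j

# e[i, j] = LeakyReLU(a_1^T W h_i + a_2^T W h_j)
e = F.leaky_relu(s_query[:, None] + s_key[None, :], negative_slope=0.2)

# Ordering of the keys for each query (one row per query)
ranks = e.argsort(dim=1)
static = bool((ranks == ranks[0]).all())
print(ranks)
print("same key ranking for every query:", static)
```

Every row of `ranks` is the same permutation, namely the ordering of `s_key`, which is what "static" means here.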

GATv2 allows dynamic attention by changing the attention mechanism,

$e_{ij} = \mathbf{a}^\top \text{LeakyReLU}\big(\mathbf{W} [h_i \Vert h_j]\big) = \mathbf{a}^\top \text{LeakyReLU}\big(\mathbf{W}_l h_i + \mathbf{W}_r h_j\big)$

The paper shows that GAT's static attention mechanism fails on some graph problems, using a synthetic dictionary lookup dataset. It is a fully connected bipartite graph where each node in one set (the query nodes) has a key associated with it, and each node in the other set has both a key and a value associated with it. The goal is to predict the values of the query nodes. GAT fails on this task because of its limited static attention.
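To see that the GATv2 form can rank keys differently for different queries, here is a tiny hand-constructed sketch. All numbers are hypothetical and chosen only so the scores are easy to follow: features are scalars, the hidden size is 2, and the weights make the score equal to $0.8\,|h_i + h_j|$.

```python
import torch
import torch.nn.functional as F

# Hypothetical scalar features
q = torch.tensor([0.0, 1.0])    # query features h_i
k = torch.tensor([-1.0, 0.0])   # key features h_j

# Hypothetical parameters: W_l and W_r map a scalar to 2 hidden units
w_l = torch.tensor([1.0, -1.0])
w_r = torch.tensor([1.0, -1.0])
a = torch.tensor([1.0, 1.0])

g_l = q[:, None] * w_l  # W_l h_i, shape [n_queries, 2]
g_r = k[:, None] * w_r  # W_r h_j, shape [n_keys, 2]

# e[i, j] = a^T LeakyReLU(W_l h_i + W_r h_j)
pre = g_l[:, None, :] + g_r[None, :, :]          # [n_queries, n_keys, 2]
e = F.leaky_relu(pre, negative_slope=0.2) @ a    # [n_queries, n_keys]

top_key = e.argmax(dim=1)
print(e)
print(top_key)  # tensor([0, 1])
```

Query 0 scores key 0 highest while query 1 scores key 1 highest, so the ranking depends on the query. This is possible because the nonlinearity sits between the sum and the projection by $\mathbf{a}$.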

Here is the training code for a two-layer GATv2 on the Cora dataset.

```
import torch
from torch import nn

from labml_helpers.module import Module
```

This is a single graph attention v2 layer. A GATv2 is made up of multiple such layers. It takes $\mathbf{h} = \{h_1, h_2, \dots, h_N\}$, where $h_i \in \mathbb{R}^F$, as input and outputs $\mathbf{h}' = \{h'_1, h'_2, \dots, h'_N\}$, where $h'_i \in \mathbb{R}^{F'}$.

`class GraphAttentionV2Layer(Module):`

* `in_features`, $F$, is the number of input features per node
* `out_features`, $F'$, is the number of output features per node
* `n_heads`, $K$, is the number of attention heads
* `is_concat` specifies whether the multi-head results should be concatenated or averaged
* `dropout` is the dropout probability
* `leaky_relu_negative_slope` is the negative slope for the leaky ReLU activation
* `share_weights`: if set to `True`, the same matrix will be applied to the source and the target node of every edge

```
    def __init__(self, in_features: int, out_features: int, n_heads: int,
                 is_concat: bool = True,
                 dropout: float = 0.6,
                 leaky_relu_negative_slope: float = 0.2,
                 share_weights: bool = False):
```

```
        super().__init__()

        self.is_concat = is_concat
        self.n_heads = n_heads
        self.share_weights = share_weights
```

Calculate the number of dimensions per head

```
        if is_concat:
            assert out_features % n_heads == 0
```

If we are concatenating the multiple heads

```
            self.n_hidden = out_features // n_heads
        else:
```

If we are averaging the multiple heads

`self.n_hidden = out_features`
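As a quick illustration of this rule, the sketch below wraps it in a standalone function. The name `hidden_per_head` is hypothetical and not part of the implementation.

```python
def hidden_per_head(out_features: int, n_heads: int, is_concat: bool) -> int:
    """Number of features each attention head produces (hypothetical helper)."""
    if is_concat:
        # Heads are concatenated, so together they must add up to out_features
        assert out_features % n_heads == 0, "out_features must be divisible by n_heads"
        return out_features // n_heads
    # Heads are averaged, so each head already has out_features dimensions
    return out_features


print(hidden_per_head(64, 8, is_concat=True))   # 8
print(hidden_per_head(64, 8, is_concat=False))  # 64
```

With averaging, the linear layers therefore produce `n_hidden * n_heads = 512` features in this example, which the mean over heads reduces back to 64.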

Linear layer for initial source transformation; i.e. to transform the source node embeddings before self-attention

`self.linear_l = nn.Linear(in_features, self.n_hidden * n_heads, bias=False)`

If `share_weights` is `True`, the same linear layer is used for the target nodes

```
        if share_weights:
            self.linear_r = self.linear_l
        else:
            self.linear_r = nn.Linear(in_features, self.n_hidden * n_heads, bias=False)
```
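Assigning the same module to both attributes means the two transformations share one weight tensor. A small sketch, with made-up sizes and an `nn.ModuleDict` used only as a convenient container for counting, shows the effect on the number of trainable parameters:

```python
from torch import nn

in_features, out_total = 16, 32  # hypothetical sizes

# share_weights=True: both names refer to the same layer object
linear_l = nn.Linear(in_features, out_total, bias=False)
linear_r = linear_l
shared = nn.ModuleDict({'linear_l': linear_l, 'linear_r': linear_r})

# share_weights=False: two independent layers
separate = nn.ModuleDict({
    'linear_l': nn.Linear(in_features, out_total, bias=False),
    'linear_r': nn.Linear(in_features, out_total, bias=False),
})

# `parameters()` yields each shared tensor only once
n_shared = sum(p.numel() for p in shared.parameters())
n_separate = sum(p.numel() for p in separate.parameters())
print(n_shared, n_separate)  # 512 1024
```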

Linear layer to compute attention score $e_{ij}$

`self.attn = nn.Linear(self.n_hidden, 1, bias=False)`

The activation for attention score $e_{ij}$

`self.activation = nn.LeakyReLU(negative_slope=leaky_relu_negative_slope)`

Softmax to compute attention $\alpha_{ij}$

`self.softmax = nn.Softmax(dim=1)`

Dropout layer to be applied for attention

`self.dropout = nn.Dropout(dropout)`

* `h`, $\mathbf{h}$, is the input node embeddings of shape `[n_nodes, in_features]`.
* `adj_mat` is the adjacency matrix of shape `[n_nodes, n_nodes, n_heads]`. We use shape `[n_nodes, n_nodes, 1]` since the adjacency is the same for each head.

The adjacency matrix represents the edges (or connections) among nodes. `adj_mat[i][j]` is `True` if there is an edge from node `i` to node `j`.

`def forward(self, h: torch.Tensor, adj_mat: torch.Tensor):`
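One way to build an `adj_mat` of the expected shape is from a list of directed edges. The sketch below uses a made-up four-node graph. Adding self-loops is an assumption on my part rather than something this layer requires, but it guarantees that every row has at least one `True` entry, so no row is fully masked before the softmax.

```python
import torch

n_nodes = 4
# Hypothetical directed edges as (i, j) pairs
edges = torch.tensor([[0, 1], [1, 2], [2, 3]])

adj_mat = torch.zeros(n_nodes, n_nodes, dtype=torch.bool)
adj_mat[edges[:, 0], edges[:, 1]] = True  # adj_mat[i][j] is True for an edge i -> j

# Assumption: add self-loops so that no node has an empty neighbourhood
adj_mat |= torch.eye(n_nodes, dtype=torch.bool)

# Trailing dimension of size 1 is broadcast over all heads
adj_mat = adj_mat.unsqueeze(-1)
print(adj_mat.shape)  # torch.Size([4, 4, 1])
```

Node 3 has no outgoing edge in this toy graph. Without the self-loop its whole row would be masked to $-\infty$, and a softmax over such a row returns NaN.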

Number of nodes

`n_nodes = h.shape[0]`

The initial transformations, ${g_l}^k_i = \mathbf{W}_l^k h_i$ and ${g_r}^k_i = \mathbf{W}_r^k h_i$, for each head. We do two linear transformations and then split the results up for each head.

```
        g_l = self.linear_l(h).view(n_nodes, self.n_heads, self.n_hidden)
        g_r = self.linear_r(h).view(n_nodes, self.n_heads, self.n_hidden)
```
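The `view` call only reinterprets the output of one wide linear layer as `n_heads` consecutive blocks of `n_hidden` features. A small sketch with made-up sizes shows which columns end up in which head:

```python
import torch
from torch import nn

torch.manual_seed(0)
n_nodes, in_features, n_heads, n_hidden = 3, 5, 2, 4  # hypothetical sizes

h = torch.randn(n_nodes, in_features)
linear = nn.Linear(in_features, n_heads * n_hidden, bias=False)

flat = linear(h)                            # [n_nodes, n_heads * n_hidden]
g = flat.view(n_nodes, n_heads, n_hidden)   # [n_nodes, n_heads, n_hidden]

# Head k holds columns k * n_hidden ... (k + 1) * n_hidden - 1 of the flat output
head_1_matches = torch.equal(g[:, 1], flat[:, n_hidden:2 * n_hidden])
print(g.shape, head_1_matches)
```

So a single `nn.Linear` with `n_hidden * n_heads` outputs acts as `n_heads` separate weight matrices stacked along the output dimension.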

We calculate these for each head $k$. *We have omitted $\cdot^k$ for simplicity*.

$e_{ij} = a(\mathbf{W}_l h_i, \mathbf{W}_r h_j) = a({g_l}_i, {g_r}_j)$

$e_{ij}$ is the attention score (importance) from node $j$ to node $i$. We calculate this for each head.

$a$ is the attention mechanism that calculates the attention score. The paper sums ${g_l}_i$ and ${g_r}_j$, applies a $\text{LeakyReLU}$, and then does a linear transformation with a weight vector $\mathbf{a} \in \mathbb{R}^{F'}$.

$e_{ij} = \mathbf{a}^\top \text{LeakyReLU}\big([{g_l}_i + {g_r}_j]\big)$

Note: The paper describes $e_{ij}$ as $e_{ij} = \mathbf{a}^\top \text{LeakyReLU}\big(\mathbf{W} [h_i \Vert h_j]\big)$, which is equivalent to the definition we use here.

First we calculate $[{g_l}_i + {g_r}_j]$ for all pairs of $i, j$.

`g_l_repeat` gets $\{{g_l}_1, {g_l}_2, \dots, {g_l}_N, {g_l}_1, {g_l}_2, \dots, {g_l}_N, \dots\}$ where each node embedding is repeated `n_nodes` times.

`g_l_repeat = g_l.repeat(n_nodes, 1, 1)`

`g_r_repeat_interleave` gets $\{{g_r}_1, {g_r}_1, \dots, {g_r}_1, {g_r}_2, {g_r}_2, \dots, {g_r}_2, \dots\}$ where each node embedding is repeated `n_nodes` times.

`g_r_repeat_interleave = g_r.repeat_interleave(n_nodes, dim=0)`

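Now we add the two tensors to get $\{{g_l}_1 + {g_r}_1, {g_l}_2 + {g_r}_1, \dots, {g_l}_N + {g_r}_1, {g_l}_1 + {g_r}_2, {g_l}_2 + {g_r}_2, \dots, {g_l}_N + {g_r}_2, \dots\}$

`g_sum = g_l_repeat + g_r_repeat_interleave`

Reshape so that `g_sum[i, j]` is ${g_l}_j + {g_r}_i$. With this ordering `linear_r` transforms the query node $i$ and `linear_l` transforms the key node $j$, so the subscripts $l$ and $r$ are swapped relative to the formula for $e_{ij}$ above. With `share_weights=True` the two conventions coincide.

`g_sum = g_sum.view(n_nodes, n_nodes, self.n_heads, self.n_hidden)`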
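It is easy to lose track of which pair lands at which index after `repeat`, `repeat_interleave` and `view`. The following sketch, with random stand-in tensors and made-up sizes, checks the layout directly and compares it with an equivalent broadcasting expression (the broadcasting form is my own alternative, not part of the implementation above).

```python
import torch

torch.manual_seed(0)
n_nodes, n_heads, n_hidden = 3, 2, 4  # hypothetical sizes
g_l = torch.randn(n_nodes, n_heads, n_hidden)
g_r = torch.randn(n_nodes, n_heads, n_hidden)

# Same operations as in the layer
g_sum = g_l.repeat(n_nodes, 1, 1) + g_r.repeat_interleave(n_nodes, dim=0)
g_sum = g_sum.view(n_nodes, n_nodes, n_heads, n_hidden)

# Flat position p holds g_l[p % n_nodes] + g_r[p // n_nodes],
# so after the view, index [i, j] holds g_l[j] + g_r[i]
entry_ok = torch.equal(g_sum[2, 0], g_l[0] + g_r[2])

# Equivalent broadcasting form that avoids materialising the repeated tensors
g_sum_broadcast = g_r.unsqueeze(1) + g_l.unsqueeze(0)
matches = torch.equal(g_sum, g_sum_broadcast)
print(entry_ok, matches)
```

Running this confirms that the first index of `g_sum` selects the `g_r` embedding and the second index selects the `g_l` embedding.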

Calculate $e_{ij} = \mathbf{a}^\top \text{LeakyReLU}\big([{g_l}_i + {g_r}_j]\big)$. `e` is of shape `[n_nodes, n_nodes, n_heads, 1]`

`e = self.attn(self.activation(g_sum))`

Remove the last dimension of size `1`

`e = e.squeeze(-1)`

The adjacency matrix should have shape `[n_nodes, n_nodes, n_heads]` or `[n_nodes, n_nodes, 1]`

```
        assert adj_mat.shape[0] == 1 or adj_mat.shape[0] == n_nodes
        assert adj_mat.shape[1] == 1 or adj_mat.shape[1] == n_nodes
        assert adj_mat.shape[2] == 1 or adj_mat.shape[2] == self.n_heads
```

Mask $e_{ij}$ based on the adjacency matrix. $e_{ij}$ is set to $-\infty$ if there is no edge from $i$ to $j$.

`e = e.masked_fill(adj_mat == 0, float('-inf'))`

We then normalize the attention scores (or coefficients)

$\alpha_{ij} = \text{softmax}_j(e_{ij}) = \frac{\exp(e_{ij})}{\sum_{j' \in \mathcal{N}_i} \exp(e_{ij'})}$

where $\mathcal{N}_i$ is the set of nodes connected to $i$.

We do this by setting unconnected $e_{ij}$ to $-\infty$, which makes $\exp(e_{ij}) \sim 0$ for unconnected pairs.

`a = self.softmax(e)`
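The effect of masking followed by the softmax over `dim=1` can be seen on a toy example. In this sketch all raw scores are set to zero (a made-up choice) so that the resulting attention is simply uniform over each node's neighbours.

```python
import torch

# All-equal hypothetical scores, shape [n_nodes, n_nodes, n_heads]
e = torch.zeros(3, 3, 2)

# Hypothetical adjacency with self-loops, shape [n_nodes, n_nodes, 1]
adj_mat = torch.tensor([[1, 1, 0],
                        [0, 1, 0],
                        [1, 1, 1]], dtype=torch.bool).unsqueeze(-1)

# The size-1 head dimension of the mask broadcasts over both heads
e = e.masked_fill(adj_mat == 0, float('-inf'))
a = torch.softmax(e, dim=1)  # normalise over j for each i and head

print(a[:, :, 0])
# Rows: [0.5, 0.5, 0], [0, 1, 0], [1/3, 1/3, 1/3]
```

Masked positions receive exactly zero attention, and each row still sums to one over the remaining neighbours.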

Apply dropout regularization

`a = self.dropout(a)`

Calculate the final output for each head

$h'^k_i = \sum_{j \in \mathcal{N}_i} \alpha^k_{ij} {g_r}_{j,k}$

`attn_res = torch.einsum('ijh,jhf->ihf', a, g_r)`
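The einsum subscripts `'ijh,jhf->ihf'` sum over the key index `j` separately for every head `h`. As a sanity check, the sketch below (random stand-in tensors, made-up sizes) compares it with an explicit per-head matrix product.

```python
import torch

torch.manual_seed(0)
n_nodes, n_heads, n_hidden = 4, 2, 3  # hypothetical sizes

# Stand-ins for the attention weights and the transformed embeddings
a = torch.softmax(torch.randn(n_nodes, n_nodes, n_heads), dim=1)
g_r = torch.randn(n_nodes, n_heads, n_hidden)

attn_res = torch.einsum('ijh,jhf->ihf', a, g_r)  # [n_nodes, n_heads, n_hidden]

# Same computation written head by head: (a for head k) @ (g_r for head k)
per_head = torch.stack(
    [a[:, :, k] @ g_r[:, k, :] for k in range(n_heads)], dim=1)

close = torch.allclose(attn_res, per_head, atol=1e-6)
print(attn_res.shape, close)
```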

Concatenate the heads

`if self.is_concat:`

$h'_i = \Big\Vert_{k=1}^{K} h'^k_i$

`return attn_res.reshape(n_nodes, self.n_heads * self.n_hidden)`

Take the mean of the heads

`else:`

$h'_i = \frac{1}{K} \sum_{k=1}^{K} h'^k_i$

`return attn_res.mean(dim=1)`
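Putting the fragments together, the following is a condensed sketch of the layer and a small usage example. It assumes a plain `torch.nn.Module` base class instead of `labml_helpers.module.Module` so that it runs without extra dependencies. The class name `GraphAttentionV2LayerSketch`, the random graph and all sizes are made up for illustration.

```python
import torch
from torch import nn


class GraphAttentionV2LayerSketch(nn.Module):
    """Condensed restatement of the steps above (assumes a plain nn.Module base)."""

    def __init__(self, in_features, out_features, n_heads, is_concat=True,
                 dropout=0.6, leaky_relu_negative_slope=0.2, share_weights=False):
        super().__init__()
        self.is_concat = is_concat
        self.n_heads = n_heads
        if is_concat:
            assert out_features % n_heads == 0
            self.n_hidden = out_features // n_heads
        else:
            self.n_hidden = out_features
        self.linear_l = nn.Linear(in_features, self.n_hidden * n_heads, bias=False)
        self.linear_r = self.linear_l if share_weights else nn.Linear(
            in_features, self.n_hidden * n_heads, bias=False)
        self.attn = nn.Linear(self.n_hidden, 1, bias=False)
        self.activation = nn.LeakyReLU(negative_slope=leaky_relu_negative_slope)
        self.softmax = nn.Softmax(dim=1)
        self.dropout = nn.Dropout(dropout)

    def forward(self, h, adj_mat):
        n_nodes = h.shape[0]
        g_l = self.linear_l(h).view(n_nodes, self.n_heads, self.n_hidden)
        g_r = self.linear_r(h).view(n_nodes, self.n_heads, self.n_hidden)
        # Pairwise sums, then scores of shape [n_nodes, n_nodes, n_heads]
        g_sum = g_l.repeat(n_nodes, 1, 1) + g_r.repeat_interleave(n_nodes, dim=0)
        g_sum = g_sum.view(n_nodes, n_nodes, self.n_heads, self.n_hidden)
        e = self.attn(self.activation(g_sum)).squeeze(-1)
        # Mask non-edges, normalise over neighbours, apply dropout
        e = e.masked_fill(adj_mat == 0, float('-inf'))
        a = self.dropout(self.softmax(e))
        attn_res = torch.einsum('ijh,jhf->ihf', a, g_r)
        if self.is_concat:
            return attn_res.reshape(n_nodes, self.n_heads * self.n_hidden)
        return attn_res.mean(dim=1)


torch.manual_seed(0)
n_nodes = 5
h = torch.randn(n_nodes, 8)
# Hypothetical random graph; self-loops keep every row non-empty
adj_mat = (torch.rand(n_nodes, n_nodes) > 0.5) | torch.eye(n_nodes, dtype=torch.bool)
adj_mat = adj_mat.unsqueeze(-1)  # [n_nodes, n_nodes, 1]

# eval() disables dropout so the outputs are deterministic
concat_layer = GraphAttentionV2LayerSketch(8, 16, n_heads=4, is_concat=True).eval()
mean_layer = GraphAttentionV2LayerSketch(8, 16, n_heads=4, is_concat=False).eval()
out_concat = concat_layer(h, adj_mat)
out_mean = mean_layer(h, adj_mat)
print(out_concat.shape, out_mean.shape)  # both torch.Size([5, 16])
```

Both modes return `[n_nodes, out_features]`. They differ only in whether each head contributes `out_features // n_heads` features (concatenation) or a full `out_features` vector that is then averaged.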