This is a PyTorch implementation of the paper [Generalized Advantage Estimation](https://arxiv.org/abs/1506.02438).

You can find an experiment that uses it here.

`import numpy as np`

`class GAE:`

```
    def __init__(self, n_workers: int, worker_steps: int, gamma: float, lambda_: float):
        self.lambda_ = lambda_
        self.gamma = gamma
        self.worker_steps = worker_steps
        self.n_workers = n_workers
```
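For example, construction might look like this (the hyper-parameter values below are typical PPO-style choices for illustration, not taken from the source):

```
gae = GAE(n_workers=8, worker_steps=128, gamma=0.99, lambda_=0.95)
```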

$\hat{A}_t^{(1)}$ is high bias, low variance, whilst $\hat{A}_t^{(\infty)}$ is unbiased, high variance.

We take a weighted average of $\hat{A}_t^{(k)}$ to balance bias and variance. This is called Generalized Advantage Estimation. $$\hat{A}_t = \hat{A}_t^{GAE} = \sum_k w_k \hat{A}_t^{(k)}$$ We set $w_k = \lambda^{k-1}$; this gives a clean calculation for $\hat{A}_t$:
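To see why these weights telescope into a simple form, here is a short sketch of the algebra, assuming the infinite-horizon case and the $(1-\lambda)$ normalization used in the GAE paper, where the $k$-step advantage is $\hat{A}_t^{(k)} = \sum_{l=0}^{k-1} \gamma^l \delta_{t+l}$:

$$\begin{aligned}
(1-\lambda)\sum_{k=1}^{\infty} \lambda^{k-1}\hat{A}_t^{(k)}
&= (1-\lambda)\sum_{k=1}^{\infty} \lambda^{k-1}\sum_{l=0}^{k-1}\gamma^{l}\delta_{t+l} \\
&= (1-\lambda)\sum_{l=0}^{\infty}\gamma^{l}\delta_{t+l}\sum_{k=l+1}^{\infty}\lambda^{k-1} \\
&= (1-\lambda)\sum_{l=0}^{\infty}\gamma^{l}\delta_{t+l}\,\frac{\lambda^{l}}{1-\lambda}
= \sum_{l=0}^{\infty}(\gamma\lambda)^{l}\,\delta_{t+l}
\end{aligned}$$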

$$\begin{aligned}
\delta_t &= r_t + \gamma V(s_{t+1}) - V(s_t) \\
\hat{A}_t &= \delta_t + \gamma\lambda\,\delta_{t+1} + \cdots + (\gamma\lambda)^{T-t-1}\,\delta_{T-1} \\
&= \delta_t + \gamma\lambda\,\hat{A}_{t+1}
\end{aligned}$$

`    def __call__(self, done: np.ndarray, rewards: np.ndarray, values: np.ndarray) -> np.ndarray:`

advantages table

```
        advantages = np.zeros((self.n_workers, self.worker_steps), dtype=np.float32)
        last_advantage = 0
```

$V(s_{t+1})$; initially the value of the state after the last step

```
        last_value = values[:, -1]

        for t in reversed(range(self.worker_steps)):
```

mask is zero if the episode completed after step $t$

```
            mask = 1.0 - done[:, t]
            last_value = last_value * mask
            last_advantage = last_advantage * mask
```
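For instance, a minimal sketch of the masking with two hypothetical workers (all numbers made up for illustration):

```
import numpy as np

done_t = np.array([0.0, 1.0])  # worker 1 finished its episode at step t
mask = 1.0 - done_t            # -> [1.0, 0.0]

# the carried-over bootstrap value and advantage are zeroed for worker 1,
# so nothing leaks across the episode boundary
last_value = np.array([0.5, 0.8]) * mask       # -> [0.5, 0.0]
last_advantage = np.array([0.2, -0.1]) * mask  # -> [0.2, -0.0]
```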

$\delta_t$

`            delta = rewards[:, t] + self.gamma * last_value - values[:, t]`

$\hat{A}_t = \delta_t + \gamma\lambda\,\hat{A}_{t+1}$

`            last_advantage = delta + self.gamma * self.lambda_ * last_advantage`
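As a quick sanity check of this update with made-up numbers (not from the source): with $\gamma = 0.99$, $\lambda = 0.95$, $\delta_t = 1.0$ and $\hat{A}_{t+1} = 2.0$,

$$\hat{A}_t = 1.0 + 0.99 \cdot 0.95 \cdot 2.0 = 1.0 + 1.881 = 2.881$$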

Note that we are collecting the advantages in reverse order. *My initial code appended to a list, and I forgot to reverse it later. It took me around 4 to 5 hours to find the bug. The performance of the model was improving slightly during the initial runs, probably because consecutive samples are similar.*

```
            advantages[:, t] = last_advantage

            last_value = values[:, t]

        return advantages
```
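A minimal usage sketch, assuming `values` carries one extra bootstrap column so that `values[:, -1]` is the value of the state after the final step (shapes, hyper-parameters, and random data here are illustrative, not from the source):

```
import numpy as np

n_workers, worker_steps = 4, 8
gae = GAE(n_workers=n_workers, worker_steps=worker_steps, gamma=0.99, lambda_=0.95)

# episode-termination flags and rewards for each worker and step
done = np.zeros((n_workers, worker_steps), dtype=np.float32)
rewards = np.random.randn(n_workers, worker_steps).astype(np.float32)
# one extra column holds the bootstrap value V(s_T), read via values[:, -1]
values = np.random.randn(n_workers, worker_steps + 1).astype(np.float32)

advantages = gae(done, rewards, values)
assert advantages.shape == (n_workers, worker_steps)
```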