You can find an experiment that uses it here.
import numpy as np
class GAE:
    def __init__(self, n_workers: int, worker_steps: int, gamma: float, lambda_: float):
        self.lambda_ = lambda_
        self.gamma = gamma
        self.worker_steps = worker_steps
        self.n_workers = n_workers
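The one-step estimate $\hat{A}_t^{(1)} = r_t + \gamma V(s_{t+1}) - V(s_t)$ is high bias, low variance, whilst the infinite-step estimate $\hat{A}_t^{(\infty)} = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \dots - V(s_t)$ is unbiased, high variance.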
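We take a weighted average of the $k$-step estimates $\hat{A}_t^{(k)}$ to balance bias and variance. This is called Generalized Advantage Estimation:

$$\hat{A}_t = \hat{A}_t^{GAE} = \frac{\sum_k w_k \hat{A}_t^{(k)}}{\sum_k w_k}$$

We set $w_k = \lambda^{k-1}$, which gives a clean calculation for $\hat{A}_t$:

$$\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$$

$$\hat{A}_t = \delta_t + \gamma \lambda \hat{A}_{t+1}$$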
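As a sanity check on the recursion $\hat{A}_t = \delta_t + \gamma \lambda \hat{A}_{t+1}$, the sketch below compares it against the unrolled sum $\sum_l (\gamma \lambda)^l \delta_{t+l}$ on a single short trajectory. The function names `gae_recursive` and `gae_explicit` and the toy numbers are made up for illustration, and `values` is assumed to carry one extra entry for the state after the last step.

```python
import numpy as np


def gae_recursive(rewards, values, gamma, lambda_):
    """Backward recursion A_t = delta_t + gamma * lambda * A_{t+1}.

    Assumes `values` has len(rewards) + 1 entries (bootstrap value last).
    """
    steps = len(rewards)
    advantages = np.zeros(steps)
    last_advantage = 0.0
    for t in reversed(range(steps)):
        # TD residual delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        last_advantage = delta + gamma * lambda_ * last_advantage
        advantages[t] = last_advantage
    return advantages


def gae_explicit(rewards, values, gamma, lambda_):
    """Unrolled form A_t = sum_l (gamma * lambda)^l * delta_{t+l}."""
    steps = len(rewards)
    deltas = rewards + gamma * values[1:] - values[:-1]
    advantages = np.zeros(steps)
    for t in range(steps):
        weights = (gamma * lambda_) ** np.arange(steps - t)
        advantages[t] = np.sum(weights * deltas[t:])
    return advantages


# Toy trajectory (illustrative numbers only)
rewards = np.array([1.0, 0.0, 2.0, -1.0])
values = np.array([0.5, 0.4, 0.3, 0.2, 0.1])

a_rec = gae_recursive(rewards, values, gamma=0.99, lambda_=0.95)
a_sum = gae_explicit(rewards, values, gamma=0.99, lambda_=0.95)
print(np.allclose(a_rec, a_sum))
```

With `lambda_=0.0` the recursion collapses to the one-step TD residual $\delta_t$, which is the high-bias, low-variance end of the trade-off.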
    def __call__(self, done: np.ndarray, rewards: np.ndarray, values: np.ndarray) -> np.ndarray:
        advantages = np.zeros((self.n_workers, self.worker_steps), dtype=np.float32)
        last_advantage = 0

        last_value = values[:, -1]

        for t in reversed(range(self.worker_steps)):
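Mask out the next-state value and the running advantage if the episode completed after step $t$, so that nothing carries over across an episode boundary.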
            mask = 1.0 - done[:, t]
            last_value = last_value * mask
            last_advantage = last_advantage * mask
            # TD residual: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
            delta = rewards[:, t] + self.gamma * last_value - values[:, t]
            # A_t = delta_t + gamma * lambda * A_{t+1}
            last_advantage = delta + self.gamma * self.lambda_ * last_advantage
Note that we are collecting in reverse order. My initial code was appending to a list and I forgot to reverse it later. It took me around 4 to 5 hours to find the bug. The performance of the model was still improving slightly during initial runs, probably because the samples are similar.
            advantages[:, t] = last_advantage

            last_value = values[:, t]

        return advantages
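To make the array shapes and the effect of the `done` mask concrete, here is a minimal standalone sketch of the same loop. The function name `compute_gae` and the toy inputs are hypothetical. It assumes `done` and `rewards` have shape `(n_workers, worker_steps)` and `values` has one extra column, `(n_workers, worker_steps + 1)`, so that `values[:, -1]` is the value of the state after the final step.

```python
import numpy as np


def compute_gae(done, rewards, values, gamma, lambda_):
    """Standalone version of the backward GAE loop (illustrative sketch)."""
    n_workers, worker_steps = rewards.shape
    advantages = np.zeros((n_workers, worker_steps), dtype=np.float32)
    last_advantage = np.zeros(n_workers, dtype=np.float32)
    # Assumption: the last column of `values` is the bootstrap value
    last_value = values[:, -1]

    for t in reversed(range(worker_steps)):
        # Zero out anything that belongs to the next episode
        mask = 1.0 - done[:, t]
        last_value = last_value * mask
        last_advantage = last_advantage * mask
        delta = rewards[:, t] + gamma * last_value - values[:, t]
        last_advantage = delta + gamma * lambda_ * last_advantage
        # Write by index, so the result stays in forward time order
        advantages[:, t] = last_advantage
        last_value = values[:, t]

    return advantages


# Two workers, three steps; worker 0 finishes an episode after step 1
done = np.array([[False, True, False],
                 [False, False, False]])
rewards = np.ones((2, 3), dtype=np.float32)
values = np.full((2, 4), 0.5, dtype=np.float32)

advantages = compute_gae(done, rewards, values, gamma=0.99, lambda_=0.95)
print(advantages.shape)
print(advantages[0, 1])  # r - V(s) = 0.5, since nothing is bootstrapped past the episode end
```

Writing into `advantages[:, t]` by index, rather than appending to a list, is what keeps the output in forward time order even though the loop runs backwards.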