#

Proximal Policy Optimization - PPO

This is a PyTorch implementation of Proximal Policy Optimization - PPO.

PPO is a policy gradient method for reinforcement learning. Simple policy gradient methods do a single gradient update per sample (or a set of samples). Doing multiple gradient steps for a single sample causes problems because the policy deviates too much, producing a bad policy. PPO lets us do multiple gradient updates per sample by trying to keep the policy close to the policy that was used to sample data. It does so by clipping gradient flow if the updated policy is not close to the policy used to sample the data.

You can find an experiment that uses it here. The experiment uses Generalized Advantage Estimation.

import torch

from labml_helpers.module import Module
from labml_nn.rl.ppo.gae import GAE

#

PPO Loss

Here's how the PPO update rule is derived.

We want to maximize the policy reward $\max_θ J(π_θ) = \mathbb{E}_{τ \sim π_θ} \Bigl[\sum_{t=0}^{\infty} γ^t r_t\Bigr]$, where $r$ is the reward, $π$ is the policy, $τ$ is a trajectory sampled from the policy, and $γ$ is the discount factor in $[0, 1]$.
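As a small illustration of this objective, the discounted return of one finite trajectory can be computed directly. This is a minimal sketch with made-up rewards; the helper name `discounted_return` is hypothetical and not part of the implementation.

```python
from typing import Sequence


def discounted_return(rewards: Sequence[float], gamma: float) -> float:
    """Sum of gamma^t * r_t over a finite trajectory (hypothetical helper)."""
    total = 0.0
    # Work backwards so each step is one multiply-add: G_t = r_t + gamma * G_{t+1}
    for r in reversed(rewards):
        total = r + gamma * total
    return total


print(discounted_return([1.0, 1.0, 1.0], gamma=0.5))  # 1 + 0.5 + 0.25 = 1.75
```

$J(π_θ)$ is the expectation of this quantity over trajectories sampled from the policy.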

$$
\begin{aligned}
\mathbb{E}_{τ \sim π_θ} \Biggl[\sum_{t=0}^{\infty} γ^t A^{π_{OLD}}(s_t, a_t)\Biggr]
&= \mathbb{E}_{τ \sim π_θ} \Biggl[\sum_{t=0}^{\infty} γ^t \Bigl(Q^{π_{OLD}}(s_t, a_t) - V^{π_{OLD}}(s_t)\Bigr)\Biggr] \\
&= \mathbb{E}_{τ \sim π_θ} \Biggl[\sum_{t=0}^{\infty} γ^t \Bigl(r_t + γ V^{π_{OLD}}(s_{t+1}) - V^{π_{OLD}}(s_t)\Bigr)\Biggr] \\
&= \mathbb{E}_{τ \sim π_θ} \Biggl[\sum_{t=0}^{\infty} γ^t r_t\Biggr] - \mathbb{E}_{τ \sim π_θ} \Bigl[V^{π_{OLD}}(s_0)\Bigr] \\
&= J(π_θ) - J(π_{θ_{OLD}})
\end{aligned}
$$
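The step that separates the rewards from the value terms is a telescoping sum. The sketch below checks it numerically on one made-up finite trajectory, assuming the one-step form $r_t + γ V(s_{t+1}) - V(s_t)$ and a terminal value of zero. The function name `check_telescoping` and all numbers are hypothetical.

```python
def check_telescoping(rewards, values, gamma):
    """Evaluate both sides of the telescoping identity on one finite trajectory.

    values[t] stands in for V(s_t). It has one more entry than rewards,
    and the final entry (terminal state) is assumed to be 0.
    """
    steps = len(rewards)
    # Left side: discounted sum of one-step advantage terms
    lhs = sum(gamma ** t * (rewards[t] + gamma * values[t + 1] - values[t])
              for t in range(steps))
    # Right side: discounted return minus the value of the start state
    rhs = sum(gamma ** t * rewards[t] for t in range(steps)) - values[0]
    return lhs, rhs


lhs, rhs = check_telescoping(rewards=[1.0, 0.0, 2.0],
                             values=[0.7, -0.3, 1.5, 0.0],
                             gamma=0.9)
print(abs(lhs - rhs) < 1e-12)  # True
```

Every intermediate value appears once with a plus sign and once with a minus sign, so only $-V(s_0)$ survives.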

So, $\arg\max_θ J(π_θ) = \arg\max_θ \mathbb{E}_{τ \sim π_θ} \Bigl[\sum_{t=0}^{\infty} γ^t A^{π_{OLD}}(s_t, a_t)\Bigr]$, since $J(π_{θ_{OLD}})$ does not depend on $θ$.

Define the discounted-future state distribution, $d^π(s) = (1 - γ) \sum_{t=0}^{\infty} γ^t P(s_t = s \mid π)$
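To make the definition concrete, here is a minimal sketch that approximates $d^π$ for a small made-up Markov chain by truncating the infinite sum. The function name, the transition matrix `P`, and the `horizon` cutoff are assumptions for illustration only.

```python
def discounted_state_distribution(P, start, gamma, horizon=2000):
    """Approximate d(s) = (1 - gamma) * sum_t gamma^t * P(s_t = s) with a truncated sum.

    P[i][j] is the probability of moving from state i to state j under the policy.
    """
    n = len(P)
    p_t = [1.0 if s == start else 0.0 for s in range(n)]  # distribution of s_0
    d = [0.0] * n
    weight = 1.0 - gamma  # holds (1 - gamma) * gamma^t, starting at t = 0
    for _ in range(horizon):
        for s in range(n):
            d[s] += weight * p_t[s]
        # Advance the state distribution by one step: p_{t+1} = p_t P
        p_t = [sum(p_t[i] * P[i][j] for i in range(n)) for j in range(n)]
        weight *= gamma
    return d


# Two states; state 1 is absorbing
d = discounted_state_distribution(P=[[0.5, 0.5], [0.0, 1.0]], start=0, gamma=0.9)
print(d, sum(d))
```

The factor $(1 - γ)$ is what makes the weights sum to one, so $d^π$ is a proper probability distribution over states.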

Then,

$$
\begin{aligned}
J(π_θ) - J(π_{θ_{OLD}})
&= \mathbb{E}_{τ \sim π_θ} \Biggl[\sum_{t=0}^{\infty} γ^t A^{π_{OLD}}(s_t, a_t)\Biggr] \\
&= \frac{1}{1 - γ} \mathbb{E}_{s \sim d^{π_θ}, a \sim π_θ} \Bigl[A^{π_{OLD}}(s, a)\Bigr]
\end{aligned}
$$

Importance sampling $a$ from $π_{θ_{OLD}}$,

$$
\begin{aligned}
J(π_θ) - J(π_{θ_{OLD}})
&= \frac{1}{1 - γ} \mathbb{E}_{s \sim d^{π_θ}, a \sim π_θ} \Bigl[A^{π_{OLD}}(s, a)\Bigr] \\
&= \frac{1}{1 - γ} \mathbb{E}_{s \sim d^{π_θ}, a \sim π_{θ_{OLD}}} \Biggl[\frac{π_θ(a \mid s)}{π_{θ_{OLD}}(a \mid s)} A^{π_{OLD}}(s, a)\Biggr]
\end{aligned}
$$
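The importance-sampling step can be checked exactly for a single state with a small discrete action set. The probabilities and advantage values below are made up for illustration, and `expectation` is a hypothetical helper.

```python
def expectation(probs, f):
    """Exact expectation of f(a) under a discrete distribution over actions 0..n-1."""
    return sum(p * f(a) for a, p in enumerate(probs))


pi_new = [0.2, 0.5, 0.3]      # stands in for pi_theta(a | s)
pi_old = [0.4, 0.4, 0.2]      # stands in for pi_theta_OLD(a | s)
advantage = [1.0, -2.0, 0.5]  # made-up advantage per action

# Expectation taken directly under the new policy
direct = expectation(pi_new, lambda a: advantage[a])
# Same quantity, taken under the old policy and reweighted by the probability ratio
reweighted = expectation(pi_old, lambda a: pi_new[a] / pi_old[a] * advantage[a])
print(direct, reweighted)
```

Both values agree up to floating-point rounding, which is why actions sampled from the old policy can be used to evaluate the new one.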

Then we assume $d^{π_θ}(s)$ and $d^{π_{θ_{OLD}}}(s)$ are similar. The error we introduce to $J(π_θ) - J(π_{θ_{OLD}})$ by this assumption is bounded by the KL divergence between $π_θ$ and $π_{θ_{OLD}}$. Constrained Policy Optimization shows the proof of this. I haven't read it.

$$
\begin{aligned}
J(π_θ) - J(π_{θ_{OLD}})
&= \frac{1}{1 - γ} \mathbb{E}_{s \sim d^{π_θ}, a \sim π_{θ_{OLD}}} \Biggl[\frac{π_θ(a \mid s)}{π_{θ_{OLD}}(a \mid s)} A^{π_{OLD}}(s, a)\Biggr] \\
&\approx \frac{1}{1 - γ} \mathbb{E}_{s \sim d^{π_{θ_{OLD}}}, a \sim π_{θ_{OLD}}} \Biggl[\frac{π_θ(a \mid s)}{π_{θ_{OLD}}(a \mid s)} A^{π_{OLD}}(s, a)\Biggr] \\
&= \frac{1}{1 - γ} \mathcal{L}^{CPI}
\end{aligned}
$$

class ClippedPPOLoss(Module):

#

    def __init__(self):
        super().__init__()

#

    def forward(self, log_pi: torch.Tensor, sampled_log_pi: torch.Tensor,
                advantage: torch.Tensor, clip: float) -> torch.Tensor:

#

The ratio $r_t(θ) = \frac{π_θ(a_t \mid s_t)}{π_{θ_{OLD}}(a_t \mid s_t)}$; this is different from the rewards $r_t$.

        ratio = torch.exp(log_pi - sampled_log_pi)
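Working with log-probabilities keeps the ratio well defined even when the probabilities themselves underflow. A small sketch with made-up numbers, using plain `math` rather than `torch`:

```python
import math

log_pi = -800.0          # log-probability under the current policy (made-up)
sampled_log_pi = -801.0  # log-probability under the sampling policy (made-up)

# Exponentiating each term separately underflows to zero in float64,
# so dividing the two probabilities is not possible
print(math.exp(log_pi), math.exp(sampled_log_pi))  # 0.0 0.0

# Subtracting in log space first gives the ratio directly
ratio = math.exp(log_pi - sampled_log_pi)
print(ratio)  # e, about 2.718
```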

#

Clipping the policy ratio

$$
\mathcal{L}^{CLIP}(θ) = \mathbb{E}_{a_t, s_t \sim π_{θ_{OLD}}} \biggl[\min\Bigl(r_t(θ) \bar{A_t}, \operatorname{clip}\bigl(r_t(θ), 1 - ϵ, 1 + ϵ\bigr) \bar{A_t}\Bigr)\biggr]
$$

The ratio is clipped to stay close to 1. We take the minimum so that the gradient only pulls $π_θ$ towards $π_{θ_{OLD}}$ when the ratio is outside the range $[1 - ϵ, 1 + ϵ]$. This keeps the KL divergence between $π_θ$ and $π_{θ_{OLD}}$ constrained. A large deviation can cause performance collapse, where the policy performance drops and doesn't recover because we are sampling from a bad policy.

Using the normalized advantage $\bar{A_t} = \frac{\hat{A_t} - μ(\hat{A_t})}{σ(\hat{A_t})}$ introduces a bias to the policy gradient estimator, but it reduces variance a lot.
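One common way to compute the normalized advantage over a batch is sketched below. The helper name `normalize_advantage` and the small constant `eps` are assumptions for illustration; this is not necessarily how the accompanying experiment does it.

```python
import torch


def normalize_advantage(adv: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Shift to zero mean and scale to unit standard deviation (hypothetical helper).

    eps is an assumed small constant that guards against division by zero.
    """
    return (adv - adv.mean()) / (adv.std() + eps)


adv = torch.tensor([1.0, 2.0, 3.0, 6.0])  # made-up advantage estimates
normalized = normalize_advantage(adv)
print(normalized.mean().item(), normalized.std().item())  # approximately 0 and 1
```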

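        # Clip the ratio to the range [1 - clip, 1 + clip]
        clipped_ratio = ratio.clamp(min=1.0 - clip,
                                    max=1.0 + clip)
        # Element-wise minimum of the unclipped and clipped objectives
        policy_reward = torch.min(ratio * advantage,
                                  clipped_ratio * advantage)

        # Fraction of samples whose ratio is outside the clip range, stored on the module
        self.clip_fraction = (abs((ratio - 1.0)) > clip).to(torch.float).mean()

        # Negate, because the optimizer minimizes the loss
        return -policy_reward.mean()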
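To see the effect of clipping on the gradient, the sketch below re-implements the computation above as a standalone function (so it runs without `labml_helpers`) and backpropagates through three made-up samples. The name `clipped_surrogate` is hypothetical.

```python
import torch


def clipped_surrogate(log_pi, sampled_log_pi, advantage, clip):
    """Standalone copy of the computation above, returning per-sample values."""
    ratio = torch.exp(log_pi - sampled_log_pi)
    clipped_ratio = ratio.clamp(min=1.0 - clip, max=1.0 + clip)
    return torch.min(ratio * advantage, clipped_ratio * advantage)


# Three samples with positive advantage; the ratios are about 1.0, 1.1 and 1.5
sampled_log_pi = torch.zeros(3)
log_pi = torch.log(torch.tensor([1.0, 1.1, 1.5])).requires_grad_()
advantage = torch.ones(3)

loss = -clipped_surrogate(log_pi, sampled_log_pi, advantage, clip=0.2).mean()
loss.backward()
print(log_pi.grad)  # the last entry is 0: its ratio is already above 1 + clip
```

With a positive advantage, a sample stops contributing gradient once its ratio exceeds `1 + clip`. With a negative advantage, the same happens once the ratio drops below `1 - clip`.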

#

Clipped Value Function Loss

Similarly, we also clip the value function update.

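$$
\begin{aligned}
V^{π_θ}_{CLIP}(s_t) &= \hat{V_t} + \operatorname{clip}\Bigl(V^{π_θ}(s_t) - \hat{V_t}, -ϵ, +ϵ\Bigr) \\
\mathcal{L}^{VF}(θ) &= \frac{1}{2} \mathbb{E} \biggl[\max\Bigl(\bigl(V^{π_θ}(s_t) - R_t\bigr)^2, \bigl(V^{π_θ}_{CLIP}(s_t) - R_t\bigr)^2\Bigr)\biggr]
\end{aligned}
$$

Here $\hat{V_t}$ is the value estimate recorded when the data was sampled (`sampled_value`) and $R_t$ is the sampled return (`sampled_return`).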

Clipping makes sure the value function $V_θ$ doesn't deviate significantly from $V_{θ_{OLD}}$.

class ClippedValueFunctionLoss(Module):

#

    def forward(self, value: torch.Tensor, sampled_value: torch.Tensor, sampled_return: torch.Tensor, clip: float):
        # Keep the new value estimate within `clip` of the value recorded at sampling time
        clipped_value = sampled_value + (value - sampled_value).clamp(min=-clip, max=clip)
        # Take the larger of the unclipped and clipped squared errors
        vf_loss = torch.max((value - sampled_return) ** 2, (clipped_value - sampled_return) ** 2)
        return 0.5 * vf_loss.mean()
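A small worked case shows what the clipping does here. The sketch copies the computation above into a standalone function (hypothetical name `clipped_value_loss`) and uses made-up numbers in which the new prediction has already moved further than `clip` towards the return.

```python
import torch


def clipped_value_loss(value, sampled_value, sampled_return, clip):
    """Standalone copy of the forward computation above."""
    clipped_value = sampled_value + (value - sampled_value).clamp(min=-clip, max=clip)
    vf_loss = torch.max((value - sampled_return) ** 2, (clipped_value - sampled_return) ** 2)
    return 0.5 * vf_loss.mean()


sampled_value = torch.tensor([1.0])   # value predicted when the data was sampled
sampled_return = torch.tensor([2.0])  # return used as the regression target
value = torch.tensor([1.8], requires_grad=True)  # new prediction, moved by 0.8

loss = clipped_value_loss(value, sampled_value, sampled_return, clip=0.2)
loss.backward()
print(loss.item(), value.grad.item())  # about 0.32 and exactly 0.0
```

The clipped prediction is 1.2, whose squared error (0.64) is larger than the unclipped one (0.04), so the maximum selects the clipped branch. That branch has no gradient with respect to `value`, so this sample no longer pushes the prediction further from the sampled value.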