This is an implementation of the Wasserstein GAN (WGAN) from the paper *Wasserstein GAN* (Arjovsky et al., 2017).

The original GAN loss is based on the Jensen-Shannon (JS) divergence between the real distribution $P_{r}$ and the generated distribution $P_{g}$. The Wasserstein GAN is based on the Earth Mover distance between these distributions.

$W(P_r, P_g) = \inf_{\gamma \in \Pi(P_r, P_g)} E_{(x,y) \sim \gamma} \Vert x - y \Vert$

$\Pi(P_r, P_g)$ is the set of all joint distributions $\gamma(x, y)$ whose marginals are $P_r$ and $P_g$, respectively.

$E_{(x,y) \sim \gamma} \Vert x - y \Vert$ is the expected transport cost for a given joint distribution (transport plan) $\gamma$, where $(x, y)$ are sample pairs drawn from $\gamma$.

So $W(P_r, P_g)$ is equal to the least expected transport cost over all joint distributions between the real distribution $P_r$ and the generated distribution $P_g$.
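As a side illustration (not part of the implementation below, and assuming SciPy is available), the earth mover distance between one-dimensional distributions can be estimated directly from samples with `scipy.stats.wasserstein_distance`; shifting a unit Gaussian by 3 gives a distance of roughly 3:

```
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(0)
# Samples standing in for the real distribution P_r and the generated distribution P_g
real_samples = rng.normal(loc=0.0, scale=1.0, size=10_000)
generated_samples = rng.normal(loc=3.0, scale=1.0, size=10_000)

# For 1-D distributions the optimal transport plan has a closed form,
# so the earth mover distance can be computed directly from the samples.
print(wasserstein_distance(real_samples, generated_samples))  # ~3.0
```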

The paper shows that the Jensen-Shannon (JS) divergence and other measures of the difference between two probability distributions are not smooth in the parameters of the generated distribution, so gradient descent on a parameterized distribution will not converge. For example, if $P_r$ and $P_\theta$ are uniform distributions on two parallel lines a distance $|\theta|$ apart, the JS divergence is the constant $\log 2$ for every $\theta \neq 0$ and gives no useful gradient, whereas the Wasserstein distance is simply $|\theta|$.

Based on the Kantorovich-Rubinstein duality, $W(P_r, P_g) = \sup_{\Vert f \Vert_L \le 1} E_{x \sim P_r}[f(x)] - E_{x \sim P_g}[f(x)]$

where the supremum is taken over all 1-Lipschitz functions $f$, i.e. those with $\Vert f \Vert_L \le 1$.

That is, it is equal to the greatest difference $E_{x∼P_{r}}[f(x)]−E_{x∼P_{g}}[f(x)]$ among all 1-Lipschitz functions.

For $K$-Lipschitz functions, $W(P_r, P_g) = \sup_{\Vert f \Vert_L \le K} E_{x \sim P_r}\left[\frac{1}{K} f(x)\right] - E_{x \sim P_g}\left[\frac{1}{K} f(x)\right]$

If all $K$-Lipschitz functions can be represented as $f_{w}$ where $f$ is parameterized by $w∈W$,

$K \cdot W(P_r, P_g) = \max_{w \in W} E_{x \sim P_r}[f_w(x)] - E_{x \sim P_g}[f_w(x)]$

If $P_g$ is represented by a generator $g_{\theta}(z)$, where $z$ comes from a known distribution $z \sim p(z)$,

$K \cdot W(P_r, P_\theta) = \max_{w \in W} E_{x \sim P_r}[f_w(x)] - E_{z \sim p(z)}[f_w(g_\theta(z))]$

Now, to make $g_{\theta}$ converge to $P_{r}$, we can do gradient descent on $\theta$ to minimize the formula above.

Similarly, we can find $\max_{w \in W}$ by gradient ascent on $w$, while keeping $K$ bounded. *One way to keep $K$ bounded is to clip all the weights of the neural network that defines $f$ to a fixed range.*
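For instance, a minimal sketch of that clipping step (assuming the critic $f_w$ is a `torch.nn.Module`; the paper's Algorithm 1 uses a clip range of $c = 0.01$):

```
import torch

def clip_critic_weights(f_w: torch.nn.Module, c: float = 0.01):
    # Clamp every parameter of the critic to [-c, +c] after each update,
    # so that the function f_w stays (approximately) K-Lipschitz.
    with torch.no_grad():
        for p in f_w.parameters():
            p.clamp_(-c, c)
```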

Here is the code to try this on a simple MNIST generation experiment.

```
import torch.utils.data
from torch.nn import functional as F

from labml_helpers.module import Module
```

We want to find $w$ to maximize $E_{x \sim P_r}[f_w(x)] - E_{z \sim p(z)}[f_w(g_\theta(z))]$, so we minimize $-\frac{1}{m} \sum_{i=1}^m f_w\big(x^{(i)}\big) + \frac{1}{m} \sum_{i=1}^m f_w\big(g_\theta(z^{(i)})\big)$

`class DiscriminatorLoss(Module):`

* `f_real` is $f_w(x)$
* `f_fake` is $f_w(g_\theta(z))$

This returns a tuple with the losses for $f_w(x)$ and $f_w(g_\theta(z))$, which are later added together. They are kept separate for logging.

`    def forward(self, f_real: torch.Tensor, f_fake: torch.Tensor):`

We use ReLUs to clip the loss to keep $f$ in the $[-1, +1]$ range.

`        return F.relu(1 - f_real).mean(), F.relu(1 + f_fake).mean()`
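A hedged usage sketch of the critic (discriminator) update; `discriminator`, `generator`, `z`, `real_images`, and the optimizer are hypothetical placeholders, not names from this file:

```
discriminator_loss = DiscriminatorLoss()

f_real = discriminator(real_images)            # f_w(x)
f_fake = discriminator(generator(z).detach())  # f_w(g_θ(z)), generator kept frozen here
loss_real, loss_fake = discriminator_loss(f_real, f_fake)
(loss_real + loss_fake).backward()             # the two parts are added for the update
discriminator_optimizer.step()
```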

We want to find $\theta$ to minimize $E_{x \sim P_r}[f_w(x)] - E_{z \sim p(z)}[f_w(g_\theta(z))]$. The first component is independent of $\theta$, so we minimize $-\frac{1}{m} \sum_{i=1}^m f_w\big(g_\theta(z^{(i)})\big)$

`class GeneratorLoss(Module):`

* `f_fake` is $f_w(g_\theta(z))$

`    def forward(self, f_fake: torch.Tensor):`

`        return -f_fake.mean()`
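And a similar sketch for the generator update (again with hypothetical `discriminator`, `generator`, `z`, and optimizer names):

```
generator_loss = GeneratorLoss()

f_fake = discriminator(generator(z))  # f_w(g_θ(z)); gradients flow back into θ
loss = generator_loss(f_fake)         # -E[f_w(g_θ(z))]
loss.backward()
generator_optimizer.step()
```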