Sophia Optimizer

This is a PyTorch implementation of Sophia-G from the paper Sophia: A Scalable Stochastic Second-order Optimizer for Language Model Pre-training. The official implementation is available at Liuhong99/Sophia.

Sophia is more adaptive to heterogeneous curvatures than Adam, more resistant to non-convexity and rapid changes in the Hessian than Newton's method, and uses a low-cost pre-conditioner.

Sophia keeps diagonal Hessian estimates with an EMA across iterations,

$$h_t = \beta_2 h_{t-k} + (1 - \beta_2) \hat{h}_t$$

The diagonal Hessian estimate $\hat{h}_t$ is calculated every $k$ steps.

Sophia uses an EMA of gradients, $m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t$, only considers positive entries of the diagonal Hessian, and does per-coordinate clipping to the update:

$$\theta_{t+1} \leftarrow \theta_t - \eta \cdot \operatorname{clip}\!\left(\frac{m_t}{\max\{h_t, \epsilon\}},\ \rho\right)$$

where $\operatorname{clip}(z, \rho)$ clamps each coordinate of $z$ to $[-\rho, \rho]$, and $\epsilon$ is a very small value to prevent division by $0$.
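For concreteness, here is a minimal sketch of this clipped update on a single tensor, separate from the implementation below; the tensor values and the hyper-parameters eta, rho, and eps are illustrative assumptions.

```python
import torch

# Illustrative hyper-parameters (assumed values)
eta, rho, eps = 1e-4, 0.03, 1e-12

theta = torch.randn(10)  # parameter
m = torch.randn(10)      # EMA of gradients
h = torch.rand(10)       # EMA of the diagonal Hessian estimate (non-negative)

# clip(m / max(h, eps), rho): clamp each coordinate to [-rho, rho]
update = (m / torch.clamp(h, min=eps)).clamp(-rho, rho)
theta = theta - eta * update  # each coordinate changes by at most eta * rho
```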

Gauss-Newton-Bartlett (GNB) estimator

$$\hat{h}_t = B \cdot \left( \nabla_\theta \frac{1}{B} \sum_{b=1}^{B} \ell_{CE}\big(f_\theta(x_b), \hat{y}_b\big) \right) \odot \left( \nabla_\theta \frac{1}{B} \sum_{b=1}^{B} \ell_{CE}\big(f_\theta(x_b), \hat{y}_b\big) \right)$$

where $x_b$ are the inputs, $B$ is the batch size (number of inputs/tokens), $\ell_{CE}$ is the cross-entropy loss, and $\hat{y}_b$ are labels sampled from the logits $f_\theta(x_b)$.

Note that this Hessian estimate is always non-negative (it is a scaled element-wise square of a gradient), and therefore we can replace $\max\{h_t, \epsilon\}$ with $h_t + \epsilon$.
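As a rough sketch of how the GNB estimate can be formed in PyTorch, assuming a hypothetical classifier model and input batch x (neither is part of this file): sample labels from the model's own logits, backpropagate the cross-entropy loss on those labels, and scale the squared gradients by the batch size.

```python
import torch
import torch.nn.functional as F

def gnb_hessian_estimate(model: torch.nn.Module, x: torch.Tensor):
    """Diagonal GNB Hessian estimate, B * g_hat ** 2, for each parameter (sketch)."""
    logits = model(x)  # [B, n_classes]
    batch_size = logits.shape[0]
    # Sample labels from the model's own output distribution
    y_hat = torch.distributions.Categorical(logits=logits).sample()
    # Cross-entropy loss on the sampled labels (the GNB loss)
    model.zero_grad()
    F.cross_entropy(logits, y_hat).backward()
    # B * (gradient of the GNB loss) ** 2, per parameter
    return [batch_size * p.grad.detach() ** 2
            for p in model.parameters() if p.grad is not None]
```

In the implementation below, the scaling by $B$ and the EMA accumulation are performed inside update_hessian, so a caller only needs to backpropagate the sampled-label loss and pass the number of tokens in the batch.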

Sophia with the Gauss-Newton-Bartlett (GNB) estimator is called Sophia-G.

Here is an experiment that uses Sophia-G to train a transformer.
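Before the implementation, here is a hypothetical end-to-end sketch of how the optimizer is driven: a regular gradient step every iteration, and a Hessian-EMA refresh from the GNB loss every k steps. The toy linear model, random data, and k = 10 are assumptions for illustration; the Sophia class is the one defined below.

```python
import torch
import torch.nn.functional as F
from torch import nn

model = nn.Linear(16, 8)  # toy stand-in for a transformer
optimizer = Sophia(model.parameters(), lr=1e-4, betas=(0.9, 0.95), rho=0.03)
k = 10  # refresh the Hessian EMA every k steps (assumed value)

for step in range(100):
    x = torch.randn(32, 16)
    y = torch.randint(0, 8, (32,))

    # Every k steps: backpropagate the GNB loss (cross entropy on labels sampled
    # from the model's own logits) and update the diagonal Hessian EMA
    if step % k == 0:
        logits = model(x)
        y_hat = torch.distributions.Categorical(logits=logits).sample()
        optimizer.zero_grad()
        F.cross_entropy(logits, y_hat).backward()
        optimizer.update_hessian(x.shape[0])

    # Regular training step with the true labels
    optimizer.zero_grad()
    loss = F.cross_entropy(model(x), y)
    loss.backward()
    optimizer.step()
```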

from typing import Dict, Any, Tuple, Optional

import torch
from torch import nn

from labml_nn.optimizers import GenericAdaptiveOptimizer, WeightDecay

Sophia-G Optimizer

We extend the class GenericAdaptiveOptimizer defined in __init__.py to implement the Sophia optimizer.

class Sophia(GenericAdaptiveOptimizer):

Initialize the optimizer

  • params is the list of parameters
  • lr is the maximum learning rate (each coordinate moves by at most lr per step)
  • betas is a tuple of ($\beta_1$, $\beta_2$)
  • eps is $\epsilon$
  • rho is $\rho$, the per-coordinate clipping threshold
  • weight_decay is an instance of class WeightDecay defined in __init__.py
  • defaults is a dictionary of default values for parameter groups. This is useful when you want to extend the class Sophia.
    def __init__(self, params,
                 lr: float = 1e-4, betas: Tuple[float, float] = (0.9, 0.95), eps: float = 1e-12,
                 rho: float = 0.03,
                 weight_decay: WeightDecay = WeightDecay(),
                 defaults: Optional[Dict[str, Any]] = None):
        defaults = {} if defaults is None else defaults
        defaults.update(weight_decay.defaults())
        defaults.update(dict(rho=rho))
        super().__init__(params, defaults, lr, betas, eps)

        self.weight_decay = weight_decay
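A hypothetical construction call, with illustrative hyper-parameter values (not recommendations from the paper):

```python
from torch import nn

model = nn.Linear(16, 8)
optimizer = Sophia(model.parameters(),
                   lr=2e-4,              # maximum per-coordinate step size
                   betas=(0.965, 0.99),  # beta_1 for the gradient EMA, beta_2 for the Hessian EMA
                   eps=1e-12,            # added to the Hessian diagonal before dividing
                   rho=0.04)             # per-coordinate clipping threshold
```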

Initialize a parameter state

  • state is the optimizer state of the parameter (tensor)
  • group stores optimizer attributes of the parameter group
  • param is the parameter tensor
    def init_state(self, state: Dict[str, Any], group: Dict[str, Any], param: nn.Parameter):

This is the number of optimizer steps taken on the parameter, $t$

        state['step'] = 0

Exponential moving average of gradients, $m_t$

        state['exp_avg'] = torch.zeros_like(param, memory_format=torch.preserve_format)

Exponential moving average of the Hessian diagonal, $h_t$

        state['hessian'] = torch.zeros_like(param, memory_format=torch.preserve_format)

Update the EMA of the Hessian diagonal, $h_t$. This is expected to be called periodically (every $k$ steps in the paper), after backpropagating the cross-entropy loss on labels sampled from the model's logits, so that p.grad holds the gradient of the GNB loss.

  • n_tokens_training_batch is the number of tokens/inputs in the batch
    def update_hessian(self, n_tokens_training_batch: int):

Iterate through parameter groups

        for group in self.param_groups:

Get $\beta_2$, the EMA coefficient for the diagonal Hessian

            _, beta2 = group['betas']

Iterate through parameters

            for p in group['params']:

Skip parameters without gradients

                if p.grad is None:
                    continue

Get optimizer state

                state = self.state[p]

Initialize state if empty

                if len(state) == 0:
                    self.init_state(state, group, p)

Update the EMA of the Hessian diagonal, $h_t \leftarrow \beta_2 h_{t-k} + (1 - \beta_2) \cdot B \cdot \hat{g} \odot \hat{g}$, where $\hat{g}$ is the gradient of the GNB loss and $B$ is the number of tokens in the batch

                state['hessian'].mul_(beta2).addcmul_(p.grad, p.grad, value=(1 - beta2) * n_tokens_training_batch)
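For readability, the chained in-place call above is equivalent to the following out-of-place form (a sketch; ema_hessian_update is a hypothetical helper, not part of the optimizer):

```python
import torch

def ema_hessian_update(hessian: torch.Tensor, grad: torch.Tensor,
                       beta2: float, n_tokens_training_batch: int) -> torch.Tensor:
    # Same result as hessian.mul_(beta2).addcmul_(grad, grad, value=(1 - beta2) * B)
    return beta2 * hessian + (1 - beta2) * n_tokens_training_batch * grad * grad
```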

Take an update step for a given parameter tensor

  • state is the optimizer state of the parameter (tensor)
  • group stores optimizer attributes of the parameter group
  • grad is the current gradient tensor for the parameter
  • param is the parameter tensor

We do the following parameter update,

$$\theta_{t+1} \leftarrow \theta_t - \eta \cdot \operatorname{clip}\!\left(\frac{m_t}{h_t + \epsilon},\ \rho\right) \quad \text{where} \quad \eta = \frac{\text{lr}}{\rho}$$

Since every coordinate of the clipped ratio lies in $[-\rho, \rho]$, no parameter coordinate moves by more than lr in a single step, which is why lr is the maximum learning rate.

    def step_param(self, state: Dict[str, Any], group: Dict[str, Any], grad: torch.Tensor, param: torch.nn.Parameter):

Calculate weight decay

        grad = self.weight_decay(param, grad, group)

Get $\beta_1$ and $\beta_2$

        beta1, beta2 = group['betas']

Get $\rho$

        rho = group['rho']

Get $m_t$ and $h_t$

        m, hessian = state['exp_avg'], state['hessian']

In-place calculation of $m_t \leftarrow \beta_1 m_{t-1} + (1 - \beta_1) g_t$

        m.mul_(beta1).add_(grad, alpha=1 - beta1)

Increment the number of optimizer steps

        state['step'] += 1

Get maximum learning rate

        lr = group['lr']

$\eta = \frac{\text{lr}}{\rho}$

        eta = lr / rho

Per-coordinate clipped, pre-conditioned ratio, $\operatorname{clip}\left(\frac{m_t}{h_t + \epsilon}, \rho\right)$

        ratio = (m / (hessian + group['eps'])).clamp(-rho, rho)

Update the parameters, $\theta_{t+1} \leftarrow \theta_t - \eta \cdot \operatorname{clip}\left(\frac{m_t}{h_t + \epsilon}, \rho\right)$

        param.data.add_(ratio, alpha=-eta)
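As a quick standalone check of the "maximum learning rate" property (illustrative values, not part of the optimizer): since each coordinate of the clipped ratio is at most $\rho$ in magnitude and $\eta = \text{lr}/\rho$, no coordinate of the update exceeds lr.

```python
import torch

lr, rho, eps = 1e-4, 0.03, 1e-12  # illustrative hyper-parameters
m, hessian = torch.randn(1000), torch.rand(1000)

eta = lr / rho
ratio = (m / (hessian + eps)).clamp(-rho, rho)
step = eta * ratio

# |ratio| <= rho, so |step| <= (lr / rho) * rho = lr (up to floating-point rounding)
assert step.abs().max().item() <= lr * (1 + 1e-6)
```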