LLM.int8() on GPT-NeoX

This implements a utility function to transform an nn.Linear layer into an LLM.int8() linear layer.

The LLM.int8() paper shows that you can use int8 quantization, while handling outliers, to reduce the memory footprint of large language models without degrading performance. The weights and inputs are converted to scaled 8-bit integers, and the matrix multiplication produces int32 results, which are then converted back to float16 and rescaled. The paper shows that in large language models some features can take extreme values (outliers) that dominate the model's output. These features get clamped in 8-bit integer space, which degrades model performance. As a solution, the outlier features (those greater than a specified threshold) are picked out and their multiplications are computed separately in float16 space. Since the percentage of outliers is around 0.01%, this doesn't increase memory usage much, and it prevents the degradation in model performance.
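To make the decomposition concrete, here is a minimal sketch of the idea, not the actual bitsandbytes kernels. The helper name int8_matmul_with_outliers and the per-tensor absmax scaling are assumptions for illustration; the paper uses vector-wise (per-row and per-column) scaling and true int8 matrix multiplication with int32 accumulation.

import torch

def int8_matmul_with_outliers(x: torch.Tensor, w: torch.Tensor, threshold: float = 6.0):
    # Input feature columns with any magnitude above the threshold are outliers
    outlier_cols = (x.abs() > threshold).any(dim=0)
    regular_cols = ~outlier_cols

    # Regular features: absmax-quantize inputs and weights to int8
    # (per-tensor scales here for brevity; the paper scales per row/column)
    x_r, w_r = x[:, regular_cols], w[regular_cols, :]
    x_scale = x_r.abs().max() / 127.
    w_scale = w_r.abs().max() / 127.
    x_q = (x_r / x_scale).round().to(torch.int8)
    w_q = (w_r / w_scale).round().to(torch.int8)
    # The real kernel multiplies int8 values with int32 accumulation;
    # we cast back to float here only to keep the sketch runnable anywhere
    y_regular = (x_q.float() @ w_q.float()) * (x_scale * w_scale)

    # Outlier features: multiply directly in floating point (float16 in the paper)
    y_outlier = x[:, outlier_cols] @ w[outlier_cols, :]

    return y_regular + y_outlier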

The code to transform GPT-NeoX layers is defined in model.py.

Here are example uses of GPT-NeoX with int8 quantization.


Import the bitsandbytes package

try:
    from bitsandbytes.nn import Linear8bitLt, Int8Params
except ImportError:
    raise ImportError('''Please install `bitsandbytes` with `pip install bitsandbytes -U`''')

import torch
from torch import nn

Transform an nn.Linear layer into an LLM.int8() linear layer

  • linear_module is the nn.Linear layer to transform
  • device is the device of the model
  • threshold is the threshold to use for outlier detection
def make_llm_int8_linear(linear_module: nn.Linear, device: torch.device, threshold: float = 6.0):

    assert isinstance(linear_module, nn.Linear)

Create an empty Linear8bitLt module

    int8_lin = Linear8bitLt(
        linear_module.in_features,
        linear_module.out_features,
        linear_module.bias is not None,
        has_fp16_weights=False,
        threshold=threshold,
    )

Quantize the weights

    int8_lin._parameters['weight'] = Int8Params(linear_module.weight.data.cpu(),
                                                requires_grad=False,
                                                has_fp16_weights=False).to(device)

Set the bias in float16 space

    if linear_module.bias is not None:
        int8_lin._parameters['bias'] = nn.Parameter(linear_module.bias.data,
                                                    requires_grad=False)

    return int8_lin
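
As a usage sketch, one way to apply this transformation is to walk a loaded module and replace every nn.Linear with its LLM.int8() counterpart. The helper convert_linears_to_int8 below is illustrative and not part of model.py, which transforms specific GPT-NeoX layers directly.

def convert_linears_to_int8(module: nn.Module, device: torch.device, threshold: float = 6.0):
    # Recursively swap every nn.Linear sub-module for an LLM.int8() linear layer
    for name, child in module.named_children():
        if isinstance(child, nn.Linear):
            setattr(module, name, make_llm_int8_linear(child, device, threshold))
        else:
            convert_linears_to_int8(child, device, threshold)
    return module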