#

LLM.int8() on GPT-NeoX

This implements a utility function that converts an `nn.Linear` layer to an LLM.int8() linear layer.

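The LLM.int8() paper shows that int8 quantization, combined with special handling of outliers, can reduce the memory footprint of large language models without degrading their performance. The method converts weights and inputs to scaled 8-bit integers, multiplies the matrices to produce int32 results, then converts those results back to float16 and rescales them. The authors show that in large language models some features take extreme values (outliers) that dominate the model's output. These features get clamped in the 8-bit integer space, which degrades model performance. As a solution, they pick out the outliers (values greater than a specified threshold) and compute their multiplications separately in float16. Since outliers make up only about 0.01% of the values, this does not noticeably increase memory usage, and it prevents the loss in model performance.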
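The procedure described above can be sketched in a few lines of dependency-free Python. This is only an illustration of the idea, not the bitsandbytes implementation: the function names, the absolute-maximum scaling per input row and per weight column, and the sample numbers are all assumptions made for this example.

```python
def absmax_quantize(values):
    """Map floats to integers in [-127, 127]; return them with the scale used."""
    scale = max((abs(v) for v in values), default=0.0) / 127.0
    if scale == 0.0:  # all-zero (or empty) vector: any scale works
        scale = 1.0
    return [round(v / scale) for v in values], scale


def llm_int8_matmul(x, w, threshold=6.0):
    """Multiply x [n, k] by w [k, m], both given as lists of rows."""
    k_dim, m_dim = len(w), len(w[0])
    # Feature dimensions where any input magnitude exceeds the threshold
    outlier = [k for k in range(k_dim) if any(abs(row[k]) > threshold for row in x)]
    regular = [k for k in range(k_dim) if k not in outlier]

    # Quantize every output column of w over the regular features only
    w_cols = [absmax_quantize([w[k][j] for k in regular]) for j in range(m_dim)]

    out = []
    for row in x:
        x_q, x_scale = absmax_quantize([row[k] for k in regular])
        out_row = []
        for j, (w_q, w_scale) in enumerate(w_cols):
            acc = sum(a * b for a, b in zip(x_q, w_q))      # integer accumulation
            high = sum(row[k] * w[k][j] for k in outlier)   # outliers stay in floating point
            out_row.append(acc * x_scale * w_scale + high)  # rescale and combine
        out.append(out_row)
    return out


# Illustrative data: the value 40.0 plays the role of an outlier feature
x = [[0.5, -1.2, 40.0, 0.3],
     [1.1, 0.7, -0.2, -0.9]]
w = [[0.2, -0.5], [0.1, 0.4], [-0.3, 0.25], [0.6, -0.1]]

approx = llm_int8_matmul(x, w)
exact = [[sum(a * b for a, b in zip(row, col)) for col in zip(*w)] for row in x]
```

Passing `threshold=float('inf')` keeps every feature on the integer path. For the first row this should give a visibly larger error, because the value 40.0 stretches the quantization scale for all the other features in that row.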

The code that transforms the GPT-NeoX layers is defined in model.py.

Here is an example of using GPT-NeoX with int8 quantization.

#

Import the bitsandbytes package

try:
    from bitsandbytes.nn import Linear8bitLt, Int8Params
except ImportError:
    raise ImportError('''Please install `bitsandbytes` with `pip install bitsandbytes -U`''')

import torch
from torch import nn

#

Convert an `nn.Linear` layer to an LLM.int8() linear layer

linear_module is the `nn.Linear` layer to transform
device is the device of the model
threshold is the threshold $\alpha$ to use for outlier detection

def make_llm_int8_linear(linear_module: nn.Linear, device: torch.device, threshold: float = 6.0):

#

    assert isinstance(linear_module, nn.Linear)

#

Create an empty Linear8bitLt module

    int8_lin = Linear8bitLt(
        linear_module.in_features,
        linear_module.out_features,
        linear_module.bias is not None,
        has_fp16_weights=False,
        threshold=threshold,
    )

#

Quantize the weights

    int8_lin._parameters['weight'] = Int8Params(linear_module.weight.data.cpu(),
                                                requires_grad=False,
                                                has_fp16_weights=False).to(device)
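A rough back-of-the-envelope estimate shows why storing the weights as int8 saves memory. The layer size and the overhead of one float32 scale per output row are illustrative assumptions, not values read from bitsandbytes.

```python
def linear_weight_bytes(in_features, out_features, bytes_per_weight, scale_bytes_per_row=0):
    """Bytes for a weight matrix plus an optional per-output-row scale."""
    return out_features * (in_features * bytes_per_weight + scale_bytes_per_row)


# Hypothetical layer size, chosen only for illustration
in_features, out_features = 6144, 24576

fp16_bytes = linear_weight_bytes(in_features, out_features, bytes_per_weight=2)
int8_bytes = linear_weight_bytes(in_features, out_features, bytes_per_weight=1,
                                 scale_bytes_per_row=4)  # assumed float32 scale per row

print(f'float16: {fp16_bytes / 2**20:.1f} MiB')  # float16: 288.0 MiB
print(f'int8:    {int8_bytes / 2**20:.1f} MiB')  # int8:    144.1 MiB
```

Under these assumptions the quantized weight takes roughly half the space of the float16 weight, and the per-row scales add well under 0.1% on top.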

#

Set the bias in float16 space

    if linear_module.bias is not None:
        int8_lin._parameters['bias'] = nn.Parameter(linear_module.bias.data,
                                                    requires_grad=False)

#

    return int8_lin
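One possible way to apply such a conversion to a whole model is to walk its submodules and swap each `nn.Linear` in place. The helper below is a hypothetical sketch for illustration: `replace_linear_layers` is not part of this code base or of bitsandbytes, and the GPT-NeoX layers themselves are transformed by the code in model.py.

```python
from typing import Callable

from torch import nn


def replace_linear_layers(module: nn.Module,
                          convert: Callable[[nn.Linear], nn.Module]) -> nn.Module:
    """Recursively replace every nn.Linear child of `module` with `convert(child)`.

    The root module itself is never replaced, only its descendants.
    """
    # Copy the child list first so that replacing entries is safe while looping
    for name, child in list(module.named_children()):
        if isinstance(child, nn.Linear):
            setattr(module, name, convert(child))
        else:
            replace_linear_layers(child, convert)
    return module


# Hypothetical usage with the function defined above
# (assumes bitsandbytes is installed and a CUDA device is available):
#
#   device = torch.device('cuda:0')
#   model = replace_linear_layers(model, lambda lin: make_llm_int8_linear(lin, device))
```

Taking the converter as a callable keeps the traversal independent of bitsandbytes, so the same helper can be tried on a CPU with a stand-in converter.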
