This implements the Transformer XL model using relative multi-head attention.
This is an implementation of the compressive transformer, which extends Transformer XL by compressing the oldest memories to give a longer attention span.
This is an implementation of the GPT-2 architecture.
This is an implementation of the paper GLU Variants Improve Transformer.
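As a rough illustration of what the paper proposes, a gated feed-forward layer computes `(activation(x W) * (x V)) W2`, and the variants differ only in the activation applied to the gate. The sketch below uses plain Python lists instead of PyTorch tensors to stay dependency-free. The names `glu_ffn`, `matvec`, `W_gate`, `W_value` and `W_out` are made up for this example and are not taken from this package.

```python
import math


def gelu(x):
    # Exact GELU: x * Phi(x), where Phi is the standard normal CDF.
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def relu(x):
    return max(0.0, x)


def identity(x):
    return x


def matvec(W, x):
    # W is a list of rows; returns the matrix-vector product W x.
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def glu_ffn(x, W_gate, W_value, W_out, activation):
    # Gate branch: activation applied to one linear projection of x.
    gate = [activation(g) for g in matvec(W_gate, x)]
    # Value branch: a second linear projection of x, no activation.
    value = matvec(W_value, x)
    # Element-wise product of the two branches, then the output projection.
    hidden = [g * v for g, v in zip(gate, value)]
    return matvec(W_out, hidden)


# Hypothetical toy weights, chosen only so the arithmetic is easy to follow.
eye = [[1.0, 0.0], [0.0, 1.0]]
W_out = [[1.0, 1.0]]
x = [1.0, -1.0]

reglu_out = glu_ffn(x, eye, eye, W_out, relu)         # ReLU gate (ReGLU)
bilinear_out = glu_ffn(x, eye, eye, W_out, identity)  # no activation (Bilinear)
geglu_out = glu_ffn(x, eye, eye, W_out, gelu)         # GELU gate (GEGLU)
```

Swapping the `activation` argument is the only change needed to move between variants, which is why the paper can compare them under the same parameter budget.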
This is an implementation of the paper Generalization through Memorization: Nearest Neighbor Language Models.
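The core idea of the paper is to interpolate the language model's next-token distribution with a distribution built from the nearest neighbors of the current context in a datastore. A minimal sketch in plain Python, assuming the neighbors have already been retrieved as `(distance, target_token)` pairs; the function names, the `temperature` parameter and the default `lam` are illustrative assumptions, not this package's API.

```python
import math
from collections import defaultdict


def knn_distribution(neighbors, temperature=1.0):
    # neighbors: list of (distance, target_token) pairs.
    # Weight each neighbor by softmax(-distance) and sum weights per token.
    if not neighbors:
        return {}
    logits = [-d / temperature for d, _ in neighbors]
    m = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(l - m) for l in logits]
    total = sum(weights)
    probs = defaultdict(float)
    for w, (_, token) in zip(weights, neighbors):
        probs[token] += w / total
    return dict(probs)


def interpolate(p_lm, p_knn, lam=0.25):
    # p(y | x) = lam * p_kNN(y | x) + (1 - lam) * p_LM(y | x)
    tokens = set(p_lm) | set(p_knn)
    return {
        t: lam * p_knn.get(t, 0.0) + (1.0 - lam) * p_lm.get(t, 0.0)
        for t in tokens
    }


# Hypothetical toy inputs: the LM is undecided, both neighbors say "a".
p_lm = {"a": 0.5, "b": 0.5}
p_knn = knn_distribution([(0.0, "a"), (0.0, "a")])
p_final = interpolate(p_lm, p_knn, lam=0.25)
```

Here the retrieved neighbors shift probability mass toward `"a"` while the result still sums to one, because both inputs are valid distributions and the mixing weights sum to one.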
This is an implementation of the paper Accessing Higher-level Representations in Sequential Transformers with Feedback Memory.
This is a miniature implementation of the paper Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity. Our implementation has only a few million parameters and doesn’t do model-parallel distributed training. It trains on a single GPU, but it implements the switching concept as described in the paper.
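The switching idea can be sketched roughly as follows: a router produces one logit per expert for every token, each token is sent only to its highest-probability expert, and the expert output is scaled by that routing probability. This is a plain Python illustration, not the code in this package; `switch_route`, `switch_layer` and the toy experts are hypothetical names, and the paper's load-balancing loss and expert capacity limits are left out.

```python
import math


def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]


def switch_route(router_logits):
    # For each token, pick the single expert with the highest routing
    # probability and return (expert_index, routing_probability).
    routes = []
    for logits in router_logits:
        probs = softmax(logits)
        expert = max(range(len(probs)), key=probs.__getitem__)
        routes.append((expert, probs[expert]))
    return routes


def switch_layer(tokens, router_logits, experts):
    # Run each token through its chosen expert only, then scale the
    # expert output by the routing probability.
    outputs = []
    for x, (idx, p) in zip(tokens, switch_route(router_logits)):
        outputs.append([p * v for v in experts[idx](x)])
    return outputs


# Hypothetical toy experts standing in for per-expert feed-forward networks.
experts = [
    lambda x: [2.0 * v for v in x],
    lambda x: [-v for v in x],
]
tokens = [[1.0, 2.0], [3.0, 4.0]]
router_logits = [[2.0, 0.0], [0.0, 2.0]]

routes = switch_route(router_logits)
outputs = switch_layer(tokens, router_logits, experts)
```

Because only one expert runs per token, adding experts increases the parameter count without increasing the per-token compute. Scaling by the routing probability is what lets gradients reach the router.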
This is a PyTorch implementation of the paper Linear Transformers Are Secretly Fast Weight Memory Systems.
from .configs import TransformerConfigs
from .models import TransformerLayer, Encoder, Decoder, Generator, EncoderDecoder
from .mha import MultiHeadAttention
from labml_nn.transformers.xl.relative_mha import RelativeMultiHeadAttention