This module contains PyTorch implementations and explanations of the original transformer from the paper Attention Is All You Need, along with its derivatives and enhancements.
This implements the Transformer XL model using relative multi-head attention.
This implements Rotary Positional Embeddings (RoPE).
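As a rough illustration of the idea (not the module's exact layout; this sketch rotates interleaved feature pairs, and the tensor shape is an assumption), RoPE encodes positions by rotating pairs of query/key features by position-dependent angles:

```python
import torch


def rope(x: torch.Tensor, base: float = 10_000.0) -> torch.Tensor:
    # x: [seq_len, batch, heads, d] with d even; returns x with rotary embeddings applied
    seq_len, _, _, d = x.shape
    theta = base ** (-torch.arange(0, d, 2, dtype=torch.float32) / d)     # [d/2] frequencies
    angles = torch.arange(seq_len, dtype=torch.float32)[:, None] * theta  # [seq_len, d/2]
    cos, sin = angles.cos()[:, None, None, :], angles.sin()[:, None, None, :]
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin  # rotate each (x1, x2) pair by its angle
    out[..., 1::2] = x1 * sin + x2 * cos
    return out
```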
This implements Attention with Linear Biases (ALiBi).
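A minimal sketch of the bias ALiBi adds to the attention logits (an illustration, not the module's code; the slope formula assumes the number of heads is a power of two, as in the paper):

```python
import torch


def alibi_biases(n_heads: int, seq_len: int) -> torch.Tensor:
    # Head-specific slopes: a geometric sequence starting at 2^(-8 / n_heads)
    slopes = torch.tensor([2.0 ** (-8.0 * (i + 1) / n_heads) for i in range(n_heads)])
    # distance[i, j] = j - i, so more distant past keys get a more negative bias
    distance = torch.arange(seq_len)[None, :] - torch.arange(seq_len)[:, None]
    # [n_heads, seq_len, seq_len]; add this to the attention scores before the softmax
    return slopes[:, None, None] * distance[None, :, :]
```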
This implements the Retrieval-Enhanced Transformer (RETRO).
This is an implementation of the compressive transformer, which extends Transformer XL by compressing the oldest memories to give a longer attention span.
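The compression step can be pictured as below (a hedged sketch; the module's actual compression function and shapes may differ): the oldest memories are squeezed by a compression rate `c` using a strided 1D convolution.

```python
import torch
import torch.nn as nn


class Compress(nn.Module):
    """Illustrative compression of old memories by a factor `c`."""

    def __init__(self, d_model: int, c: int = 4):
        super().__init__()
        self.conv = nn.Conv1d(d_model, d_model, kernel_size=c, stride=c)

    def forward(self, mem: torch.Tensor) -> torch.Tensor:
        # mem: [mem_len, batch, d_model] -> [mem_len // c, batch, d_model]
        # (assumes mem_len is a multiple of c)
        return self.conv(mem.permute(1, 2, 0)).permute(2, 0, 1)
```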
This is an implementation of the GPT-2 architecture.
This is an implementation of the paper GLU Variants Improve Transformer.
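One of the variants, GEGLU, replaces the first layer of the position-wise FFN with a gated linear unit. A minimal sketch of that variant (an illustration, not the module's code):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class GEGLU_FFN(nn.Module):
    """Illustrative FFN with a GELU-gated linear unit: (GELU(x W1) * (x V)) W2."""

    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.w1 = nn.Linear(d_model, d_ff)
        self.v = nn.Linear(d_model, d_ff)
        self.w2 = nn.Linear(d_ff, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w2(F.gelu(self.w1(x)) * self.v(x))
```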
This is an implementation of the paper Generalization through Memorization: Nearest Neighbor Language Models.
This is an implementation of the paper Accessing Higher-level Representations in Sequential Transformers with Feedback Memory.
This is a miniature implementation of the paper Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity. Our implementation has only a few million parameters and doesn't do model-parallel distributed training. It trains on a single GPU, but we implement the switching concept described in the paper, as sketched below.
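A minimal sketch of that switching idea, assuming a flattened batch of token vectors; the class and argument names here are illustrative, not the module's API. A router picks a single expert feed-forward network per token, so only one expert's parameters are used for each token.

```python
import torch
import torch.nn as nn


class SwitchFFN(nn.Module):
    """Illustrative top-1 (switch) routing over expert FFNs."""

    def __init__(self, d_model: int, d_ff: int, n_experts: int):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: [n_tokens, d_model] (sequence and batch dimensions flattened together)
        probs = self.router(x).softmax(dim=-1)
        top_prob, top_expert = probs.max(dim=-1)  # top-1 routing decision per token
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            idx = top_expert == i
            if idx.any():
                # scale by the routing probability so the router receives gradients
                out[idx] = expert(x[idx]) * top_prob[idx].unsqueeze(-1)
        return out
```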
This is an implementation of the paper Linear Transformers Are Secretly Fast Weight Memory Systems in PyTorch.
This is an implementation of the paper FNet: Mixing Tokens with Fourier Transforms.
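The core idea can be sketched in a couple of lines (an illustration, not the module's code): self-attention is replaced by a parameter-free Fourier transform over the hidden and sequence dimensions, keeping only the real part.

```python
import torch


def fourier_mix(x: torch.Tensor) -> torch.Tensor:
    # x: [seq_len, batch, d_model]; FFT over the hidden dimension, then over the sequence
    return torch.fft.fft(torch.fft.fft(x, dim=-1), dim=0).real
```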
This is an implementation of the paper An Attention Free Transformer.
This is an implementation of the Masked Language Model used for pre-training in the paper BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding.
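The masking step can be sketched roughly as below (an illustration only; BERT also sometimes keeps the original token or substitutes a random one, which this omits, and the mask id and masking probability are assumptions):

```python
import torch


def mask_tokens(tokens: torch.Tensor, mask_id: int, mask_prob: float = 0.15):
    # tokens: [seq_len, batch] of token ids
    is_masked = torch.rand(tokens.shape) < mask_prob
    inputs = tokens.masked_fill(is_masked, mask_id)   # replace chosen tokens with [MASK]
    labels = tokens.masked_fill(~is_masked, -100)     # -100 is ignored by cross entropy
    return inputs, labels
```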
This is an implementation of the paper MLP-Mixer: An all-MLP Architecture for Vision.
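A single mixer layer can be sketched as below (an illustration, not the module's code): one MLP mixes information across patches (tokens), another mixes across channels.

```python
import torch
import torch.nn as nn


class MixerBlock(nn.Module):
    def __init__(self, n_patches: int, d_model: int, d_token: int, d_channel: int):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.token_mlp = nn.Sequential(nn.Linear(n_patches, d_token), nn.GELU(),
                                       nn.Linear(d_token, n_patches))
        self.norm2 = nn.LayerNorm(d_model)
        self.channel_mlp = nn.Sequential(nn.Linear(d_model, d_channel), nn.GELU(),
                                         nn.Linear(d_channel, d_model))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: [batch, n_patches, d_model]
        # token mixing: apply the MLP along the patch dimension
        x = x + self.token_mlp(self.norm1(x).transpose(1, 2)).transpose(1, 2)
        # channel mixing: apply the MLP along the feature dimension
        return x + self.channel_mlp(self.norm2(x))
```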
This is an implementation of the paper Pay Attention to MLPs.
This is an implementation of the paper An Image Is Worth 16x16 Words: Transformers For Image Recognition At Scale.
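The first step, turning an image into a sequence of patch tokens, can be sketched as below (an illustration, not the module's code); a strided convolution is equivalent to splitting the image into patches and linearly projecting each one.

```python
import torch
import torch.nn as nn


class PatchEmbedding(nn.Module):
    def __init__(self, in_channels: int, patch_size: int, d_model: int):
        super().__init__()
        self.proj = nn.Conv2d(in_channels, d_model, kernel_size=patch_size, stride=patch_size)

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        # images: [batch, channels, height, width] -> [batch, n_patches, d_model]
        return self.proj(images).flatten(2).transpose(1, 2)
```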
This is an implementation of the paper Primer: Searching for Efficient Transformers for Language Modeling.
This is an implementation of the paper Hierarchical Transformers Are More Efficient Language Models.
```python
from .configs import TransformerConfigs
from .models import TransformerLayer, Encoder, Decoder, Generator, EncoderDecoder
from .mha import MultiHeadAttention
from labml_nn.transformers.xl.relative_mha import RelativeMultiHeadAttention
```
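A hedged usage sketch of these exports (the constructor arguments and the [seq_len, batch_size, d_model] input shape are assumptions based on the module's conventions; check the annotated source):

```python
import torch
from labml_nn.transformers import MultiHeadAttention

mha = MultiHeadAttention(heads=8, d_model=512)
x = torch.randn(10, 2, 512)         # [seq_len, batch_size, d_model]
out = mha(query=x, key=x, value=x)  # self-attention; output has the same shape as x
```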