This is a PyTorch implementation of the paper Primer: Searching for Efficient Transformers for Language Modeling.
The authors do an evolutionary search for transformer architectures. They name the architecture found by the search Primer (PRIMitives searched transformER). Primer EZ is the architecture with the two most robust modifications from Primer relative to the original transformer. Primer EZ trains a lot faster than the vanilla transformer.
The most effective modification found by the search is using a squared ReLU instead of ReLU in the position-wise feedforward module.
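Here is a minimal sketch of how squared ReLU could be implemented as a PyTorch module (the class name SquaredReLU is ours for illustration):

```python
import torch
from torch import nn


class SquaredReLU(nn.Module):
    """Squared ReLU activation: relu(x) ** 2."""

    def __init__(self):
        super().__init__()
        self.relu = nn.ReLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Apply ReLU, then square element-wise
        x = self.relu(x)
        return x * x
```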
The next most effective modification is a depth-wise 3×1 convolution after the multi-head projections for queries, keys, and values. The convolution is along the sequence dimension and per channel (depth-wise). To be clear, if the number of channels in each head is d_k, the convolution will have a 1×3 kernel for each of the d_k channels.
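Below is a sketch of such a spatial depth-wise convolution, assuming the projected tensor has shape `[seq_len, batch_size, heads, d_k]` and that left-only padding is used to keep the convolution causal (both are assumptions for this illustration, not details stated above):

```python
import torch
from torch import nn


class SpatialDepthWiseConvolution(nn.Module):
    """Depth-wise 3x1 convolution along the sequence dimension, one kernel per channel."""

    def __init__(self, d_k: int, kernel_size: int = 3):
        super().__init__()
        self.kernel_size = kernel_size
        # groups=d_k makes the convolution depth-wise (a separate kernel for each channel).
        self.conv = nn.Conv1d(in_channels=d_k, out_channels=d_k,
                              kernel_size=kernel_size, padding=kernel_size - 1,
                              groups=d_k)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x has shape `[seq_len, batch_size, heads, d_k]`
        seq_len, batch_size, heads, d_k = x.shape
        # Rearrange to `[batch_size * heads, d_k, seq_len]` for Conv1d
        x = x.permute(1, 2, 3, 0).reshape(batch_size * heads, d_k, seq_len)
        x = self.conv(x)
        # Crop the extra outputs on the right introduced by the padding,
        # so each position only depends on the current and previous tokens
        x = x[:, :, :-(self.kernel_size - 1)]
        # Restore the original shape `[seq_len, batch_size, heads, d_k]`
        x = x.reshape(batch_size, heads, d_k, seq_len).permute(3, 0, 1, 2)
        return x
```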
Here is the experiment code for Primer EZ.