This is a PyTorch implementation of the paper MLP-Mixer: An all-MLP Architecture for Vision.
The paper applies the model to vision tasks. The model is similar to a transformer, with the attention layer replaced by an MLP that is applied across the patches (or tokens, in the case of an NLP task).
Our implementation of MLP-Mixer is a drop-in replacement for the self-attention layer in our transformer implementation. So it's just a couple of lines of code: transposing the tensor so the MLP is applied across the sequence dimension instead of the embedding dimension.
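The idea above can be sketched as a small PyTorch module. This is a minimal illustration, not the actual implementation: the class name `TokenMixer`, the argument names, and the `[seq_len, batch_size, d_model]` tensor layout are assumptions for the sketch.

```python
import torch
import torch.nn as nn


class TokenMixer(nn.Module):
    """Mixes information across the sequence (patch/token) dimension
    with an MLP, instead of self-attention.

    A minimal sketch; names and shapes are assumptions, not taken
    from the paper or the referenced implementation.
    """

    def __init__(self, seq_len: int, d_hidden: int):
        super().__init__()
        # The MLP mixes *across* positions, so its input size is the
        # sequence length, not the embedding dimension.
        self.mlp = nn.Sequential(
            nn.Linear(seq_len, d_hidden),
            nn.GELU(),
            nn.Linear(d_hidden, seq_len),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x has shape [seq_len, batch_size, d_model]
        x = x.permute(2, 1, 0)   # -> [d_model, batch_size, seq_len]
        x = self.mlp(x)          # MLP applied over the sequence dimension
        x = x.permute(2, 1, 0)   # back to [seq_len, batch_size, d_model]
        return x


if __name__ == "__main__":
    mixer = TokenMixer(seq_len=16, d_hidden=64)
    x = torch.randn(16, 2, 32)   # [seq_len, batch, d_model]
    print(mixer(x).shape)        # torch.Size([16, 2, 32])
```

Because the output shape matches the input shape, such a module can replace the attention layer in a transformer block without changing anything else around it.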
Although the paper applied MLP-Mixer to vision tasks, we tried it on a masked language model. Here is the experiment code.