Meta AI’s Sparse All-MLP Model Doubles Training Efficiency Compared to Transformers
Transformer architectures have established the state-of-the-art on natural language processing (NLP) and many computer vision tasks, and recent research has shown that All-MLP (multi-layer perceptron) architectures also have strong potential in these areas. However, although newly proposed MLP models such as gMLP (Liu et al., 2021a) can match transformers in language modeling perplexity, they still lag in downstream task performance.
In the new paper Efficient Language Modeling with Sparse all-MLP, a research team from Meta AI and the State University of New York at Buffalo extends the gMLP model with sparsely activated conditional computation using mixture-of-experts (MoE) techniques. Their resulting sMLP, a sparsely activated all-MLP architecture, boosts the performance of all-MLPs in large-scale NLP pretraining, improving training efficiency by up to 2x compared to transformer-based MoE architectures, dense transformers, and gMLP.
The team believes their proposed sMLP is the first NLP work to combine all-MLP-based models with MoEs. The paper provides an in-depth analysis of why MLP architectures trail transformers in terms of expressiveness and identifies challenges in turning MLPs into sparsely activated MoEs, challenges sMLP addresses with a novel sMoE module and two routing strategies.
The sMLP architecture comprises both dense blocks and sparse blocks. Each sparse block contains a tMoE module (the team adopts the MoE design from Base Layers to replace the FFN module in dense transformers) and an sMoE module that replaces the self-attention module in transformers and the spatial gating unit in gMLP. The tMoE and sMoE blocks both contain expert modules that process inputs and a gating function that decides which expert handles each input.
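To make the expert/gating split concrete, here is a minimal NumPy sketch of top-1 expert routing of the kind used in MoE modules. It is an illustration of the general mechanism, not the authors' implementation: the function name `top1_moe`, the use of a single linear gate, and the choice of plain dense weight matrices as "experts" are all simplifying assumptions.

```python
import numpy as np

def top1_moe(x, gate_w, expert_ws):
    """Illustrative top-1 MoE routing (hypothetical shapes, not the paper's code).

    x:         (tokens, d_model) input representations
    gate_w:    (d_model, n_experts) gating weights
    expert_ws: list of (d_model, d_model) expert weight matrices
    """
    logits = x @ gate_w                            # gating scores per token
    choice = logits.argmax(axis=-1)                # top-1 expert index per token
    # Softmax gate value of the chosen expert scales its output.
    probs = np.exp(logits - logits.max(-1, keepdims=True))
    probs /= probs.sum(-1, keepdims=True)
    out = np.empty_like(x)
    for e, w in enumerate(expert_ws):
        mask = choice == e                         # tokens routed to expert e
        out[mask] = (x[mask] @ w) * probs[mask, e:e + 1]
    return out, choice
```

Because each token is processed by only one expert, compute per token stays constant as experts (and thus total parameters) are added, which is the source of the sparse models' training-efficiency gains.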