
Meta AI’s Sparse All-MLP Model Doubles Training Efficiency Compared to Transformers

Mar 17, 2022

Transformer architectures have set the state of the art in natural language processing (NLP) and in many computer vision tasks, and recent research has shown that all-MLP (multi-layer perceptron) architectures also have strong potential in these areas. However, although recently proposed MLP models such as gMLP (Liu et al., 2021a) can match transformers in language modelling perplexity, they still lag in downstream performance.

In the new paper Efficient Language Modeling with Sparse all-MLP, a research team from Meta AI and the State University of New York at Buffalo extends the gMLP model with sparsely activated conditional computation using mixture-of-experts (MoE) techniques. The resulting sparsely activated all-MLP architecture, sMLP, boosts the performance of all-MLPs in large-scale NLP pretraining, achieving training efficiency improvements of up to 2x compared to transformer-based MoE architectures, dense transformers, and gMLP.
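To make "sparsely activated conditional computation" concrete, here is a minimal sketch of a generic top-1 mixture-of-experts layer in plain Python. It is not the sMLP architecture or the paper's routing scheme; the class name `Top1MoE`, the layer sizes, the ReLU activation and the random weights are all illustrative assumptions. The point it shows is only that a router picks one expert per token, so each token pays the compute cost of a single expert regardless of how many experts exist.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vector):
    """Multiply a (rows x cols) matrix by a length-cols vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def random_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


class Top1MoE:
    """Hypothetical top-1 MoE layer: one two-layer MLP expert runs per token."""

    def __init__(self, dim, hidden, num_experts, seed=0):
        rng = random.Random(seed)
        # Router: one score per expert for each token (assumed linear gate).
        self.router = random_matrix(num_experts, dim, rng)
        # Each expert is an independent MLP: dim -> hidden -> dim.
        self.experts = [
            (random_matrix(hidden, dim, rng), random_matrix(dim, hidden, rng))
            for _ in range(num_experts)
        ]

    def forward(self, tokens):
        outputs, choices = [], []
        for x in tokens:
            gates = softmax(matvec(self.router, x))
            # Conditional computation: keep only the highest-scoring expert.
            k = max(range(len(gates)), key=gates.__getitem__)
            w_in, w_out = self.experts[k]
            h = [max(0.0, v) for v in matvec(w_in, x)]  # ReLU (assumption)
            # Scale the expert output by its gate probability.
            outputs.append([gates[k] * v for v in matvec(w_out, h)])
            choices.append(k)
        return outputs, choices


# Usage: route 5 random 8-dimensional "tokens" through 4 experts.
rng = random.Random(1)
tokens = [[rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(5)]
layer = Top1MoE(dim=8, hidden=16, num_experts=4)
outputs, choices = layer.forward(tokens)
print(choices)  # index of the single expert used for each token
```

Adding experts to a layer like this grows the parameter count while the per-token compute stays roughly constant, which is the general trade-off that MoE-style models exploit.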

The team believes their proposed sMLP is the first NLP work to combine all-MLP-based models with MoEs. The paper provides an in-depth analysis of why MLP…

Written by Synced
