Transformers Scale to Long Sequences With Linear Complexity Via Nyström-Based Self-Attention Approximation

Synced
Feb 11, 2021

In the early days of NLP research, establishing long-term dependencies brought with it the vanishing gradient problem, as nascent models processed input tokens one at a time, without parallelization. More recently, revolutionary transformer-based architectures and their self-attention mechanisms have enabled interactions between token pairs across full sequences, modelling arbitrary dependencies in a constant number of layers to achieve state-of-the-art performance across many NLP tasks.

These advantages, however, come at a high cost, as the memory and computational requirements of transformer-based networks grow quadratically with sequence length, creating major efficiency bottlenecks when dealing with long sequences. In the new paper Nyströmformer: A Nyström-based Algorithm for Approximating Self-Attention, researchers from the University of Wisconsin-Madison, UC Berkeley, Google Brain and American Family Insurance propose Nyströmformer, an O(n) approximation of self-attention in both memory and time, designed to reduce the quadratic cost associated with long input sequences.

The Nyström method is an efficient technique for obtaining a low-rank approximation of a large kernel matrix. The researchers’ proposed method leverages a Nyström approximation tailored to the softmax matrix, reducing the complexity of self-attention computation from O(n²) to O(n).
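To make the general idea concrete, here is a minimal NumPy sketch of the classic Nyström method applied to a generic kernel matrix. The RBF kernel, the landmark count and the random landmark sampling below are illustrative choices, not details from the paper: the point is that only an n × m and an m × m factor are ever stored, and the full n × n matrix is reconstructed here purely to check the approximation error.

```python
import numpy as np

def rbf_kernel(X, Y, gamma=1.0):
    """Gaussian (RBF) kernel between the rows of X and the rows of Y."""
    sq_dists = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq_dists)

def nystrom_approx(X, n_landmarks, gamma=1.0, seed=None):
    """Classic Nyström approximation K ~= C @ pinv(W) @ C.T, where
    C = K(X, landmarks) is n x m and W = K(landmarks, landmarks) is m x m.
    Storing only C and W costs O(n * m) memory instead of O(n^2)."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(X), size=n_landmarks, replace=False)
    landmarks = X[idx]
    C = rbf_kernel(X, landmarks, gamma)          # n x m
    W = rbf_kernel(landmarks, landmarks, gamma)  # m x m
    # Forming the full product is only for the error check below.
    return C @ np.linalg.pinv(W) @ C.T

X = np.random.default_rng(0).normal(size=(512, 16))
K_full = rbf_kernel(X, X)
K_approx = nystrom_approx(X, n_landmarks=32, seed=0)
print(np.linalg.norm(K_full - K_approx) / np.linalg.norm(K_full))
```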

A Nyström approximation of a softmax matrix in self-attention
Pipeline for Nyström approximation of softmax matrix in self-attention

The basic idea behind the algorithm is to first select landmark points and form their matrix representations, then use these to build the three smaller matrices needed for the approximation. The landmarks are selected before the softmax operation, which avoids ever computing the full softmax matrix S. The Nyström approximation thus scales linearly (O(n) complexity) with input sequence length in terms of both memory and time, as illustrated in the sketch below.
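Concretely, the full n × n softmax matrix is replaced by a product of three smaller softmax matrices built from landmark queries and keys. The following NumPy sketch illustrates this factorization under simplifying assumptions that are mine, not the paper’s: a single attention head, a sequence length divisible by the number of landmarks, and np.linalg.pinv as a stand-in for the paper’s iterative pseudoinverse.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def nystrom_attention(Q, K, V, n_landmarks):
    """Approximate softmax(Q K^T / sqrt(d)) @ V without forming the n x n matrix:
        S ~= F @ pinv(A) @ B, with
        F = softmax(Q       K_tilde^T / sqrt(d))   # n x m
        A = softmax(Q_tilde K_tilde^T / sqrt(d))   # m x m
        B = softmax(Q_tilde K^T       / sqrt(d))   # m x n
    Q_tilde and K_tilde are landmark queries/keys (segment means here)."""
    n, d = Q.shape
    m = n_landmarks
    Q_tilde = Q.reshape(m, n // m, d).mean(axis=1)  # m x d
    K_tilde = K.reshape(m, n // m, d).mean(axis=1)  # m x d
    scale = 1.0 / np.sqrt(d)
    F = softmax(Q @ K_tilde.T * scale)
    A = softmax(Q_tilde @ K_tilde.T * scale)
    B = softmax(Q_tilde @ K.T * scale)
    # Group the products so no n x n intermediate is ever materialized.
    return F @ (np.linalg.pinv(A) @ (B @ V))
```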

Proposed architecture of efficient self-attention via Nyström approximation

Given input keys K and queries Q, the proposed Nyströmformer first uses Segment-means to compute landmark points. Based on these landmark points, the architecture then calculates the Nyström approximation using an approximate Moore-Penrose pseudoinverse.
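A sketch of these two ingredients, written as my reading of the paper’s description rather than its exact implementation: segment-means simply averages contiguous chunks of the sequence to obtain the landmarks, and the pseudoinverse is approximated with a fixed-point iteration so that every step reduces to matrix multiplies (the iteration count and the initialization below follow the paper’s description, hedged as an assumption).

```python
import numpy as np

def segment_means(X, n_landmarks):
    """Segment-means landmarks: average contiguous segments of the sequence.
    Assumes the sequence length is divisible by n_landmarks (pad otherwise)."""
    n, d = X.shape
    return X.reshape(n_landmarks, n // n_landmarks, d).mean(axis=1)

def iterative_pinv(A, n_iter=6):
    """Approximate Moore-Penrose pseudoinverse of a square matrix A via the
    iteration described in the Nystromformer paper:
        Z_0     = A^T / (||A||_1 * ||A||_inf)
        Z_{j+1} = 1/4 * Z_j (13 I - A Z_j (15 I - A Z_j (7 I - A Z_j)))"""
    I = np.eye(A.shape[0])
    Z = A.T / (np.abs(A).sum(axis=0).max() * np.abs(A).sum(axis=1).max())
    for _ in range(n_iter):
        AZ = A @ Z
        Z = 0.25 * Z @ (13 * I - AZ @ (15 * I - AZ @ (7 * I - AZ)))
    return Z
```

Swapping iterative_pinv in for np.linalg.pinv in the earlier sketch keeps the whole pipeline expressible with matrix multiplications, with no SVD required.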

To evaluate the model, the researchers conducted experiments in a transfer learning setting in two stages. In the first, Nyströmformer was pretrained on BookCorpus and English Wikipedia data. Next, the pretrained Nyströmformer was fine-tuned on different NLP tasks from the GLUE (General Language Understanding Evaluation) benchmark (SST-2, MRPC, QNLI, QQP and MNLI) and on IMDB reviews. For both stages, the baseline was the popular transformer model BERT.

Memory consumption and running time results on various input sequence lengths
Results on natural language understanding tasks. F1 score for MRPC and QQP and accuracy for others.

The results show that Nyströmformer offers favourable memory and time efficiency over standard self-attention and Longformer self-attention, and performs competitively with the BERT-base model. Overall, Nyströmformer provides self-attention approximation with high efficiency, a big step towards running transformer models on very long sequences.

The paper Nyströmformer: A Nyström-based Algorithm for Approximating Self-Attention is on arXiv.

Author: Hecate He | Editor: Michael Sarazen

