Transformers Scale to Long Sequences With Linear Complexity Via Nyström-Based Self-Attention Approximation

In the early days of NLP research, capturing long-term dependencies was hampered by the vanishing gradient problem, as early recurrent models processed input sequences token by token, without parallelization. More recently, revolutionary transformer-based architectures and their self-attention mechanisms have enabled…