Google & IDSIA’s Block-Recurrent Transformer Dramatically Outperforms Transformers Over Very Long Sequences

The increasing popularity of transformer architectures in natural language processing (NLP) and other AI research areas is largely attributable to their superior expressive capability when modelling long-range relationships across input sequences. A major drawback limiting transformer deployment, however, is that the computational complexity of their self-attention mechanism grows quadratically with input sequence length, making very long sequences prohibitively expensive to process.
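
To make that scaling issue concrete, here is a minimal NumPy sketch (not taken from the paper) of plain scaled dot-product attention; the (n × n) score matrix is what makes the cost grow quadratically with the sequence length n:

```python
# Minimal sketch (illustrative only, not the paper's implementation) showing why
# self-attention cost grows quadratically: one score per pair of tokens.
import numpy as np

def attention_score_count(seq_len: int) -> int:
    """Number of query-key dot products a single attention head computes."""
    return seq_len * seq_len

def attention(q, k, v):
    """Plain scaled dot-product attention; q, k, v each have shape (n, d)."""
    n, d = q.shape
    scores = q @ k.T / np.sqrt(d)                     # (n, n) matrix: O(n^2 * d) work
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # row-wise softmax
    return weights @ v                                # another O(n^2 * d) multiply

for n in (512, 2048, 8192):
    print(f"sequence length {n:>5}: {attention_score_count(n):>12,} score entries")
# Doubling the sequence length quadruples the attention cost, which is the
# bottleneck the Block-Recurrent Transformer is designed to sidestep.
```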