The effective modelling of long-term dependencies enables conditioning new model outputs on previous inputs and is critical when dealing with longer text, audio or video contexts. However, when modelling the long-term dependencies of audio signals, even smaller time scales can yield hundreds of thousands of samples. While transformer architectures have…