CMU & Meta’s TriForce: Turbocharging Long Sequence Generation with 2.31× Speed Boost on A100 GPU

Synced · Published in SyncedReview · 3 min read · Apr 28, 2024

Large language models (LLMs) with long-context capabilities, such as GPT-4 and Gemini, are finding increasingly versatile applications in domains like chatbots, vision generation, and financial analysis. However, their efficiency is hampered by poor utilization of computational resources and a substantial memory footprint, particularly when generating long sequences.

Addressing these challenges, a research team from Carnegie Mellon University and Meta AI introduces TriForce, a hierarchical speculative decoding system tailored for scalable long sequence generation, in a new paper, TriForce: Lossless Acceleration of Long Sequence Generation with Hierarchical Speculative Decoding. TriForce not only achieves remarkable speedups for models like Llama2-7B-128K, reaching up to 2.31× on an A100 GPU, but also demonstrates scalability to even longer contexts.

The researchers identified three crucial insights that guided the development of TriForce:

  1. Hierarchical Speculation for Dual Memory Bottlenecks: Recognizing two primary memory bottlenecks — model weights and key-value (KV) cache — the team observed that as context length increases, the latter gradually becomes the dominant bottleneck. This led them to employ hierarchical speculation, addressing these bottlenecks sequentially with different draft models.
  2. Leveraging Attention Sparsity for Speculative Decoding: Identifying significant redundancy within the KV cache, the researchers found that a small portion of it suffices to achieve a high acceptance rate. They therefore use a partial KV cache as the draft cache for self-speculation, capitalizing on attention sparsity.
  3. Exploiting Contextual Locality for Drafting Efficiency: Discovering that adjacent tokens often require similar information from the long context, the team leveraged this locality to reuse a retrieved cache across several consecutive decoding steps rather than rebuilding it for every token (see the sketch after this list).
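
To make the retrieval idea concrete, below is a minimal PyTorch sketch of chunk-based KV selection for a single attention head. The function select_kv_chunks and its parameters are hypothetical illustrations, not the paper’s implementation, which operates per layer and per head and differs in detail.

```python
import torch

def select_kv_chunks(query, keys, values, chunk_size=128, budget_chunks=32):
    """query: (d,); keys, values: (seq_len, d). Returns a small draft cache."""
    seq_len, d = keys.shape
    n_chunks = seq_len // chunk_size
    # Represent each chunk of the KV cache by the mean of its keys.
    chunk_keys = keys[:n_chunks * chunk_size].view(n_chunks, chunk_size, d).mean(dim=1)
    # Score every chunk by scaled dot-product attention against the current query.
    scores = chunk_keys @ query / d ** 0.5            # (n_chunks,)
    top = torch.topk(scores, min(budget_chunks, n_chunks)).indices.sort().values
    # Gather the selected chunks, preserving their original order in the sequence.
    idx = (top[:, None] * chunk_size + torch.arange(chunk_size)).reshape(-1)
    # Contextual locality (insight 3) means this selection can be reused for
    # several consecutive decoding steps rather than recomputed per token.
    return keys[idx], values[idx]
```

Here budget_chunks controls how much of the full cache the draft retains; per insight 2, a small budget can already yield a high acceptance rate.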

Building upon these insights, TriForce combines retrieval-based drafting with hierarchical speculation to tackle both bottlenecks. The original model weights, paired with a dynamically retrieved sparse KV cache, act as a draft model that forms the intermediate layer of the hierarchy; this intermediate model is in turn speculated by a much smaller model to reduce drafting latency. A simplified version of the resulting decoding loop is sketched below.
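
The control flow can be pictured as a nested speculation loop. The sketch below uses hypothetical generate and verify interfaces (real speculative decoding verifies drafts token by token against the draft distribution, which is what makes the acceleration lossless); it illustrates the hierarchy and is not the paper’s code.

```python
def hierarchical_speculate(prefix, tiny_model, retrieval_model, full_model,
                           gamma1=4, gamma2=16, max_new_tokens=256):
    # tiny_model: lightweight draft model with a short cache (lowest layer)
    # retrieval_model: target weights + sparse retrieved KV cache (middle layer)
    # full_model: target weights + full KV cache (final, lossless verifier)
    tokens = list(prefix)
    while len(tokens) < len(prefix) + max_new_tokens:
        middle_draft = []
        while len(middle_draft) < gamma2:
            # Level 1: the tiny model cheaply drafts gamma1 candidate tokens.
            draft = tiny_model.generate(tokens + middle_draft, n=gamma1)
            # The retrieval-cache model keeps the accepted prefix plus one
            # corrected token, so each round always makes progress.
            middle_draft += retrieval_model.verify(tokens + middle_draft, draft)
        # Level 2: the full-cache model verifies the accumulated middle draft,
        # so the final output matches standard auto-regressive decoding.
        tokens += full_model.verify(tokens, middle_draft)
    return tokens
```

Because the memory-bound full-cache verification now runs once per batch of roughly gamma2 tokens, its KV-cache traffic is amortized over many generated tokens.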

TriForce’s performance speaks volumes. It achieves a notable speedup for Llama2-7B-128K, up to 2.31× on an A100 GPU, while scaling to even longer contexts. In an offloading setting on two RTX 4090 GPUs, TriForce generates at 0.108 s/token, only half as slow as the auto-regressive baseline running on an A100, and it attains a 7.78× speedup over the auto-regressive baseline on the same optimized offloading system. Furthermore, TriForce outperforms DeepSpeed-Zero-Inference on a single RTX 4090 GPU by 4.86×. These results underscore TriForce’s potential to revolutionize the serving of long-context models for extensive sequence generation.

The paper TriForce: Lossless Acceleration of Long Sequence Generation with Hierarchical Speculative Decoding is on arXiv.

Author: Hecate He | Editor: Chain Zhang

We know you don’t want to miss any news or research breakthroughs. Subscribe to our popular newsletter Synced Global AI Weekly to get weekly AI updates.
