OPPO AI’s Transformer-Lite Delivers 10x+ Prefill and 2~3x Decoding Boost on Mobile Phone GPUs

Synced
3 min read · Apr 15, 2024

Large language models (LLMs) have demonstrated remarkable efficacy across various real-world applications, including intelligent assistants, text summarization, translation, and multi-modal tasks on mobile devices. Nonetheless, current methodologies for on-device LLM deployment are hampered by sluggish inference speeds, resulting in subpar user experiences.

In a new paper Transformer-Lite: High-efficiency Deployment of Large Language Models on Mobile Phone GPUs, researchers from the OPPO AI Center introduce a solution: four optimization techniques and a novel mobile inference engine dubbed Transformer-Lite. The engine outperforms both CPU-based FastLLM and GPU-based MLC-LLM, achieving a more than 10x acceleration in prefill speed and a 2~3x acceleration in decoding speed.

To streamline LLM deployment on device GPUs while preserving efficiency, the team proposes combining the strengths of generic mobile inference engines with LLM-specific optimizations. To tackle this challenge, they introduce four techniques:

  1. Symbolic Expression-based Approach: This supports dynamic shape model inference through dynamic shape derivation, memory reuse, and execution scheduling (see the first sketch after this list).
  2. Operator Optimizations and Execution Priority Setting: These enhance performance and reduce phone lag.
  3. FP4 Quantization Method (M0E4): This minimizes the performance overhead of dequantization, enabling more efficient matrix multiplication (see the second sketch below).
  4. Sub-tensor-based Approach: This circumvents copying the KV cache from model outputs to model inputs after each inference iteration (see the third sketch below).
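
To make the first technique concrete, here is a toy illustration of symbolic shape derivation in Python, with sympy standing in for the engine's own symbolic machinery; the graph, dimensions, and arena-planning logic are illustrative assumptions, not the paper's implementation.

```python
# A toy illustration of symbolic-expression shape derivation (not the
# engine's actual code): shapes are kept as sympy expressions of a dynamic
# sequence length, so memory planning is done once and evaluated whenever
# the concrete length becomes known.
import sympy as sp

seq = sp.Symbol("seq", positive=True, integer=True)  # the dynamic dimension
hidden, heads = 4096, 32                             # static dimensions

# Derive output shapes symbolically through a tiny attention-like graph.
x_shape = (1, seq, hidden)            # input activations
qkv_shape = (1, seq, 3 * hidden)      # fused QKV projection output
scores_shape = (1, heads, seq, seq)   # attention-score matrix

def num_elements(shape):
    """Element count of a shape, as a symbolic expression."""
    n = sp.Integer(1)
    for d in shape:
        n *= d
    return n

# Plan one reusable arena: its peak size is an expression of `seq`, so a
# single substitution at run time yields the concrete allocation size.
peak = sp.Max(num_elements(qkv_shape), num_elements(scores_shape))
for real_len in (128, 1024):
    print(f"seq={real_len}: arena needs {peak.subs(seq, real_len)} elements")
```

Evaluating the expression once per request replaces repeated shape inference and lets the same buffers be reused across operators.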
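
For the quantization technique, the intuition behind an FP4 format like M0E4 is that dequantization becomes cheap when restoring a weight needs little more than exponent arithmetic. The sketch below illustrates this with power-of-two quantization, where dequantization is an ldexp (exponent addition) plus one shared per-group scale; the group size, exponent range, and rounding scheme are assumptions for illustration, and the paper specifies the actual M0E4 bit layout and GPU kernels.

```python
# Illustrative power-of-two quantization (NOT the paper's exact M0E4 bit
# layout): each weight is stored as a sign and an exponent, so dequantization
# is an exponent addition (ldexp) plus one shared per-group scale, not a
# full per-weight multiply.
import numpy as np

def quantize_pow2(w, group_size=128, e_min=-7, e_max=0):
    """Per-group quantization of weights to signed powers of two.
    group_size and the exponent range are illustrative assumptions."""
    w = w.reshape(-1, group_size)
    scale = np.abs(w).max(axis=1, keepdims=True)   # per-group scale
    ratio = np.abs(w) / np.maximum(scale, 1e-12)
    e = np.clip(np.round(np.log2(np.maximum(ratio, 2.0 ** e_min))),
                e_min, e_max)
    return np.sign(w).astype(np.int8), e.astype(np.int8), scale

def dequantize_pow2(sign, e, scale):
    """Restore weights as scale * 2**e via ldexp, i.e. exponent arithmetic."""
    return sign * np.ldexp(scale, e.astype(np.int32))

w = (np.random.randn(4096 * 128) * 0.02).astype(np.float32)
sign, e, scale = quantize_pow2(w)
w_hat = dequantize_pow2(sign, e, scale)
print("mean abs error:", np.abs(w.reshape(w_hat.shape) - w_hat).mean())
```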
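
And the sub-tensor idea can be pictured as one preallocated cache buffer that each decoding step writes into through a view, so nothing is copied between iterations. A minimal sketch, with buffer layout and names as illustrative assumptions:

```python
# Illustrative sub-tensor KV cache (buffer layout and names are assumptions):
# the full cache is allocated once, and every decoding step writes new keys
# and values into a slice of it, so nothing is copied from model outputs
# back to model inputs between iterations.
import numpy as np

max_len, heads, head_dim = 2048, 32, 128
k_cache = np.zeros((max_len, heads, head_dim), dtype=np.float16)
v_cache = np.zeros((max_len, heads, head_dim), dtype=np.float16)
cur_len = 0

def decode_step(new_k, new_v):
    """One decoding iteration: in-place writes through sub-tensor views."""
    global cur_len
    k_cache[cur_len] = new_k  # writes into the preallocated buffer
    v_cache[cur_len] = new_v
    cur_len += 1
    # Attention reads the valid prefix as another sub-tensor view.
    return k_cache[:cur_len], v_cache[:cur_len]

k_view, v_view = decode_step(np.ones((heads, head_dim), np.float16),
                             np.ones((heads, head_dim), np.float16))
print(f"cache holds {k_view.shape[0]} token(s); no output-to-input copy")
```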

Moreover, the researchers develop the Transformer-Lite engine with these optimizations integrated. The engine deploys LLMs from ONNX models exported by training frameworks such as PyTorch, making deployment convenient and new model structures easy to support.
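
For context, exporting a model with a dynamic sequence axis via the standard torch.onnx.export workflow looks roughly like this (a generic sketch with a toy model, not OPPO's export script):

```python
# A generic PyTorch-to-ONNX export with a dynamic sequence axis (a sketch of
# the standard torch.onnx.export workflow with a toy model, not OPPO's
# export script).
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(256, 512), nn.ReLU(), nn.Linear(512, 256))
model.eval()
dummy = torch.randn(1, 16, 256)  # (batch, seq, hidden); seq=16 is a dummy

torch.onnx.export(
    model, (dummy,), "toy_model.onnx",
    input_names=["x"], output_names=["y"],
    dynamic_axes={"x": {1: "seq"}, "y": {1: "seq"}},  # mark seq as dynamic
    opset_version=17,
)
print("exported toy_model.onnx with a symbolic 'seq' dimension")
```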

In their empirical analysis, the team evaluates the performance of the proposed engine using two mobile phones: the OPPO Find X7 24GB memory version and the OPPO Find X7 Ultra 12GB memory version. They select five LLM models with varying structures and parameter sizes: Gemma 2B, Qwen1.5 4B, ChatGLM2 6B, Llama2 7B, and Qwen1.5 14B. By comparing GPU inference with MLC-LLM and CPU inference with FastLLM, they demonstrate the superiority of the Transformer-Lite engine.

Specifically, the Transformer-Lite engine achieves prefill and decoding speeds of 121 tokens/s and 14 tokens/s for ChatGLM2 6B, and 330 tokens/s and 30 tokens/s for the smaller Gemma 2B. This represents an over 10x speedup in prefill speed and a 2~3x speedup in decoding speed compared with both CPU-based FastLLM and GPU-based MLC-LLM.
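
To put those throughput figures in perspective, here is a back-of-the-envelope chat-turn latency calculation (the 1,000-token prompt and 200-token reply are made-up workload assumptions):

```python
# Back-of-the-envelope chat-turn latency from the reported throughputs;
# the 1,000-token prompt and 200-token reply are made-up workloads.
prefill_tps, decode_tps = 121.0, 14.0      # ChatGLM2 6B on Transformer-Lite
prompt_tokens, reply_tokens = 1000, 200

prefill_s = prompt_tokens / prefill_tps    # time to ingest the prompt
decode_s = reply_tokens / decode_tps       # time to generate the reply
print(f"prefill {prefill_s:.1f}s + decode {decode_s:.1f}s "
      f"= {prefill_s + decode_s:.1f}s per turn")
```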

The paper Transformer-Lite: High-efficiency Deployment of Large Language Models on Mobile Phone GPUs is on arXiv.

Author: Hecate He | Editor: Chain Zhang

We know you don’t want to miss any news or research breakthroughs. Subscribe to our popular newsletter Synced Global AI Weekly to get weekly AI updates.

