
Can GRPO Be 10x More Efficient? Kwai AI’s SRPO Suggests Yes

Apr 24, 2025

The remarkable success of OpenAI’s o1 series and DeepSeek-R1 has unequivocally demonstrated the power of large-scale reinforcement learning (RL) in eliciting sophisticated reasoning behaviors and significantly enhancing the capabilities of large language models (LLMs).

However, the core training methodologies behind these groundbreaking reasoning models often remain veiled in their technical reports. Recent community efforts have predominantly focused on mathematical reasoning, leaving the challenge of cross-domain generalization largely unexplored. Furthermore, standard Group Relative Policy Optimization (GRPO) training is plagued by common issues such as performance bottlenecks, inefficient sample utilization, and difficulty in cultivating specialized reasoning skills on mixed-domain datasets. These challenges complicate the effective scaling of RL methods for LLMs.
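To see where that sample inefficiency comes from, here is a minimal sketch (not Kwaipilot’s implementation) of the group-relative advantage at the heart of GRPO: rewards are normalized within each group of sampled responses, so a group in which every rollout earns the same reward yields zero advantage for all of them and contributes no gradient signal.

```python
import numpy as np

def grpo_advantages(rewards):
    """Group-relative advantages as used in GRPO: each rollout's reward
    is normalized by the mean and standard deviation of its own group
    of sampled responses to the same prompt."""
    rewards = np.asarray(rewards, dtype=np.float64)
    std = rewards.std()
    if std < 1e-8:
        # Every response in the group received the same reward
        # (e.g. all correct or all wrong), so all advantages are zero
        # and the whole group produces no policy-gradient signal --
        # one source of the inefficient sample utilization noted above.
        return np.zeros_like(rewards)
    return (rewards - rewards.mean()) / std

# A mixed group still carries a useful learning signal:
print(grpo_advantages([1.0, 0.0, 0.0, 1.0]))   # [ 1. -1. -1.  1.]
# A saturated group is effectively wasted compute:
print(grpo_advantages([1.0, 1.0, 1.0, 1.0]))   # [ 0.  0.  0.  0.]
```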

Addressing these limitations, researchers from the Kwaipilot team at Kuaishou have introduced a novel reinforcement learning framework: Two-Staged history-Resampling Policy Optimization (SRPO). This innovative approach is designed to systematically tackle the aforementioned training challenges across multiple dimensions. The team has publicly released a technical report detailing the intricacies of their training method and has also open-sourced the SRPO-Qwen-32B model.

Notably, this work marks the first instance of achieving DeepSeek-R1-Zero-level performance on both mathematics and code benchmarks with a Qwen-32B base model, relying purely on reinforcement learning while using only about one-tenth of R1-Zero’s training steps.
