Authors

Jiazheng Li¹², Hongzhou Lin⁴, Hong Lu¹², Kaiyue Wen³, Zaiwen Yang¹, Jiaxuan Gao¹, Yi Wu¹², Jingzhao Zhang¹²

¹Tsinghua University

²Shanghai Qi Zhi Institute

³Stanford University

⁴Amazon. This work is independent of and outside of the work at Amazon.


TL;DR

QuestA (Question Augmentation) is an approach for improving reasoning capacity in Reinforcement Learning (RL) training. By injecting partial solution hints into the training prompts, QuestA achieves two major results:

  1. SOTA Performance on Pass@1: It delivers State-of-the-Art results for 1.5B models, surpassing even earlier 32B models on key benchmarks.
  2. Boosted Pass@k: While improving Pass@1, QuestA does not degrade Pass@k performance; in fact, it increases model capacity by enabling models to reason more effectively across multiple attempts.

These results open the door to more powerful models with stronger reasoning capabilities. QuestA enables RL to work efficiently across tasks of varying difficulty, eliminating the usual tradeoff between easy and hard problems.

For more detailed information, read on below.

Figure 1: QuestA is a data augmentation method that injects partial solutions to effectively scaffold RL training on hard reasoning problems. We construct 26K high-quality augmented prompts from challenging instances in OpenR1, and fine-tune models using 32K-context-length RL. When applied to Nemotron-1.5B, QuestA delivers substantial performance gains—achieving new state-of-the-art results across all math benchmarks for 1.5B-parameter models.


Figure 2: We compare pass@k curves of RLVR-trained models, with and without QuestA. As a controlled experiment, we perform RL training using either easy or hard prompts. Standard RL on easy prompts (red) shows clear degradation in pass@k as k increases compared to the base model (blue). Training on hard prompts (green) improves pass@k, but comes at the cost of substantially longer training. This motivates our development of QuestA, which scaffolds hard problems to improve training efficiency and delivers consistently stronger results: the RL+QuestA model (orange) stays above standard RL (red) across all k, while also preserving or improving performance at larger k relative to RL trained with hard prompts.


Key Results: Model Capacity Can Grow through RL

QuestA achieves two significant results:

  1. Pass@1 Improvement: QuestA leads to a dramatic improvement in Pass@1. We achieve new state-of-the-art results on math benchmarks using 1.5B-parameter models: 72.50% (+10.73%) on AIME24, 62.29% (+12.79%) on AIME25, and 41.67% (+10.11%) on HMMT25, surpassing even DeepSeek-R1-Distill-32B despite being a much smaller model. This indicates that QuestA significantly enhances the model's performance in regular usage.
  2. Pass@k Improvement: Unlike previous RL methods, QuestA also improves Pass@k, demonstrating that the model's capacity grows as RL training progresses. This is a key distinction: it shows that QuestA sustains exploration and reasoning, whereas other approaches see Pass@k degrade when optimizing for Pass@1. (A sketch of the standard pass@k estimator follows this list.)
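
For reference, pass@k curves like those in Figure 2 are typically computed with the standard unbiased estimator popularized by the Codex evaluation setup: draw n samples per problem, count the c correct ones, and estimate the chance that at least one of k attempts succeeds. The sketch below is a minimal Python version of that estimator; the function name and the example numbers are illustrative, not taken from the QuestA codebase.

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for a single problem.

    n: total completions sampled for the problem
    c: number of correct completions among them
    k: attempts allowed
    Returns an estimate of the probability that at least one of k
    attempts (drawn without replacement from the n samples) is correct.
    """
    if n - c < k:
        return 1.0  # not enough incorrect samples to fill all k attempts
    # 1 - C(n-c, k) / C(n, k), computed stably as a running product
    return 1.0 - float(np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# Example: 4 correct completions out of 16 samples
print(pass_at_k(n=16, c=4, k=1))  # 0.25
print(pass_at_k(n=16, c=4, k=8))  # ~0.96: more attempts, higher pass@k
```

Averaging this quantity over all problems in a benchmark gives the pass@k curve; pass@1 is simply the single-attempt case.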

The Tradeoff: Easy Tasks vs. Hard Tasks

For years, there has been a tradeoff in RL training: easy tasks lead to overconfidence in the model, while hard tasks improve reasoning but slow down learning due to low sample efficiency.

This tradeoff has long been a challenge in RL training, but QuestA removes it. By introducing partial solution hints during training on hard tasks, QuestA helps the model learn faster without sacrificing performance on easy tasks. The model therefore benefits from both easy and hard tasks, improving its reasoning ability without overconfidence on easy problems or slow learning on hard ones.
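
To make this concrete, here is a minimal sketch of how a hard prompt could be augmented with a partial-solution hint before RL rollout. The prompt template, the hint fraction, and the function name are illustrative assumptions rather than the exact QuestA recipe.

```python
def augment_with_hint(problem: str, reference_solution: str,
                      hint_fraction: float = 0.5) -> str:
    """Build an augmented prompt by prepending a prefix of the reference
    solution as a partial-solution hint.

    hint_fraction controls how much of the solution is revealed; the value
    and the prompt template below are illustrative assumptions.
    """
    cutoff = int(len(reference_solution) * hint_fraction)
    partial_solution = reference_solution[:cutoff]
    return (
        f"{problem}\n\n"
        "Here is a partial solution to get you started:\n"
        f"{partial_solution}\n\n"
        "Continue from the partial solution and provide the final answer."
    )

# Example: augment one hard training instance before sampling rollouts
hard_problem = "Find the number of ordered pairs (a, b) of positive integers such that ..."
reference = "Rewrite the condition as ... which factors into ... so we count the divisors of ..."
print(augment_with_hint(hard_problem, reference, hint_fraction=0.5))
```

In this sketch, the hint only changes the prompt that the policy sees; rollout sampling and reward verification proceed as in standard RLVR training.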

The QuestA Recipe: A Hint is All You Need

QuestA augments the training process by attaching partial solution hints to hard problems, helping models learn more efficiently through the following key design choices.