
Authors: Xuqin Zhang*, Quan He*, Zhenrui Zheng, Zongzhang Zhang, Xu He, Dong Li

(*: Co-first authors)


📄 Paper: arxiv.org

🧠 Model: 🤗 ASTER-4B-RL

⌨️ Code & Dataset: https://github.com/Rainyrou/ASTER. Special thanks to @Rainyrou for the reproduction.


Overview

We observe that in Tool-Integrated Reasoning (TIR, where the model can interact with code tools), existing training recipes such as ReTool and Zero-TIR often suffer from interaction collapse and fail to fully exploit the potential of code tools. After analyzing factors such as cold-start, interaction density, and reasoning budget, we propose a new recipe based on continued training of Qwen3-4B. Compared to prior methods, our approach significantly lifts the model's capability ceiling: with only 4B parameters, the trained model matches top-tier large models such as DeepSeek-V3.2-exp on classic math benchmarks like AIME 24/25 and HMMT.


Figure 1: AIME 2025 score vs. parameter scale. Our model ASTER-4B achieves 90.0, not only far surpassing models in the same size bracket but also matching frontier models with roughly 100× more parameters, such as DeepSeek-V3.2-exp (671B) and MiniMax-M2.5 (~230B), demonstrating a clear efficiency advantage.


Figure 2: Comparison of our training recipe with the baseline and prior methods.

TL;DR

Background: Why doesn’t RL for TIR scale?

Over the past year, RL has been highly successful for long-horizon reasoning (e.g., the “slow thinking” behaviors shown by systems like o1/R1). But text-only reasoning is inherently fragile: a small mistake gets amplified along long chains, and there is no external feedback to “pin down” intermediate steps.

The intuition behind Tool-Integrated Reasoning (TIR) is straightforward: delegate precise computation, intermediate verification, and repeated trial-and-error to tools (most commonly Python execution), while the model focuses on planning, modeling, explanation, and correction. However, when TIR models are pushed further with RL, a common failure mode is Interaction Collapse: during training, the model gradually reduces multi-round tool use, moves most computation and derivation back into text, and ends up with only one (or very few) code calls used purely for verification. We hypothesize that this collapse stems from both data priors and optimization dynamics; a sketch of the interaction loop that collapse erodes is shown below, and the specific causes follow it:
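To make the interaction loop concrete, here is a minimal sketch of a generic TIR rollout. The `model.generate` and `sandbox.run` interfaces and the `<code>`/`<output>` tags are hypothetical conventions for illustration, not the exact ASTER implementation; logging the returned tool-call count over training is one simple way to observe interaction collapse.

```python
# Minimal sketch of a Tool-Integrated Reasoning (TIR) rollout.
# `model.generate` and `sandbox.run` are hypothetical interfaces, and
# the <code>/<output> tags are illustrative conventions; none of this
# is the exact ASTER implementation.

def tir_rollout(model, sandbox, prompt: str, max_rounds: int = 8):
    """Alternate model text with sandboxed code execution.

    Returns the full transcript and the number of tool calls; a
    tool-call count that drifts toward 0-1 over RL training is the
    signature of interaction collapse.
    """
    transcript = prompt
    tool_calls = 0
    for _ in range(max_rounds):
        # The model continues the reasoning and may open a <code> block;
        # generation stops when the block is closed.
        completion = model.generate(transcript, stop=["</code>"])
        if "<code>" not in completion:
            # Text-only finish: the model answered without a tool call.
            return transcript + completion, tool_calls
        code = completion.split("<code>", 1)[1]
        # Execute in the sandbox and feed the result back, giving the
        # model external feedback to pin down intermediate steps.
        result = sandbox.run(code)
        transcript += f"{completion}</code>\n<output>{result}</output>\n"
        tool_calls += 1
    return transcript, tool_calls
```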