
Authors: Xuqin Zhang*, Quan He*, Zhenrui Zheng, Zongzhang Zhang, Xu He, Dong Li

(*: Co-first authors)


📄 Paper: arxiv.org

🧠 Model: 🤗 ASTER-4B-RL

⌨️ Code & Dataset: https://github.com/Rainyrou/ASTER. Special thanks to @Rainyrou for the reproduction.


Overview

We observe that in Tool-Integrated Reasoning (TIR, where the model can interact with code tools), existing training recipes such as ReTool and Zero-TIR often suffer from interaction collapse and fail to fully exploit the potential of code tools. After analyzing factors such as cold-start, interaction density, and reasoning budget, we propose a new recipe based on continued training of Qwen3-4B. Compared to prior methods, our approach significantly lifts the model's capability ceiling: with only 4B parameters, the trained model matches top-tier large models such as DeepSeek-V3.2-exp on classic math benchmarks like AIME 24/25 and HMMT.


Figure 1: AIME 2025 score vs. parameter scale. Our model ASTER-4B achieves 90.0, not only far surpassing models in the same size bracket but also matching frontier models with roughly 100× more parameters, such as DeepSeek-V3.2-exp (671B) and MiniMax-M2.5 (~230B), demonstrating a clear efficiency advantage.


Figure 2: Comparison of our training recipe with the baseline and prior methods.

TL;DR

Background: Why doesn’t RL for TIR scale?

Over the past year, RL has been highly successful for long-horizon reasoning (e.g., the “slow thinking” behaviors shown by systems like o1/R1). But text-only reasoning is inherently fragile: a small mistake gets amplified along long chains, and there is no external feedback to “pin down” intermediate steps.

The intuition behind Tool-Integrated Reasoning (TIR) is straightforward: delegate precise computation, intermediate verification, and repeated trial-and-error to tools (most commonly Python execution), while the model focuses on planning, modeling, explanation, and correction. However, when TIR models are pushed further with RL, a common failure mode is Interaction Collapse: during training, the model gradually reduces multi-round tool use, moves most computation and derivation back into text, and ends up with only one (or very few) code calls used purely for verification. We hypothesize that this collapse stems from both data priors and optimization dynamics; a sketch of the interaction loop that collapse erodes is shown below, and the specific causes follow it:
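To make the interaction loop concrete, here is a minimal sketch of a generic TIR rollout. The `model.generate` and `sandbox.run` interfaces and the `<code>`/`<output>` tags are hypothetical conventions for illustration, not the exact ASTER implementation; logging the returned tool-call count over training is one simple way to observe interaction collapse.

```python
# Minimal sketch of a Tool-Integrated Reasoning (TIR) rollout.
# `model.generate` and `sandbox.run` are hypothetical interfaces, and
# the <code>/<output> tags are illustrative conventions; none of this
# is the exact ASTER implementation.

def tir_rollout(model, sandbox, prompt: str, max_rounds: int = 8):
    """Alternate model text with sandboxed code execution.

    Returns the full transcript and the number of tool calls; a
    tool-call count that drifts toward 0-1 over RL training is the
    signature of interaction collapse.
    """
    transcript = prompt
    tool_calls = 0
    for _ in range(max_rounds):
        # The model continues the reasoning and may open a <code> block;
        # generation stops when the block is closed.
        completion = model.generate(transcript, stop=["</code>"])
        if "<code>" not in completion:
            # Text-only finish: the model answered without a tool call.
            return transcript + completion, tool_calls
        code = completion.split("<code>", 1)[1]
        # Execute in the sandbox and feed the result back, giving the
        # model external feedback to pin down intermediate steps.
        result = sandbox.run(code)
        transcript += f"{completion}</code>\n<output>{result}</output>\n"
        tool_calls += 1
    return transcript, tool_calls
```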