Nemotron-Cascade 2: Post-Training LLMs with Cascade RL and Multi-Domain On-Policy Distillation
Abstract
We introduce Nemotron-Cascade 2, an open 30B MoE model with 3B activated parameters that delivers best-in-class reasoning and strong agentic capabilities. Despite its compact size, its mathematical and coding reasoning performance approaches that of frontier open models. It is the second open-weight LLM, after DeepSeek-V3.2-Speciale-671B-A37B, to achieve Gold Medal-level performance in the 2025 International Mathematical Olympiad (IMO), the International Olympiad in Informatics (IOI), and the ICPC World Finals, demonstrating remarkably high intelligence density with 20× fewer parameters. Compared with Nemotron-Cascade 1, the key technical advances are as follows. After SFT on a meticulously curated dataset, we substantially expand Cascade RL to cover a much broader spectrum of reasoning and agentic domains. Furthermore, we introduce multi-domain on-policy distillation from the strongest intermediate teacher models for each domain throughout the Cascade RL process, allowing us to efficiently recover benchmark regressions and sustain strong performance gains along the way. We release the collection of model checkpoints and training data.
One-sentence Summary
Nemotron-Cascade 2 is an open 30B MoE model with 3B activated parameters that delivers best-in-class reasoning and strong agentic capabilities while achieving Gold Medal-level performance in the 2025 International Mathematical Olympiad, International Olympiad in Informatics, and ICPC World Finals with 20× fewer parameters than frontier models through expanded Cascade RL and multi-domain on-policy distillation, accompanied by the release of model checkpoints and training data.
Key Contributions
- Nemotron-Cascade 2 is an open 30B MoE model with 3B activated parameters that achieves Gold Medal-level performance in the 2025 IMO, IOI, and ICPC World Finals. This work demonstrates remarkably high intelligence density with 20 times fewer parameters than comparable frontier open models.
- Cascade RL is substantially expanded to cover a much broader spectrum of reasoning and agentic domains after supervised fine-tuning on a meticulously curated dataset. This expansion allows the system to sustain strong performance gains across diverse tasks.
- Multi-domain on-policy distillation from the strongest intermediate teacher models is introduced throughout the Cascade RL process to efficiently recover benchmark regressions. The model weights, training data, and methodological details are fully open sourced to enable community reproduction.
Introduction
Reinforcement learning has become essential for enhancing LLM reasoning and agentic capabilities, yet scaling these processes across diverse real-world tasks often destabilizes training. Previous Cascade RL frameworks mitigated catastrophic forgetting but struggled to maintain performance gains when navigating increasingly complex environments. The authors address these challenges with Nemotron-Cascade 2, an open 30B MoE model that expands Cascade RL to a broader spectrum of domains. They integrate multi-domain on-policy distillation from intermediate teacher models to recover benchmark regressions and sustain performance gains. This approach enables the compact model to achieve gold medal-level performance in mathematics and coding with 20 times fewer parameters than frontier competitors.
Dataset
- Mathematics: Non-proof prompts are sourced from Nemotron-Cascade and Nemotron-Math-v2, totaling 4.4M samples split between 1.8M tool-calling and 2.6M non-tool categories. Proof data expands 98K AoPS problems into 816K samples for proof generation and verification using DeepSeek-V3.2-Speciale.
- Code Reasoning: Approximately 165K unique coding prompts from competitive platforms undergo strict deduplication, removing 24.2% of redundant entries. The final set includes 1.9M Python and 1.0M C++14 reasoning traces, plus 1.3M Python tool-calling traces. Scientific coding adds 1.1M samples across biology, physics, and chemistry.
- Science and Long Context: Science prompts combine 1.4M samples from Nemotron-Cascade with 1.3M from Nemotron-3-Nano. Long context training leverages 160K samples with an average length of 128K tokens and 74K samples averaging 29K tokens.
- General Chat and Instructions: General chat data comprises 4.9M reasoning-on and 372K reasoning-off samples, augmented by 700K synthesized multi-turn conversations. Instruction following subsets prioritize objective verifiability, drawing from Nemotron-Cascade and Nemotron-3-Nano to ensure strict adherence to constraints.
- Agentic Tasks: Conversational tool-use includes 822K multi-turn samples. Software Engineering data mixes 125K agentic trajectories with 389K agentless samples for tasks like code repair. Terminal agent tasks total 490K samples created via the Terminal-Task-Gen methodology.
- Safety and RL: Safety training uses 4K samples to model appropriate refusal behavior. Code Reinforcement Learning filters the Nemotron-Cascade corpus down to 3.5K high-difficulty prompts that the teacher model failed to solve correctly in all rollouts.
- Processing Details: The pipeline employs I/O fingerprinting and n-gram analysis for deduplication. Teacher models like GPT-OSS-120B verify correctness for coding traces, while multi-turn dialogues are synthesized using role-playing setups to prevent repetitive exchanges.
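As a concrete illustration of the n-gram deduplication step described in the processing details, here is a minimal Python sketch. The 13-gram size and the 0.6 overlap threshold are illustrative assumptions, not values reported by the authors.

```python
def ngrams(text: str, n: int = 13):
    """Yield word-level n-grams of a prompt (n=13 is an assumed, illustrative size)."""
    tokens = text.lower().split()
    for i in range(len(tokens) - n + 1):
        yield tuple(tokens[i:i + n])

def dedup_prompts(prompts: list[str], n: int = 13, overlap_threshold: float = 0.6) -> list[str]:
    """Keep a prompt only if it shares fewer than `overlap_threshold` of its
    n-grams with any previously kept prompt (threshold is an assumption)."""
    seen_ngrams: set[tuple] = set()
    kept = []
    for prompt in prompts:
        grams = set(ngrams(prompt, n))
        if not grams:
            kept.append(prompt)
            continue
        overlap = len(grams & seen_ngrams) / len(grams)
        if overlap < overlap_threshold:
            kept.append(prompt)
            seen_ngrams |= grams
    return kept
```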
Method
The training framework begins with Supervised Fine-Tuning (SFT) to equip the model with foundational capabilities across mathematics, coding, science, and agentic tasks. The authors employ a specific chat template designed to streamline interaction modes. They drop explicit reasoning-mode tags for simplicity; non-thinking mode is instead activated by prepending an empty thinking block to the response. For tool-calling tasks, the system prompt explicitly lists the available functions and instructs the model to wrap tool invocations within specific tags.
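As a rough illustration of that template behavior (an empty thinking block to activate non-thinking mode, tools listed in the system prompt, tool calls wrapped in dedicated tags), here is a hypothetical sketch. The special-token and tag names (`<|system|>`, `<think>`, `<tool_call>`) are assumptions, not the released template.

```python
import json

def build_prompt(system: str, user: str, tools: list[dict] | None = None,
                 thinking: bool = True) -> str:
    """Hypothetical rendering of the described chat template; tag names and
    layout are assumptions, not the released format."""
    if tools:
        # The system prompt explicitly lists available functions and instructs
        # the model to wrap tool invocations in dedicated tags.
        system += ("\n\nYou may call the following tools by emitting "
                   "<tool_call>{...}</tool_call>:\n" + json.dumps(tools, indent=2))
    prompt = f"<|system|>{system}<|user|>{user}<|assistant|>"
    if not thinking:
        # Prepending an empty thinking block switches the model to non-thinking mode.
        prompt += "<think></think>"
    return prompt
```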
Following SFT, the authors implement a Cascaded Reinforcement Learning (Cascade RL) pipeline. This sequential, domain-wise ordering is designed to mitigate catastrophic forgetting as the model interacts with increasingly diverse environments. The process starts with Instruction-Following RL (IF-RL) to establish foundational instruction adherence. This is followed by Multi-domain RL to enhance tool-calling, STEM reasoning, and response format adherence. To unify specialized expertise and mitigate performance degradation, the pipeline incorporates Multi-domain On-policy Distillation (MOPD). Subsequent stages include Reinforcement Learning from Human Feedback (RLHF) for alignment, Long-context RL for reasoning over massive sequences, Code RL for competitive coding, and Software Engineering RL for mastering agentic software interactions.
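The sketch below simply encodes this stage ordering as a sequential driver; the stage names follow the text, while the data structure and driver function are illustrative.

```python
# Cascade RL stages in the order described above; each later stage starts
# from the checkpoint produced by the previous one (structure is illustrative).
CASCADE_RL_STAGES = [
    "IF-RL",            # instruction-following RL
    "Multi-domain RL",  # tool-calling, STEM reasoning, response-format adherence
    "MOPD",             # multi-domain on-policy distillation from intermediate teachers
    "RLHF",             # human-preference alignment via a generative reward model
    "Long-context RL",  # reasoning over very long sequences
    "Code RL",          # competitive coding
    "SWE RL",           # agentic software-engineering tasks
]

def run_cascade(initial_checkpoint, train_stage):
    """Run the stages sequentially, chaining checkpoints (hypothetical driver)."""
    ckpt = initial_checkpoint
    for stage in CASCADE_RL_STAGES:
        ckpt = train_stage(stage, ckpt)
    return ckpt
```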
The ordering of these stages is determined by the need to minimize negative interference across domains. For instance, IF-RL is prioritized early because it can negatively impact human alignment, which is subsequently recovered by RLHF. MOPD serves as a critical stabilization point to recover benchmark performance that may have regressed during specialized stages. Throughout the entire Cascade RL process, the authors use the Group Relative Policy Optimization (GRPO) algorithm with strict on-policy training. They remove the KL divergence term, reducing the objective to standard REINFORCE with group-normalized rewards and a token-level loss. The loss function is defined as:
$$
\mathcal{J}_{\mathrm{GRPO}}(\theta)=\mathbb{E}_{(q,a)\sim\mathcal{D},\,\{o_i\}_{i=1}^{G}\sim\pi_\theta(\cdot\mid q)}\left[\frac{1}{\sum_{i=1}^{G}|o_i|}\sum_{i=1}^{G}\sum_{t=1}^{|o_i|}\hat{A}_{i,t}\right]
$$

where the advantage $\hat{A}_{i,t}$ is computed using group-normalized rewards. For RLHF, the reward is aggregated from a generative reward model, while for other domains it is verified against ground-truth answers.
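To make the group-normalized advantage concrete, the sketch below shows one way $\hat{A}_{i,t}$ could be computed for a group of $G$ rollouts of the same prompt: every token of rollout $i$ receives that rollout's reward minus the group mean, divided by the group standard deviation. The epsilon constant and the NumPy formulation are assumptions, not the authors' implementation.

```python
import numpy as np

def group_normalized_advantages(rewards: np.ndarray, lengths: np.ndarray,
                                eps: float = 1e-6) -> list[np.ndarray]:
    """Token-level advantages for one prompt group.

    rewards: shape (G,), one scalar reward per rollout (verifier or RM score).
    lengths: shape (G,), number of tokens |o_i| in each rollout.
    Each token of rollout i gets the same group-normalized advantage;
    eps is an assumed constant for numerical stability.
    """
    adv = (rewards - rewards.mean()) / (rewards.std() + eps)
    return [np.full(int(L), a) for a, L in zip(adv, lengths)]

# Example: 4 rollouts for one prompt, binary verifier rewards.
advs = group_normalized_advantages(np.array([1.0, 0.0, 0.0, 1.0]),
                                   np.array([120, 95, 210, 80]))
```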
Experiment
Nemotron-Cascade-2 was evaluated across a comprehensive suite of benchmarks covering mathematical reasoning, coding, long-context understanding, and alignment to validate the effectiveness of its Cascade RL and MOPD training pipeline. The model achieves gold-medal performance on top-tier competitions such as IMO 2025 and IOI 2025, outperforming larger frontier models despite its compact 30B MoE scale. Furthermore, the experiments demonstrate substantial training efficiency advantages over standard methods and show that agentless reinforcement learning generalizes code repair capabilities to broader agentic tasks.
The authors evaluate Nemotron-Cascade-2-30B-A3B on the LiveCodeBench Pro 25Q1 Medium benchmark. The compact 30B model significantly outperforms similarly sized competitors, including a substantial improvement over Qwen3.5-35B-A3B, while matching the results of much larger models with hundreds of billions of parameters, such as Kimi-K2.5-1T, on this medium-difficulty coding benchmark.
The authors evaluate Nemotron-Cascade-2 on three prestigious competitions: IMO 2025, IOI 2025, and the ICPC World Finals 2025. The model earns a gold medal in all three, demonstrating strong mathematical reasoning and competitive coding despite its compact size. At IMO 2025 it solves the majority of the problems; at IOI 2025 it posts a high overall score, showing strong algorithmic problem solving; and at the ICPC World Finals 2025 it solves nearly all problems, validating its effectiveness in complex coding environments.
The authors evaluate the model on 40 Codeforces rounds spanning Div 1 and Div 2, detailing problem-level scores and aggregate metrics. The model consistently solves the initial easy problems across nearly all contests and secures first-place estimated ranks in multiple Div 2 contests, while Div 1 rounds generally yield higher estimated Elo ratings than Div 2 rounds, demonstrating strong reasoning on difficult tasks.
The authors evaluate Nemotron-Cascade-2-30B-A3B on competitive coding benchmarks including LiveCodeBench and Codeforces, where its performance rivals or exceeds that of much larger open-source models with over 100 billion parameters. Tool-Integrated Reasoning consistently boosts scores across the Easy, Medium, and Hard difficulty categories relative to the standard configuration, and it yields non-zero accuracy on hard problems where larger baseline models frequently score zero.
The authors evaluate the Nemotron-Cascade-2-30B-A3B model against several strong baselines, showing it achieves state-of-the-art performance in mathematical and coding reasoning tasks. Results indicate this model outperforms both a larger 120B parameter variant and a similarly sized Qwen3.5 model across most reasoning and alignment benchmarks. However, the data suggests it lags behind the Qwen3.5 baseline in specific knowledge-intensive and agentic domains. The model achieves gold-medal level results in major mathematical and coding competitions like IMO 2025 and IOI 2025. It consistently outperforms the larger Nemotron-3-Super-120B-A12B model in math and code reasoning categories. The model shows superior instruction following and alignment scores compared to the Qwen3.5-35B-A3B baseline.
The Nemotron-Cascade-2-30B-A3B model was evaluated on prestigious benchmarks including LiveCodeBench, Codeforces, and major mathematical competitions like IMO and IOI 2025. Results indicate the compact model achieves gold medal status and outperforms similarly sized competitors while matching the performance of significantly larger frontier models, especially when utilizing Tool-Integrated Reasoning for difficult problems. Although it shows slight limitations in specific knowledge-intensive domains, the architecture delivers state-of-the-art efficiency and alignment across complex coding and mathematical tasks.