DVAO: Dynamic Variance-adaptive Advantage Optimization for Multi-reward Reinforcement Learning
Abstract
Dynamic Variance-adaptive Advantage Optimization (DVAO) addresses training instability in multi-reward reinforcement learning by adaptively weighting objectives based on empirical reward variance, maintaining bounded advantage magnitudes and improving multi-objective performance.
Reinforcement Learning has become a standard paradigm for aligning Large Language Models with human intent and task requirements. While Group Relative Policy Optimization offers an efficient, value-model-free alternative to Proximal Policy Optimization, adapting it to real-world multi-reward settings remains challenging. Standard scalarization practices, such as Reward Combination and Advantage Combination, suffer from significant drawbacks: Reward Combination frequently generates advantages with excessively large squared magnitudes that lead to training instability, while Advantage Combination relies on static hyperparameters and ignores cross-objective correlations. To address these limitations, we propose Dynamic Variance-adaptive Advantage Optimization (DVAO), which dynamically adjusts combination weights based on the empirical reward variance of each objective within a rollout group, effectively up-weighting objectives with a stronger learning signal while suppressing noisy ones. We mathematically prove that DVAO maintains bounded advantage magnitudes for stable training and introduces a self-adaptive cross-objective regularization mechanism. Extensive experiments on mathematical reasoning and tool-use benchmarks using Qwen3 and Qwen2.5 models demonstrate that DVAO significantly outperforms baseline methods, achieving a superior multi-objective Pareto frontier and robust training stability.
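To make the three combination schemes named in the abstract concrete, here is a minimal NumPy sketch for a single rollout group. The abstract does not state DVAO's weighting formula, so `variance_adaptive_weights` is a hypothetical stand-in that sets each weight proportional to the objective's within-group reward standard deviation. All function names, the `eps` constant, and the toy reward matrix are illustrative assumptions, not the authors' implementation.

```python
import numpy as np


def group_normalize(x, eps=1e-8):
    """GRPO-style normalization of one reward vector within a rollout group."""
    return (x - x.mean()) / (x.std() + eps)


def reward_combination(rewards, weights):
    """Scalarize the rewards first, then normalize the combined reward."""
    return group_normalize(rewards @ weights)


def advantage_combination(rewards, weights):
    """Normalize each objective separately, then mix the advantages."""
    per_objective = np.stack(
        [group_normalize(rewards[:, k]) for k in range(rewards.shape[1])],
        axis=1,
    )
    return per_objective @ weights


def variance_adaptive_weights(rewards, eps=1e-8):
    """ASSUMPTION: weights proportional to each objective's std in the group.

    This is an illustrative choice only; the paper's actual rule may differ.
    """
    std = rewards.std(axis=0)
    return std / (std.sum() + eps)


def variance_adaptive_combination(rewards):
    """Advantage combination with weights recomputed per rollout group."""
    return advantage_combination(rewards, variance_adaptive_weights(rewards))


# Toy group: 4 rollouts, 2 objectives. The second objective is constant
# across the group, so it carries no within-group learning signal.
rewards = np.array([
    [1.0, 0.5],
    [0.0, 0.5],
    [1.0, 0.5],
    [0.0, 0.5],
])

weights = variance_adaptive_weights(rewards)
advantages = variance_adaptive_combination(rewards)
```

In this toy group the constant objective receives zero weight, so the combined advantage is driven entirely by the objective that actually separates the rollouts. With static weights, as in plain Advantage Combination, that allocation would stay fixed regardless of what the group looks like.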
Community
i really like the idea of turning multi-reward rl into a dynamic, variance-aware game instead of brittle fixed weights. the core move — weighting per-objective signal by its rollout variance to upweight stronger learning signals while dampening noise — feels like a practical fix to the exploding gradients problem in reward combinations. my one gripe is how well this captures cross-objective correlations in practice: when objectives are correlated, variance shrinks and the weights might underrepresent synergistic signals; would love an ablation where you disable the cross-objective coupling to see the isolated effect. the arxivlens breakdown helped me parse the method details and covers the variance-adaptive part well, here: https://arxivlens.com/PaperView/Details/dvao-dynamic-variance-adaptive-advantage-optimization-for-multi-reward-reinforcement-learning-7084-8c9fb3eb. overall, the claimed bounded advantages and implicit regularization match the empirical gains they report, but i want to see how this scales to even messier, real-world reward structures beyond math reasoning and tool use.
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- GFT: From Imitation to Reward Fine-Tuning with Unbiased Group Advantages and Dynamic Coefficient Rectification (2026)
- Adapt to Thrive! Adaptive Power-Mean Policy Optimization for Improved LLM Reasoning (2026)
- EXPO: Exploration-Prioritized Policy Optimization via Adaptive KL Regulation and Gaussian Curriculum Sampling (2026)
- One-Way Policy Optimization for Self-Evolving LLMs (2026)
- RVPO: Risk-Sensitive Alignment via Variance Regularization (2026)
- Policy Improvement Reinforcement Learning (2026)
- Bridging SFT and RL: Dynamic Policy Optimization for Robust Reasoning (2026)