arxiv:2605.25604

DVAO: Dynamic Variance-adaptive Advantage Optimization for Multi-reward Reinforcement Learning

Published on May 25 · Submitted by JGC on May 26

#1 Paper of the day

Authors: Guochao Jiang, Guofeng Quan, Chuzhan Hao

Abstract

Dynamic Variance-adaptive Advantage Optimization (DVAO) addresses training instability in multi-reward reinforcement learning by adaptively weighting objectives based on empirical reward variance, maintaining bounded advantage magnitudes and improving multi-objective performance.

AI-generated summary

Reinforcement Learning has become a standard paradigm for aligning Large Language Models with human intent and task requirements. While Group Relative Policy Optimization offers an efficient, value-model-free alternative to Proximal Policy Optimization, adapting it to real-world multi-reward settings remains challenging. Standard scalarization practices, such as Reward Combination and Advantage Combination, suffer from significant drawbacks: Reward Combination frequently generates advantages with excessively large squared magnitudes that lead to training instability, while Advantage Combination relies on static hyperparameters and ignores cross-objective correlations. To address these limitations, we propose Dynamic Variance-adaptive Advantage Optimization (DVAO), which dynamically adjusts combination weights based on the empirical reward variance of each objective within a rollout group, effectively up-weighting objectives with a stronger learning signal while suppressing noisy ones. We mathematically prove that DVAO maintains bounded advantage magnitudes for stable training and introduces a self-adaptive cross-objective regularization mechanism. Extensive experiments on mathematical reasoning and tool-use benchmarks using Qwen3 and Qwen2.5 models demonstrate that DVAO significantly outperforms baseline methods, achieving a superior multi-objective Pareto frontier and robust training stability.
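The three scalarization strategies named in the abstract can be contrasted with a minimal NumPy sketch. It is an illustration under stated assumptions, not the paper's implementation. The abstract does not give DVAO's weighting formula, so `variance_adaptive_weights` uses a hypothetical rule (weights proportional to each objective's reward variance within the rollout group). All function names, the `EPS` constant, and the uniform fallback for a group with no variance are likewise assumptions made for this example.

```python
import numpy as np

EPS = 1e-8  # assumed guard against division by zero for constant rewards


def group_normalize(x):
    """GRPO-style normalization over the rollout axis (axis 0)."""
    return (x - x.mean(axis=0)) / (x.std(axis=0) + EPS)


def reward_combination(rewards, weights):
    """Scalarize the rewards first, then normalize the combined reward."""
    return group_normalize(rewards @ weights)


def advantage_combination(rewards, weights):
    """Normalize each objective separately, then mix with static weights."""
    return group_normalize(rewards) @ weights


def variance_adaptive_weights(rewards):
    """Hypothetical rule: weights proportional to per-objective group variance.

    This stands in for the weighting described in the abstract; the paper's
    actual formula may differ.
    """
    var = rewards.var(axis=0)
    total = var.sum()
    if total < EPS:
        # No objective varies within the group: fall back to uniform weights.
        return np.full(rewards.shape[1], 1.0 / rewards.shape[1])
    return var / total


def variance_adaptive_combination(rewards):
    """Mix per-objective advantages with weights recomputed for each group."""
    return group_normalize(rewards) @ variance_adaptive_weights(rewards)


if __name__ == "__main__":
    # 4 rollouts, 2 objectives; the second objective is constant in this group.
    rewards = np.array([[1.0, 0.5], [0.0, 0.5], [1.0, 0.5], [0.0, 0.5]])
    static = np.array([0.5, 0.5])
    print("reward combination:   ", reward_combination(rewards, static))
    print("advantage combination:", advantage_combination(rewards, static))
    print("adaptive weights:     ", variance_adaptive_weights(rewards))
    print("adaptive combination: ", variance_adaptive_combination(rewards))
```

In this toy group the second objective has zero variance, so the assumed rule gives it zero weight and the combined advantage comes entirely from the first objective. With static weights of 0.5 each, Advantage Combination instead halves the first objective's signal.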


Community

Nothing2Say

Paper author Paper submitter 2 days ago

We propose Dynamic Variance-adaptive Advantage Optimization (DVAO), which dynamically adjusts combination weights based on the empirical reward variance of each objective within a rollout group, effectively up-weighting objectives with a stronger learning signal while suppressing noisy ones.

avahal

1 day ago

i really like the idea of turning multi-reward rl into a dynamic, variance-aware game instead of brittle fixed weights. the core move — weighting per-objective signal by its rollout variance to upweight stronger learning signals while dampening noise — feels like a practical fix to the exploding gradients problem in reward combinations. my one gripe is how well this captures cross-objective correlations in practice: when objectives are correlated, variance shrinks and the weights might underrepresent synergistic signals; would love an ablation where you disable the cross-objective coupling to see the isolated effect. the arxivlens breakdown helped me parse the method details and covers the variance-adaptive part well, here: https://arxivlens.com/PaperView/Details/dvao-dynamic-variance-adaptive-advantage-optimization-for-multi-reward-reinforcement-learning-7084-8c9fb3eb. overall, the claimed bounded advantages and implicit regularization match the empirical gains they report, but i want to see how this scales to even messier, real-world reward structures beyond math reasoning and tool use.