DenseR: Dense Rewards For Free in LLM Reasoning

Community Article Published February 18, 2026

GRPO works. You sample a bunch of completions, check which ones got the right answer, and push the model toward the winners. No reward model, no critic network — just rollouts and a binary signal. It's the backbone of DeepSeek-R1 and a growing list of open reasoning models.

But here's the thing: when a model writes 500 tokens to solve a math problem and gets it right, GRPO rewards every single token equally. The brilliant insight on line 3? Same reward as the filler "Let me think step by step" on line 1. And when another completion gets the answer wrong, every token gets penalized — including the first four steps that were perfectly correct before an arithmetic slip on step five.

GRPO has no way to tell which tokens actually mattered. Consider this example:

*(Figure: the four example rollouts y₁–y₄ with their rewards.)*

GRPO assigns A = +1 to y₁ and y₂, and A = −1 to y₃ and y₄. Every token in y₃ gets the same penalty — but "Subtract 3 from both sides" is a perfectly correct first step! The actual mistake is "2x = 10" — one token where the model slipped up. Meanwhile, y₂ found a genuinely different approach to the answer, but GRPO gives it identical credit to y₁. And y₄, which used a fundamentally wrong strategy from the start, gets the same flat penalty as y₃'s arithmetic slip.

If we look carefully, there's actually a rich signal hiding in plain sight:

  • y₁ and y₃ start identically — "Subtract 3 from both sides: 2x = ..." — then diverge at a single token ("4" vs "10"). That divergence point is exactly where y₃ went wrong. If the model could see this, it would know to focus the penalty there, not on the correct setup.
  • y₂ is doing something different from y₁, even though both are correct. That uniqueness is valuable — it means y₂ is contributing a reasoning strategy the model hasn't already reinforced through y₁.
  • y₄ is doing something different from y₃ — not just a small error, but a completely wrong approach. That's a worse kind of failure and deserves a sharper penalty.

This is all information that GRPO throws away. The rollouts already contain token-level structure about where the reasoning paths split, which correct completions are novel, and how different the failures are from each other. It's free (and dense) supervision!

In this post, I propose a simple and intuitive approach: DenseR! The key observation 💡 is that at every token, the model produces an internal representation — a snapshot of its "thinking" at that point. When two completions are reasoning similarly, these snapshots look alike. When one completion goes right and the other goes wrong, the snapshots suddenly look very different. That's your decision point.

So DenseR simply asks: for each token in a completion, how different was the model's internal state from completions that ended up on the other side? Tokens where the answer is "very different" get more weight. Tokens where the answer is "basically the same" get less. The result is that GRPO's flat, per-completion advantage becomes a shaped, per-token signal — without any additional models, annotations, or inference cost.

How GRPO Learns (and Where It Falls Short)

Given a prompt \(x\), GRPO samples a group of \(G\) completions \(\{y_1, y_2, \ldots, y_G\}\) from the current policy \(\pi_{\theta_{\text{old}}}\) and scores each with a binary reward — correct or wrong. The advantage for each completion is:

$$A_i = \frac{r_i - \text{mean}(\mathbf{r})}{\text{std}(\mathbf{r})}$$

The GRPO objective maximizes a clipped surrogate, similar to PPO but without a value network:

$$\mathcal{L}_{\text{GRPO}} = \frac{1}{G} \sum_{i=1}^{G} \frac{1}{T} \sum_{t=1}^{T} \min\left( \rho_{i,t} \cdot A_i, \; \text{clip}(\rho_{i,t}, 1-\epsilon, 1+\epsilon) \cdot A_i \right)$$

where \(\rho_{i,t}\) is the per-token importance ratio between the updated policy and the old policy:

$$\rho_{i,t} = \frac{\pi_\theta(y_{i,t} \mid x, y_{i,<t})}{\pi_{\theta_{\text{old}}}(y_{i,t} \mid x, y_{i,<t})}$$

When the number of optimization steps per generation is 1, \(\pi_\theta = \pi_{\theta_{\text{old}}}\) at the start of the update, so \(\rho_{i,t} = 1\) and the clipping has no effect. The gradient simplifies to:

$$\nabla \mathcal{L}_{\text{GRPO}} = \frac{1}{G} \sum_{i=1}^{G} \frac{1}{T} \sum_{t=1}^{T} A_i \cdot \nabla \log \pi_\theta(y_{i,t} \mid x, y_{i,<t})$$

Let's make this concrete. Consider the prompt "Solve \(2x + 3 = 7\)" with \(G = 4\) completions:

| Completion | Reward | Advantage |
|---|---|---|
| \(y_1\): "Subtract 3 from both sides: \(2x = 4\). Divide by 2: \(x = 2\)" | 1 | +1.0 |
| \(y_2\): "Move 3 to the right: \(2x = 7 - 3 = 4\), so \(x = 4/2 = 2\)" | 1 | +1.0 |
| \(y_3\): "Subtract 3 from both sides: \(2x = 10\). Divide by 2: \(x = 5\)" | 0 | −1.0 |
| \(y_4\): "Divide everything by 2: \(x + 3 = 3.5\), so \(x = 0.5\)" | 0 | −1.0 |

Notice the problem: \(A_i\) is a single scalar that multiplies every token equally. For \(y_1\), the word "Subtract" gets the same advantage multiplier as the critical arithmetic step "\(2x = 4\)" — even though one is boilerplate setup and the other is where the answer was actually determined. For \(y_3\), the same setup "Subtract 3 from both sides" is also penalized with the same multiplier — even though that step was correct; only "\(2x = 10\)" was wrong.

This is sparse supervision: one reward signal, spread uniformly across hundreds of tokens.
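
To make the sparsity concrete, here is a minimal NumPy sketch — my own illustration, not the training code — that reproduces the advantage column of the table above and broadcasts it across every token (the token counts are made up):

```python
import numpy as np

# Binary rewards for y1..y4 from the table above.
rewards = np.array([1.0, 1.0, 0.0, 0.0])

# GRPO advantage: group-normalized reward.
advantages = (rewards - rewards.mean()) / rewards.std()
print(advantages)                     # [ 1.  1. -1. -1.]

# Toy token counts for the four completions (real rollouts have hundreds).
token_counts = [14, 16, 14, 15]

# Every token in a completion receives the same scalar advantage.
per_token_adv = [np.full(T, A) for T, A in zip(token_counts, advantages)]
print(per_token_adv[2][:5])           # y3's first tokens: [-1. -1. -1. -1. -1.]
```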

DenseR: Making It Dense

DenseR introduces a per-token weight \(w_{i,t}\) that modulates how much each token contributes to learning. The objective becomes:

$$\mathcal{L}_{\text{DenseR}} = \frac{1}{G} \sum_{i=1}^{G} \frac{1}{T} \sum_{t=1}^{T} \min\left( \rho_{i,t} \cdot w_{i,t} \cdot A_i, \; \text{clip}(\rho_{i,t}, 1-\epsilon, 1+\epsilon) \cdot w_{i,t} \cdot A_i \right)$$

Compared to GRPO, the only change is the \(w_{i,t}\) factor inside the objective. We can equivalently define a token-level advantage \(A_{i,t} = A_i \cdot w_{i,t}\), so when the number of steps per generation is 1, the gradient simplifies to:

$$\nabla \mathcal{L}_{\text{DenseR}} = \frac{1}{G} \sum_{i=1}^{G} \frac{1}{T} \sum_{t=1}^{T} A_{i,t} \cdot \nabla \log \pi_\theta(y_{i,t} \mid x, y_{i,<t})$$

The weights are computed by contrasting the model's own hidden representations across completions — no extra model and no annotation. Since the hidden states are already computed during the generation forward pass, the only overhead is the similarity computation itself. The weights are normalized to mean 1 per completion, so the total gradient magnitude stays the same. DenseR only redistributes where that gradient goes.
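
To see numerically what "redistributes rather than rescales" means, here is a tiny sketch with made-up per-token weights (the actual contrastive weights are defined in the next subsection):

```python
import numpy as np

A_i = -1.0   # completion-level advantage for y3
# Hypothetical raw per-token weights: low on the correct setup tokens,
# high around the arithmetic slip ("2x = 10") and its consequences.
w_raw = np.array([0.2, 0.2, 0.2, 0.2, 3.0, 2.0, 1.5, 1.0])

w = w_raw / w_raw.mean()      # normalize to mean 1 per completion
A_it = A_i * w                # token-level advantages A_{i,t} = A_i * w_{i,t}

print(np.round(A_it, 2))      # mild penalty on the setup, sharp penalty at the slip
print(round(A_it.mean(), 6))  # -1.0: the average per-token advantage is unchanged
```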

How the weights are computed

The weight \(w_{i,t}\) is a blend between uniform and a contrastive signal:

$$w_{i,t} = \beta \cdot \hat{c}_{i,t} + (1 - \beta)$$

where \(\hat{c}_{i,t}\) is the mean-1 normalized contrastive signal and \(\beta\) controls how much redistribution to apply. At \(\beta = 0\), every token gets equal weight (vanilla GRPO); at \(\beta = 1\), the weights are fully guided by the contrastive signal.

The contrastive signal \(c_{i,t}\) itself combines two components:

$$c_{i,t} = \alpha_{\text{cross}} \cdot d^{\text{cross}}_{i,t} + \alpha_{\text{within}} \cdot d^{\text{within}}_{i,t}$$

The raw scores \(c_{i,t}\) are then normalized to have mean 1 within each completion, giving \(\hat{c}_{i,t}\):

$$\hat{c}_{i,t} = \frac{c_{i,t}}{\frac{1}{T} \sum_{t'=1}^{T} c_{i,t'}}$$

Here, \(\alpha_{\text{cross}}\) and \(\alpha_{\text{within}}\) control the relative influence of cross-class and within-class divergence. Setting \(\alpha_{\text{cross}} = 0\) uses only within-class uniqueness; setting \(\alpha_{\text{within}} = 0\) uses only cross-class divergence. The balance between the two determines whether the model focuses more on identifying where correct and incorrect paths split apart, or on upweighting novel strategies and distinct failure modes.
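
Putting the last three formulas together, here is a minimal sketch of turning raw per-token divergences into the final weights; the divergence values are toy numbers and the function name is mine:

```python
import numpy as np

def denser_weights(d_cross, d_within, alpha_cross=1.0, alpha_within=0.3, beta=0.1):
    """Blend cross-/within-class divergences into mean-1 per-token weights."""
    c = alpha_cross * d_cross + alpha_within * d_within   # contrastive signal c_{i,t}
    c_hat = c / c.mean()                                   # mean-1 normalization
    return beta * c_hat + (1.0 - beta)                     # blend with uniform

# Toy per-token divergences for an 8-token completion: the last four tokens
# sit after the point where this completion split from the other class.
d_cross  = np.array([0.1, 0.1, 0.1, 0.1, 0.9, 0.8, 0.7, 0.6])
d_within = np.array([0.2, 0.2, 0.3, 0.3, 0.4, 0.4, 0.5, 0.5])

w = denser_weights(d_cross, d_within)
print(np.round(w, 3))       # < 1 on the shared setup, > 1 after the divergence point
print(round(w.mean(), 6))   # 1.0 — the total gradient magnitude is only redistributed
```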

How do we measure these? At each token position, the model's last decoder layer produces a hidden state vector — a high-dimensional snapshot of what the model is "thinking" at that point. We compare these snapshots across completions using cosine similarity: if two tokens have similar hidden states, the model is reasoning similarly at those positions. If they differ, something changed.

Since completions can vary in length, we use a windowed alignment scheme. When computing divergence between two completions, each token is compared against a window of ±5 positions around its proportionally aligned position in the other completion. For example, if token 50 in a 100-token completion is being compared against an 80-token completion, its aligned center is position 40 (50 × 80/100), and it searches positions 35–45 for the best match — 11 candidates total. This keeps comparisons local and meaningful even when completions differ substantially in length.
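
As a sketch of this alignment scheme — assuming each completion's last-layer hidden states are available as a (length × hidden_dim) array, with function and variable names of my own choosing:

```python
import numpy as np

def best_window_similarity(h_t, t, len_a, other_hidden, window=5):
    """Compare token t of completion A (hidden state h_t) against a ±window
    slice of completion B around the proportionally aligned position, and
    return the best cosine similarity found in that window."""
    len_b = other_hidden.shape[0]
    center = int(round(t * len_b / len_a))                  # proportional alignment
    lo, hi = max(0, center - window), min(len_b, center + window + 1)
    candidates = other_hidden[lo:hi]                         # up to 2*window + 1 states

    sims = candidates @ h_t / (
        np.linalg.norm(candidates, axis=1) * np.linalg.norm(h_t) + 1e-8
    )
    return sims.max()

# Toy example matching the text: token 50 of a 100-token completion is compared
# against positions 35-45 of an 80-token completion (11 candidates).
rng = np.random.default_rng(0)
hidden_a = rng.normal(size=(100, 64))   # stand-ins for last-layer hidden states
hidden_b = rng.normal(size=(80, 64))

sim = best_window_similarity(hidden_a[50], t=50, len_a=100, other_hidden=hidden_b)
print(round(float(sim), 3))
```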

Cross-class divergence: finding the decision point

\(d^{\text{cross}}_{i,t}\) measures how different token \(t\)'s hidden representation is from completions of the opposite class.

Back to our example — compare \(y_1\) (correct) with \(y_3\) (wrong):

*(Figure: token-by-token comparison of \(y_1\) and \(y_3\); the hidden states diverge at "= 4" vs. "= 10".)*

Token by token, the hidden states are nearly identical through "Subtract 3 from both sides: 2x = ..." — same reasoning, same approach. At "\(= 4\)" vs "\(= 10\)" the representations diverge sharply. Everything after that differs too, but only as a consequence of that one step.

Cross-class divergence is low for the shared setup tokens and high from the decision point onward. It identifies where the correct path separated from the wrong one.

Within-class uniqueness: finding novel strategies

\(d^{\text{within}}_{i,t}\) measures how different token \(t\)'s hidden representation is from other completions in the same class.

Now compare the two correct and two wrong completions:

*(Figure: within-class comparison of the two correct completions \(y_1\), \(y_2\) and the two wrong completions \(y_3\), \(y_4\).)*

\(y_1\) and \(y_2\) use similar approaches — their within-class uniqueness is moderate. \(y_4\) uses a completely different (and wrong) strategy from \(y_3\) — its within-class uniqueness is high. This means the model will penalize \(y_4\)'s unique mistake more strongly: "dividing everything by 2 first" is a distinct failure mode worth learning to avoid.

On the positive side, if one correct completion used a creative shortcut that no other correct completion shared, within-class uniqueness would boost it — reinforcing a novel strategy the model hasn't already learned from the other completions.
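
Here is a sketch, under the same assumptions as before (per-completion last-layer hidden states, my own function names), of how both divergence terms could be computed for one completion by contrasting it against opposite-class and same-class rollouts:

```python
import numpy as np

def token_divergence(hidden, other_hiddens, window=5):
    """For each token of `hidden`, return 1 minus the best windowed cosine
    similarity, averaged over the completions in `other_hiddens`."""
    T = hidden.shape[0]
    div = np.zeros(T)
    for other in other_hiddens:
        len_o = other.shape[0]
        for t in range(T):
            center = int(round(t * len_o / T))              # proportional alignment
            lo, hi = max(0, center - window), min(len_o, center + window + 1)
            cand = other[lo:hi]
            sims = cand @ hidden[t] / (
                np.linalg.norm(cand, axis=1) * np.linalg.norm(hidden[t]) + 1e-8
            )
            div[t] += 1.0 - sims.max()                       # divergence at token t
    return div / max(len(other_hiddens), 1)

# Toy rollout hidden states: two correct completions (y1, y2) and two wrong (y3, y4).
rng = np.random.default_rng(0)
correct = [rng.normal(size=(n, 64)) for n in (14, 16)]
wrong   = [rng.normal(size=(n, 64)) for n in (14, 15)]

# For y3 (wrong): cross-class divergence against the correct group,
# within-class uniqueness against the other failure.
y3 = wrong[0]
d_cross  = token_divergence(y3, correct)
d_within = token_divergence(y3, [wrong[1]])
print(np.round(d_cross[:5], 2), np.round(d_within[:5], 2))
```

In practice the inner loops would be vectorized, but the structure — every token of every rollout compared against a small window in every other rollout — is the same.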

Experimental Setup

I train Qwen3-0.6B and Qwen3-4B base models with GRPO, with and without DenseR. Training uses 1,000 examples from open-r1/DAPO-Math-17k-Processed for a single epoch (500 RL steps), with 8 rollouts per prompt, on 2 A6000 GPUs (for the 0.6B model) and 2 H100 GPUs (for the 4B model).

DenseR defaults: For the 0.6B model, \(\alpha_{\text{cross}} = 1\) and \(\alpha_{\text{within}} = 0.3\); through a few small-scale experiments, I found that increasing the strength of the cross-class signal relative to the within-class signal was slightly beneficial. For the 4B model, \(\alpha_{\text{cross}} = 1\) and \(\alpha_{\text{within}} = 1\). I set \(\beta = 0.1\) and a window size of 5 for both models.

Learning rate: \(1 \times 10^{-6}\) for the 0.6B model and \(5 \times 10^{-7}\) for the 4B model.

Evaluation benchmarks: MATH500, AIME24, AIME25, and AMC23. I report pass@k and majority_vote@k as metrics.
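
For reference, here are the settings above collected into a single summary dict; the key names are illustrative, not the actual training script's arguments:

```python
# Illustrative summary of the settings above; key names are mine, not the script's.
denser_config = {
    "Qwen3-0.6B-Base": {"alpha_cross": 1.0, "alpha_within": 0.3, "beta": 0.1,
                        "window": 5, "learning_rate": 1e-6, "rollouts_per_prompt": 8},
    "Qwen3-4B-Base":   {"alpha_cross": 1.0, "alpha_within": 1.0, "beta": 0.1,
                        "window": 5, "learning_rate": 5e-7, "rollouts_per_prompt": 8},
}
```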

Results

Figure: Comparison between GRPO and DenseR on Qwen3-0.6B-Base.

Figure: Comparison between GRPO and DenseR on Qwen3-4B-Base.

  • On the 0.6B model, DenseR lifts MATH500 pass@1 from 32.7% to 37.9% and AMC23 pass@1 from 17.2% to 18.8%. The gap widens dramatically on harder benchmarks: AIME24 pass@1 jumps from 0.2% to 2.5%, and pass@16 from 3.3% to 23.3% — a 7× improvement. DenseR even solves AIME25 problems (pass@16 = 10%) where GRPO scores a flat zero across all k. On the 4B model, pass@1 is roughly matched, but DenseR pulls ahead at higher k: AIME25 pass@16 improves from 23.3% to 30.0%, and AIME24 pass@16 from 23.3% to 33.3%.

  • DenseR's advantage is most pronounced on smaller models (0.6B) and harder benchmarks (AIME), suggesting it's better at extracting reasoning capability from limited model capacity.

  • The pass@k advantage indicates DenseR produces more diverse correct solutions.

  • Under majority voting, GRPO narrows the gap — its solutions are less diverse but more concentrated, which majority voting rewards.

How DenseR Fits into the Bigger Picture

Different training approaches deal with supervision differently, and the trade-offs boil down to three questions: Do you need a teacher model? What gives you the learning signal? And is that signal per-token (dense) or per-completion (sparse)?

| Approach | Teacher? | What provides the signal? | Granularity |
|---|---|---|---|
| Off/On-policy distillation [1][2][3][4][5] | ✅ Larger model | Teacher token likelihoods | Dense |
| Self-on-policy distillation [6][7] | ✅ Self (answer-conditioned) | Self token likelihoods | Dense |
| GRPO [8][9] | ❌ | Answer correctness | Sparse |
| DenseR (ours) | ❌ | Answer correctness + contrastive rollout divergence | Dense |

Distillation gets you dense per-token signal, but it comes with baggage. Off-policy and on-policy distillation need a bigger teacher model — which might not exist if you're already at the frontier, or might reason in ways that don't transfer well to a smaller student. Self-on-policy distillation is more clever: the model teaches itself by conditioning on the known answer. But think about what that actually means. When the model generates token likelihoods knowing the answer is 42, it's explaining a solution it already has — not discovering one. That's a fundamentally different mode from reasoning without the answer. The signal tells you *how would I write this if I already knew where I was going?* rather than *which steps actually got me there?* It can reinforce existing patterns but can't tell you what made one blind reasoning attempt succeed where another failed.

GRPO avoids all of this — no teacher, no answer conditioning, just the model's own rollouts. But you pay for that simplicity with sparse rewards: every token in a completion gets the same advantage multiplier, whether it's a critical reasoning step or not. In this regard, DenseR keeps GRPO's simplicity (no teacher, unconditioned rollouts) while adding dense per-token signal.

From Research "On" AI to Research "With" AI

In the past few months, I have experienced a dramatic shift in AI capabilities that has changed how I view AI — from an "instruction-follower" to a "research companion." To me, there is a stark difference between the two. Until recently, I used AI only for code completion, LaTeX help, and implementations of familiar frameworks. In hindsight, those were tasks I could have done myself — AI just made them faster.

But a dramatic shift happened for me last week. As a researcher, I often get random ideas that I can't fully pursue due to time and resource limitations (I'm sure this is common). Some cross my mind and I never act on them. Some seem like garbage within minutes. If an idea still seems noteworthy, I check the literature. Finally, if I'm still motivated, I start building on prior work and develop my approach. If all goes well, my approach works, and I shape it into an academic paper.

However, last week, instead of diving into the literature or cloning an existing repo, I simply described my problem statement and how I thought it could be solved to Claude Opus 4.6 — and the conversation that unfolded was surreal. I received non-trivial feedback about the problem and the proposed solution — the kind I would expect from an experienced researcher. It felt like having a research companion I could bounce anything off of — with no shame in asking “stupid” questions — and I would occasionally learn something new.

Interestingly, most of Claude’s initial answers had issues that caught my eye, and I would push back — sometimes without fully articulating why. Claude would absorb the feedback and make elegant revisions to the proposed solution, driving the research discussion deeper. The entire back-and-forth motivated me to pursue this as a mini-project over the weekend, just to see how far I could push it. What started as a casual question turned into hours of conversation spanning the entire project cycle — from ideation to writing this blog.

At its core, many of the heavy execution tasks were offloaded to Claude while I played the role of expert verifier — babysitting training curves and telling Claude what seemed to work and what didn’t. I wouldn’t say Claude came up with the winning strategy on its own, but it compressed my proof-of-concept timeline from several weeks to several days, without a doubt. The experience was eye-opening enough that I felt compelled to document it. It feels like I have an on-call research buddy.

Acknowledgments

I am grateful to Ashima Suvarna for being a wonderful sounding board and for providing thoughtful feedback on this blog post. She saw how excited I was during all my conversations with Claude and constantly joked that I had been spending too much time with it over the past few days.

Author: Webpage · @hbXNov

Code: Github Repo

Cite

If you find this work useful, please cite:

@article{bansal2026denser,
  title={DenseR: Dense Reward For Free in LLM Reasoning},
  author={Bansal, Hritik},
  year={2026},
  url={https://huggingface.co/blog/hbXNov/denser}
}

References

[1] Hinton et al., "Distilling the Knowledge in a Neural Network", 2015. arXiv:1503.02531

[2] Zelikman et al., "STaR: Bootstrapping Reasoning With Reasoning", 2022. arXiv:2203.14465

[3] Bansal et al., "Smaller, Weaker, Yet Better: Training LLM Reasoners via Compute-Optimal Sampling", 2024. arXiv:2408.16737

[4] Agarwal et al., "On-Policy Distillation of Language Models: Learning from Self-Generated Mistakes", ICLR 2024. arXiv:2306.13649

[5] Thinking Machines Lab, "On-Policy Distillation", 2025. Blog post

[6] Zhao et al., "Self-Distilled Reasoner: On-Policy Self-Distillation for Large Language Models", 2026. arXiv:2601.18734

[7] Shenfeld et al., "Self-Distillation Enables Continual Learning", 2026. arXiv:2601.19897

[8] Shao et al., "DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models", 2024. arXiv:2402.03300

[9] DeepSeek-AI, "DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning", 2025. arXiv:2501.12948
