Improving Data Efficiency for LLM Reinforcement Fine-tuning Through Difficulty-targeted Online Data Selection and Rollout Replay • Paper • 2506.05316 • Published Jun 5, 2025 • 1
rlvr-weak-supervision • Collection • Models from "When Can LLMs Learn to Reason with Weak Supervision?": Llama-3.2-3B with continual pre-training and Thinking SFT. • 3 items • Updated 6 days ago • 1
MiCRo: Mixture Modeling and Context-aware Routing for Personalized Preference Learning • Paper • 2505.24846 • Published May 30, 2025 • 15