DeepSeekMath-V2: Towards Self-Verifiable Mathematical Reasoning • arXiv:2511.22570 • Published Nov 27, 2025
EAPO: Enhancing Policy Optimization with On-Demand Expert Assistance • arXiv:2509.23730 • Published Sep 28, 2025
ShoppingComp: Are LLMs Really Ready for Your Shopping Cart? • arXiv:2511.22978 • Published Nov 28, 2025
ReportBench: Evaluating Deep Research Agents via Academic Survey Tasks • arXiv:2508.15804 • Published Aug 14, 2025
PaperBench: Evaluating AI's Ability to Replicate AI Research • arXiv:2504.01848 • Published Apr 2, 2025
Exploring Data Scaling Trends and Effects in Reinforcement Learning from Human Feedback • arXiv:2503.22230 • Published Mar 28, 2025