GDPO: Group reward-Decoupled Normalization Policy Optimization for Multi-reward RL Optimization Paper • 2601.05242 • Published Jan 8 • 228
Tower+: Bridging Generality and Translation Specialization in Multilingual LLMs Paper • 2506.17080 • Published Jun 20, 2025 • 7
ITA-Bench: Italian Benchmarks for LLMs Collection A collection of Italian benchmarks for Large Language Models. See also our GitHub repo: https://github.com/SapienzaNLP/ita-bench • 22 items • Updated 2 days ago • 8
LiteraryQA: Towards Effective Evaluation of Long-document Narrative QA Paper • 2510.13494 • Published Oct 15, 2025 • 2
BeyondWeb: Lessons from Scaling Synthetic Data for Trillion-scale Pretraining Paper • 2508.10975 • Published Aug 14, 2025 • 60
Right Answer, Wrong Score: Uncovering the Inconsistencies of LLM Evaluation in Multiple-Choice Question Answering Paper • 2503.14996 • Published Mar 19, 2025 • 3
ZEBRA: Zero-Shot Example-Based Retrieval Augmentation for Commonsense Question Answering Paper • 2410.05077 • Published Oct 7, 2024 • 5
Reward Bench 2 Collection Datasets, spaces, and models for Reward Bench 2 benchmark and paper! • 11 items • Updated Dec 23, 2025 • 16
CLIPPER Collection Models and datasets for CLIPPER: Compression enables long-context synthetic data generation • 7 items • Updated Oct 3, 2025 • 5
MMTEB: Massive Multilingual Text Embedding Benchmark Paper • 2502.13595 • Published Feb 19, 2025 • 45
Truth or Mirage? Towards End-to-End Factuality Evaluation with LLM-OASIS Paper • 2411.19655 • Published Nov 29, 2024 • 20
Minerva LLMs Collection The first family of LLMs pretrained from scratch on Italian. • 6 items • Updated Dec 7, 2024 • 40