Article There is no such thing as a tokenizer-free lunch catherinearnett • Sep 25, 2025 • 98
Article Open Source AI: A Cornerstone of Digital Sovereignty frimelle • Jun 11, 2025 • 20
The Common Pile v0.1: An 8TB Dataset of Public Domain and Openly Licensed Text Paper • 2506.05209 • Published Jun 5, 2025 • 61
Common Pile v0.1 Collection All resources related to Common Pile v0.1, an 8TB dataset of public domain and openly licensed text • 4 items • Updated Jun 6, 2025 • 40
Article Blazingly fast whisper transcriptions with Inference Endpoints +4 mfuntowicz, freddyaboulton, Steveeeeeeen, reach-vb, erikkaum, michellehbn • May 13, 2025 • 82
Article Open-R1: a fully open reproduction of DeepSeek-R1 +1 eliebak, lvwerra, lewtun • Jan 28, 2025 • 889
Article Releasing the largest multilingual open pretraining dataset Pclanglais • Nov 13, 2024 • 107
Article OCR Processing and Text in Image Analysis with Florence-2-base and Qwen2-VL-2B PandorAI1995 • Oct 18, 2024 • 17
HPLT 1.2 Uni-Direction Translation Models Collection HPLT's MT releases. https://github.com/hplt-project/HPLT-MT-Models • 64 items • Updated Mar 2 • 2
OpenCulture Collection A multilingual dataset of public domain books and newspapers. • 25 items • Updated Mar 2 • 134