Scaling Pedagogical Pre-training to 10 Billion Tokens
New blog post exploring what happens when you take optimal data mixing insights and scale up the data generation itself.
We built Sutra, a multi-stage framework for generating pedagogical pre-training data guided by a knowledge graph of ~2,000 concepts across 9 domains. The pipeline includes structured content generation, six-dimension quality evaluation, diversity management across 20 content styles, and a cleaning stage to prevent collapse.
The result is codelion/sutra-10B, a 10.2 billion token pedagogical dataset with rich metadata (domain, complexity, prerequisites, quality scores) on every entry.
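As a quick sketch of what each entry looks like, here's how you might stream the dataset and inspect its metadata (the exact column names are assumptions based on the fields listed above; check the dataset card for the real schema):

```python
from datasets import load_dataset

# Stream to avoid downloading the full 10.2B-token dataset up front.
ds = load_dataset("codelion/sutra-10B", split="train", streaming=True)

row = next(iter(ds))
print(list(row))                                  # see which metadata columns exist
print(row.get("domain"), row.get("complexity"))   # column names assumed for illustration
```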
We trained codelion/SmolLM2-70M on it for 3 full epochs (30.6B tokens) on a single A10 GPU in ~78 hours.
Key finding: perplexity kept improving across epochs, but benchmark gains plateaued fast. At 70M parameters, the model hits a representational ceiling that more data alone can't break through.
Reverse Engineering a $500M Mystery: From HashHop to Memory-Augmented Language Models
I wrote a deep dive into how Magic AI's 100M token context window might work, starting from their HashHop benchmark and building up to MALM - a Memory-Augmented Language Model.
Key insight: treating each key as a single token enables perfect retrieval at unlimited context lengths.
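To make that concrete, here's a minimal sketch of why single-token keys sidestep the retrieval problem entirely (an illustration of the idea, not MALM's actual implementation):

```python
import secrets

# A HashHop-style context: a chain hash1 -> hash2 -> ... -> hashN.
hashes = [secrets.token_hex(8) for _ in range(100_000)]
chain = dict(zip(hashes, hashes[1:]))

# If each hash is a single token, a hop is an exact lookup keyed on that token:
# O(1), 100% accurate, and independent of how many pairs are "in context".
def hop(key: str, n: int = 2) -> str:
    for _ in range(n):
        key = chain[key]  # multi-hop chains reduce to repeated exact lookups
    return key

assert hop(hashes[0], 2) == hashes[2]
```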
The article covers:
- How HashHop works and why its perfect accuracy is suspicious
- Building a tokenized solver that achieves 100% accuracy
- Scaling to MALM for real code search tasks
- Why this approach could handle 100M+ tokens
Introducing Dhara-70M: A diffusion language model that achieves 3.8x higher throughput than autoregressive models!
Key findings from our research on optimal architectures for small language models:
→ Depth beats width: 32 layers outperform 12 layers at the same parameter count
→ Best-in-class factuality: 47.5% on TruthfulQA
→ 10x training efficiency using WSD (Warmup-Stable-Decay) conversion
→ Canon layers add only 0.13% parameters but improve reasoning
We trained on 1B tokens using the optimal 50-30-20 dataset mix (PDFs + filtered web + educational content), then converted to diffusion with just 100M additional tokens.
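For reference, a WSD learning-rate schedule is simple to express; here's a minimal sketch (the warmup and decay fractions are assumptions, not the values we used):

```python
def wsd_lr(step: int, total_steps: int, peak_lr: float,
           warmup_frac: float = 0.05, decay_frac: float = 0.15) -> float:
    """Warmup-Stable-Decay: linear warmup, long flat plateau, short decay.

    The flat plateau is what makes conversion cheap: you can branch off a
    mid-plateau checkpoint and run only the decay phase on new data.
    """
    warmup_steps = max(int(total_steps * warmup_frac), 1)
    decay_start = int(total_steps * (1 - decay_frac))
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    if step < decay_start:
        return peak_lr
    return peak_lr * (total_steps - step) / (total_steps - decay_start)
```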
Introducing PTS Visualizer - an interactive tool for exploring how language models reason!
Visualize pivotal tokens, thought anchors, and reasoning circuits. See which tokens and sentences significantly impact success probability, explore embedding clusters, and trace reasoning step-by-step.
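Under the hood, the core idea of pivotal-token analysis can be sketched in a few lines (our illustration of the concept, not the Space's code; `estimate_success` stands in for Monte Carlo rollouts from each prefix):

```python
def pivotal_tokens(tokens, estimate_success, threshold=0.2):
    # A token is "pivotal" if appending it shifts the estimated probability
    # of the generation eventually succeeding by more than `threshold`.
    pivots = []
    prev = estimate_success(tokens[:0])  # success probability of the empty prefix
    for i in range(1, len(tokens) + 1):
        cur = estimate_success(tokens[:i])
        if abs(cur - prev) >= threshold:
            pivots.append((i - 1, tokens[i - 1], cur - prev))
        prev = cur
    return pivots
```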
Recently, Essential AI released EssentialAI/rnj-1, a new 8B base model, and highlighted the importance of the data mix for pretraining:

"In the long run, we expect our methods to automatically represent, transform, and blend data to optimize measurable abilities in pre-training. Our work on modeling data taxonomies led to new approaches for jointly clustering and mixing data distributions under data repetition penalties. Many improvements in our STEM abilities can be traced back to this."
I just published Ellora - 6 production-ready LoRA recipes for enhancing LLMs with specific capabilities. Each recipe costs under $100 to run and includes complete training code, data generation, and evaluation.
The 6 Recipes:
- Recipe 1: Accuracy Recovery - Recover 75% of quantization losses with self-distillation
- Recipe 2: Reasoning LoRA - Add structured thinking with GRPO (0% to 60% adoption, 75% quality boost)
- Recipe 3: Tool Calling - Real execution on actual codebases
- Recipe 4: Context Extension - Scale from 32K to 2M tokens (61x increase)
- Recipe 5: Secure Code Generation - 97% vulnerability reduction using automated Semgrep analysis
- Recipe 6: Execution-Aware World Models - Teaching models runtime behavior
Why Recipes? Ellora provides methodologies, not frameworks. Use them with your existing tools (PEFT, LoRAX, vLLM, Unsloth, HuggingFace). Each recipe uses self-supervised data generation (Magpie approach) - no expensive human labeling required.
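For context, the Magpie trick is to give an aligned chat model only its pre-query template so it invents a plausible user instruction, then feed that instruction back to get the paired response. A rough sketch (the model name and template tokens are assumptions for illustration):

```python
from transformers import pipeline

gen = pipeline("text-generation", model="meta-llama/Llama-3.1-8B-Instruct")

# Step 1: feed only the pre-query template; the model "fills in" a user query.
pre = "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n"
instruction = gen(pre, max_new_tokens=64, return_full_text=False)[0]["generated_text"]

# Step 2: feed the generated instruction back to get the paired response.
full = pre + instruction + "<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
response = gen(full, max_new_tokens=256, return_full_text=False)[0]["generated_text"]
print(instruction, response)
```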
All recipes include Jupyter notebooks you can run immediately with clear success metrics.
Introducing OpenEvolve Prompt Optimizer - a Space that automatically evolves and optimizes your prompts using OpenEvolve!
This tool uses OpenEvolve to iteratively improve prompts by testing them on real datasets and evolving better versions. No more manual prompt engineering guesswork - let OpenEvolve find the optimal prompts for you.
How it works:
- Enter your initial prompt using {input} as a placeholder for dataset inputs (see the sketch below)
- Input any HuggingFace dataset name you want to use for optimization
- Specify the dataset split and field names for your use case
- Click Optimize Prompt and the system will validate everything first
- Compare your initial prompt vs the evolved best prompt side-by-side
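The {input} placeholder is substituted with each dataset row before evaluation; here's a quick sketch of that substitution (the dataset and field name are arbitrary examples):

```python
from datasets import load_dataset

prompt = "Summarize the following review in one sentence: {input}"
rows = load_dataset("imdb", split="test[:3]")
for row in rows:
    print(prompt.format(input=row["text"]))
```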
🎯 Introducing Chayan: A Calibrated 4-Model LLM Router Achieving 69% Accuracy on RouterArena
We're excited to share Chayan, a cost-efficient LLM router that intelligently routes queries across 4 models to maximize accuracy while minimizing cost. We just submitted Chayan to the RouterArena leaderboard, where it achieved 69.05% accuracy!
Chayan achieves impressive results on the RouterArena benchmark:
• 69.05% accuracy (would rank #1 on the current leaderboard)
• $0.333 per 1K queries
• +12.07pp improvement over the all-mini baseline (56.98%)
• 99% of perfect 2-model oracle performance at 57% lower cost
Compared to our previous 2-model router (61.43% accuracy), Chayan delivers +7.62pp improvement through smarter 4-model routing.
🧠 How It Works
Chayan uses an Adaptive K-NN classifier with prototype memory to route between 4 models:
• openai/gpt-4o-mini (fast & cheap)
• google/gemini-2.5-flash-lite (balanced)
• google/gemini-2.5-flash (capable)
• openai/gpt-4o (most powerful)
🚀 Getting Started
You can use Chayan directly from HuggingFace:
```python
from adaptive_classifier import AdaptiveClassifier
```
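A minimal routing sketch might look like this (the checkpoint id and output format are assumptions; check the model card for exact usage):

```python
from adaptive_classifier import AdaptiveClassifier

# Load the router; the repo id here is a placeholder for the Chayan checkpoint.
router = AdaptiveClassifier.from_pretrained("adaptive-classifier/chayan")

# The predicted label is the model to send the query to.
predictions = router.predict("Write a proof that sqrt(2) is irrational.")
print(predictions)  # e.g. [("openai/gpt-4o", 0.62), ...] (format assumed)
```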
These samples were created using reservoir sampling - an algorithm that guarantees statistically unbiased random samples from massive source datasets. This means results you get at the 1B token scale are representative of how these datasets behave at 100B+ token scales, letting you iterate quickly without the computational overhead.
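For the curious, the classic single-pass version (Algorithm R) is only a few lines; here's a sketch of the idea (not necessarily the exact implementation we used):

```python
import random

def reservoir_sample(stream, k, seed=42):
    # Algorithm R: keep a uniform random sample of k items from a stream of
    # unknown length in one pass and O(k) memory. After seeing i+1 items,
    # each item has survival probability exactly k / (i + 1).
    rng = random.Random(seed)
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            sample.append(item)
        else:
            j = rng.randint(0, i)  # uniform over [0, i]
            if j < k:
                sample[j] = item
    return sample
```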
The collection includes:
- finePDFs-1B: High-quality textbook-style educational content
- DCLM-baseline-1B: Filtered, diverse web content
- FineWeb-Edu-1B: Curated educational web resources
We used these exact samples to run 50+ systematic experiments on dataset mixing strategies, ultimately discovering that a 50-30-20 mixture of finePDFs + DCLM-baseline + FineWeb-Edu achieves 90%+ of GPT-2's performance with just 1/10th the training data.
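Reproducing that mixture is straightforward with datasets.interleave_datasets; a sketch (the repo ids are placeholders, see the collection for exact names):

```python
from datasets import load_dataset, interleave_datasets

parts = [load_dataset(name, split="train", streaming=True) for name in
         ("codelion/finePDFs-1B", "codelion/DCLM-baseline-1B", "codelion/FineWeb-Edu-1B")]

# Sample from the three sources with the 50-30-20 mixture weights.
mixed = interleave_datasets(parts, probabilities=[0.5, 0.3, 0.2], seed=42)
```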
Whether you're researching optimal data mixtures, testing curriculum learning strategies, or just want to quickly prototype a pretraining run, these samples give you a solid foundation to start experimenting immediately.