To read
• Mamba: Linear-Time Sequence Modeling with Selective State Spaces (arXiv:2312.00752)
• Elucidating the Design Space of Diffusion-Based Generative Models (arXiv:2206.00364)
• GLU Variants Improve Transformer (arXiv:2002.05202)
• StarCoder 2 and The Stack v2: The Next Generation (arXiv:2402.19173)
• GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection (arXiv:2403.03507)
• DocLLM: A layout-aware generative language model for multimodal document understanding (arXiv:2401.00908)
• Mixtral of Experts (arXiv:2401.04088)
• Your Transformer is Secretly Linear (arXiv:2405.12250)
• LoRA Land: 310 Fine-tuned LLMs that Rival GPT-4, A Technical Report (arXiv:2405.00732)
• Chameleon: Mixed-Modal Early-Fusion Foundation Models (arXiv:2405.09818)
• Leave No Context Behind: Efficient Infinite Context Transformers with Infini-attention (arXiv:2404.07143)
• Design2Code: How Far Are We From Automating Front-End Engineering? (arXiv:2403.03163)
• Gemma: Open Models Based on Gemini Research and Technology (arXiv:2403.08295)
• Longformer: The Long-Document Transformer (arXiv:2004.05150)
• WARP: On the Benefits of Weight Averaged Rewarded Policies (arXiv:2406.16768)
• DeBERTa: Decoding-enhanced BERT with Disentangled Attention (arXiv:2006.03654)
• DeBERTaV3: Improving DeBERTa using ELECTRA-Style Pre-Training with Gradient-Disentangled Embedding Sharing (arXiv:2111.09543)
• Direct Preference Optimization: Your Language Model is Secretly a Reward Model (arXiv:2305.18290)
• The Prompt Report: A Systematic Survey of Prompting Techniques (arXiv:2406.06608)
• SimPO: Simple Preference Optimization with a Reference-Free Reward (arXiv:2405.14734)
• MEDIC: Towards a Comprehensive Framework for Evaluating LLMs in Clinical Applications (arXiv:2409.07314)
• Qwen2.5-Coder Technical Report (arXiv:2409.12186)
• GLaM: Efficient Scaling of Language Models with Mixture-of-Experts (arXiv:2112.06905)
• Qwen3 Technical Report (arXiv:2505.09388)
• Wan: Open and Advanced Large-Scale Video Generative Models (arXiv:2503.20314)
• Scalable Diffusion Models with Transformers (arXiv:2212.09748)
• UniPC: A Unified Predictor-Corrector Framework for Fast Sampling of Diffusion Models (arXiv:2302.04867)
• MiniMax-M1: Scaling Test-Time Compute Efficiently with Lightning Attention (arXiv:2506.13585)
• Lightning Attention-2: A Free Lunch for Handling Unlimited Sequence Lengths in Large Language Models (arXiv:2401.04658)
• Scaling TransNormer to 175 Billion Parameters (arXiv:2307.14995)
• OmniGen2: Exploration to Advanced Multimodal Generation (arXiv:2506.18871)
• Why do LLMs attend to the first token? (arXiv:2504.02732)
• Qwen-Image Technical Report (arXiv:2508.02324)
• Efficient Estimation of Word Representations in Vector Space (arXiv:1301.3781)
• Titans: Learning to Memorize at Test Time (arXiv:2501.00663)
• It's All Connected: A Journey Through Test-Time Memorization, Attentional Bias, Retention, and Online Optimization (arXiv:2504.13173)