To read
• Mamba: Linear-Time Sequence Modeling with Selective State Spaces (arXiv:2312.00752)
• Elucidating the Design Space of Diffusion-Based Generative Models (arXiv:2206.00364)
• GLU Variants Improve Transformer (arXiv:2002.05202)
• StarCoder 2 and The Stack v2: The Next Generation (arXiv:2402.19173)
• GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection (arXiv:2403.03507)
• DocLLM: A layout-aware generative language model for multimodal document understanding (arXiv:2401.00908)
• Mixtral of Experts (arXiv:2401.04088)
• Your Transformer is Secretly Linear (arXiv:2405.12250)
• LoRA Land: 310 Fine-tuned LLMs that Rival GPT-4, A Technical Report (arXiv:2405.00732)
• Chameleon: Mixed-Modal Early-Fusion Foundation Models (arXiv:2405.09818)
• Leave No Context Behind: Efficient Infinite Context Transformers with Infini-attention (arXiv:2404.07143)
• Design2Code: How Far Are We From Automating Front-End Engineering? (arXiv:2403.03163)
• Gemma: Open Models Based on Gemini Research and Technology (arXiv:2403.08295)
• Longformer: The Long-Document Transformer (arXiv:2004.05150)
• WARP: On the Benefits of Weight Averaged Rewarded Policies (arXiv:2406.16768)
• DeBERTa: Decoding-enhanced BERT with Disentangled Attention (arXiv:2006.03654)
• DeBERTaV3: Improving DeBERTa using ELECTRA-Style Pre-Training with Gradient-Disentangled Embedding Sharing (arXiv:2111.09543)
• Direct Preference Optimization: Your Language Model is Secretly a Reward Model (arXiv:2305.18290)
• The Prompt Report: A Systematic Survey of Prompting Techniques (arXiv:2406.06608)
• SimPO: Simple Preference Optimization with a Reference-Free Reward (arXiv:2405.14734)
• MEDIC: Towards a Comprehensive Framework for Evaluating LLMs in Clinical Applications (arXiv:2409.07314)
• Qwen2.5-Coder Technical Report (arXiv:2409.12186)
• GLaM: Efficient Scaling of Language Models with Mixture-of-Experts (arXiv:2112.06905)
• Qwen3 Technical Report (arXiv:2505.09388)
• Wan: Open and Advanced Large-Scale Video Generative Models (arXiv:2503.20314)
• Scalable Diffusion Models with Transformers (arXiv:2212.09748)
• UniPC: A Unified Predictor-Corrector Framework for Fast Sampling of Diffusion Models (arXiv:2302.04867)
• MiniMax-M1: Scaling Test-Time Compute Efficiently with Lightning Attention (arXiv:2506.13585)
• Lightning Attention-2: A Free Lunch for Handling Unlimited Sequence Lengths in Large Language Models (arXiv:2401.04658)
• Scaling TransNormer to 175 Billion Parameters (arXiv:2307.14995)
• OmniGen2: Exploration to Advanced Multimodal Generation (arXiv:2506.18871)
• Why do LLMs attend to the first token? (arXiv:2504.02732)
• Qwen-Image Technical Report (arXiv:2508.02324)
• Efficient Estimation of Word Representations in Vector Space (arXiv:1301.3781)
• Titans: Learning to Memorize at Test Time (arXiv:2501.00663)
• It's All Connected: A Journey Through Test-Time Memorization, Attentional Bias, Retention, and Online Optimization (arXiv:2504.13173)