- Elastic Attention: Test-time Adaptive Sparsity Ratios for Efficient Transformers (arXiv:2601.17367, published Jan 24)
- Small-scale proxies for large-scale Transformer training instabilities (arXiv:2309.14322, published Sep 25, 2023)
- Decouple Searching from Training: Scaling Data Mixing via Model Merging for Large Language Model Pre-training (arXiv:2602.00747, published Jan 31)
- HySparse: A Hybrid Sparse Attention Architecture with Oracle Token Selection and KV Cache Sharing (arXiv:2602.03560, published Feb 3)
- FineRMoE: Dimension Expansion for Finer-Grained Expert with Its Upcycling Approach (arXiv:2603.13364, published 18 days ago)
- When Does Sparsity Mitigate the Curse of Depth in LLMs (arXiv:2603.15389, published 10 days ago)
- Attention Sinks Are Provably Necessary in Softmax Transformers: Evidence from Trigger-Conditional Tasks (arXiv:2603.11487, published 15 days ago)