cool-papers
Unlocking the conversion of Web Screenshots into HTML Code with the WebSight Dataset
Paper
• 2403.09029
• Published
• 56
LLMLingua-2: Data Distillation for Efficient and Faithful Task-Agnostic Prompt Compression
Paper
• 2403.12968
• Published
• 25
RAFT: Adapting Language Model to Domain Specific RAG
Paper
• 2403.10131
• Published
• 72
Quiet-STaR: Language Models Can Teach Themselves to Think Before Speaking
Paper
• 2403.09629
• Published
• 79
Simple and Scalable Strategies to Continually Pre-train Large Language Models
Paper
• 2403.08763
• Published
• 51
Language models scale reliably with over-training and on downstream tasks
Paper
• 2403.08540
• Published
• 15
Algorithmic progress in language models
Paper
• 2403.05812
• Published
• 19
Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context
Paper
• 2403.05530
• Published
• 65
TnT-LLM: Text Mining at Scale with Large Language Models
Paper
• 2403.12173
• Published
• 20
Larimar: Large Language Models with Episodic Memory Control
Paper
• 2403.11901
• Published
• 33
Reverse Training to Nurse the Reversal Curse
Paper
• 2403.13799
• Published
• 13
When Do We Not Need Larger Vision Models?
Paper
• 2403.13043
• Published
• 26
MathVerse: Does Your Multi-modal LLM Truly See the Diagrams in Visual Math Problems?
Paper
• 2403.14624
• Published
• 53
FollowIR: Evaluating and Teaching Information Retrieval Models to Follow Instructions
Paper
• 2403.15246
• Published
• 11
Can large language models explore in-context?
Paper
• 2403.15371
• Published
• 33
The Unreasonable Ineffectiveness of the Deeper Layers
Paper
• 2403.17887
• Published
• 82
Long-form factuality in large language models
Paper
• 2403.18802
• Published
• 26
BioMedLM: A 2.7B Parameter Language Model Trained On Biomedical Text
Paper
• 2403.18421
• Published
• 23
Octopus v2: On-device language model for super agent
Paper
• 2404.01744
• Published
• 58
Poro 34B and the Blessing of Multilinguality
Paper
• 2404.01856
• Published
• 15
Long-context LLMs Struggle with Long In-context Learning
Paper
• 2404.02060
• Published
• 37
Training LLMs over Neurally Compressed Text
Paper
• 2404.03626
• Published
• 23
CodeEditorBench: Evaluating Code Editing Capability of Large Language Models
Paper
• 2404.03543
• Published
• 18
Language Models as Compilers: Simulating Pseudocode Execution Improves Algorithmic Reasoning in Language Models
Paper
• 2404.02575
• Published
• 50
Toward Self-Improvement of LLMs via Imagination, Searching, and Criticizing
Paper
• 2404.12253
• Published
• 55
Compression Represents Intelligence Linearly
Paper
• 2404.09937
• Published
• 28
Flamingo: a Visual Language Model for Few-Shot Learning
Paper
• 2204.14198
• Published
• 16
Executable Code Actions Elicit Better LLM Agents
Paper
• 2402.01030
• Published
• 188
Buffer of Thoughts: Thought-Augmented Reasoning with Large Language Models
Paper
• 2406.04271
• Published
• 29
To Believe or Not to Believe Your LLM
Paper
• 2406.02543
• Published
• 35
Husky: A Unified, Open-Source Language Agent for Multi-Step Reasoning
Paper
• 2406.06469
• Published
• 29
HuatuoGPT-Vision, Towards Injecting Medical Visual Knowledge into Multimodal LLMs at Scale
Paper
• 2406.19280
• Published
• 63
Summary of a Haystack: A Challenge to Long-Context LLMs and RAG Systems
Paper
• 2407.01370
• Published
• 89
Planetarium: A Rigorous Benchmark for Translating Text to Structured Planning Languages
Paper
• 2407.03321
• Published
• 20
Training Task Experts through Retrieval Based Distillation
Paper
• 2407.05463
• Published
• 10
InverseCoder: Unleashing the Power of Instruction-Tuned Code LLMs with Inverse-Instruct
Paper
• 2407.05700
• Published
• 14
AssistantBench: Can Web Agents Solve Realistic and Time-Consuming Tasks?
Paper
• 2407.15711
• Published
• 9
MindSearch: Mimicking Human Minds Elicits Deep AI Searcher
Paper
• 2407.20183
• Published
• 43
OmniParser for Pure Vision Based GUI Agent
Paper
• 2408.00203
• Published
• 24
Amuro & Char: Analyzing the Relationship between Pre-Training and Fine-Tuning of Large Language Models
Paper
• 2408.06663
• Published
• 16
To Code, or Not To Code? Exploring Impact of Code in Pre-training
Paper
• 2408.10914
• Published
• 45
Building and better understanding vision-language models: insights and future directions
Paper
• 2408.12637
• Published
• 133
MME-RealWorld: Could Your Multimodal LLM Challenge High-Resolution Real-World Scenarios that are Difficult for Humans?
Paper
• 2408.13257
• Published
• 26
General OCR Theory: Towards OCR-2.0 via a Unified End-to-end Model
Paper
• 2409.01704
• Published
• 83
ContextCite: Attributing Model Generation to Context
Paper
• 2409.00729
• Published
• 14
Attention Heads of Large Language Models: A Survey
Paper
• 2409.03752
• Published
• 92
A Controlled Study on Long Context Extension and Generalization in LLMs
Paper
• 2409.12181
• Published
• 45
To CoT or not to CoT? Chain-of-thought helps mainly on math and symbolic reasoning
Paper
• 2409.12183
• Published
• 39
Attention Prompting on Image for Large Vision-Language Models
Paper
• 2409.17143
• Published
• 7
Can Models Learn Skill Composition from Examples?
Paper
• 2409.19808
• Published
• 9
Law of the Weakest Link: Cross Capabilities of Large Language Models
Paper
• 2409.19951
• Published
• 54
LLaVA-Critic: Learning to Evaluate Multimodal Models
Paper
• 2410.02712
• Published
• 37
From Medprompt to o1: Exploration of Run-Time Strategies for Medical Challenge Problems and Beyond
Paper
• 2411.03590
• Published
• 10
Autoregressive Models in Vision: A Survey
Paper
• 2411.05902
• Published
• 19
Stronger Models are NOT Stronger Teachers for Instruction Tuning
Paper
• 2411.07133
• Published
• 38
ReFocus: Visual Editing as a Chain of Thought for Structured Image Understanding
Paper
• 2501.05452
• Published
• 15
Demystifying Domain-adaptive Post-training for Financial LLMs
Paper
• 2501.04961
• Published
• 11
OS-Genesis: Automating GUI Agent Trajectory Construction via Reverse Task Synthesis
Paper
• 2412.19723
• Published
• 87
Are VLMs Ready for Autonomous Driving? An Empirical Study from the Reliability, Data, and Metric Perspectives
Paper
• 2501.04003
• Published
• 27
MLLM-as-a-Judge for Image Safety without Human Labeling
Paper
• 2501.00192
• Published
• 31
Multiple Choice Questions: Reasoning Makes Large Language Models (LLMs) More Self-Confident Even When They Are Wrong
Paper
• 2501.09775
• Published
• 32
VideoWorld: Exploring Knowledge Learning from Unlabeled Videos
Paper
• 2501.09781
• Published
• 27
CodeMonkeys: Scaling Test-Time Compute for Software Engineering
Paper
• 2501.14723
• Published
• 10
s1: Simple test-time scaling
Paper
• 2501.19393
• Published
• 124
LIMO: Less is More for Reasoning
Paper
• 2502.03387
• Published
• 62
Can LLMs Maintain Fundamental Abilities under KV Cache Compression?
Paper
• 2502.01941
• Published
• 14
Demystifying Long Chain-of-Thought Reasoning in LLMs
Paper
• 2502.03373
• Published
• 58
The Jumping Reasoning Curve? Tracking the Evolution of Reasoning Performance in GPT-[n] and o-[n] Models on Multimodal Puzzles
Paper
• 2502.01081
• Published
• 13
Great Models Think Alike and this Undermines AI Oversight
Paper
• 2502.04313
• Published
• 33
Sparse Autoencoders for Scientifically Rigorous Interpretation of Vision Models
Paper
• 2502.06755
• Published
• 8
Forget What You Know about LLMs Evaluations - LLMs are Like a Chameleon
Paper
• 2502.07445
• Published
• 11
NoLiMa: Long-Context Evaluation Beyond Literal Matching
Paper
• 2502.05167
• Published
• 15
From RAG to Memory: Non-Parametric Continual Learning for Large Language Models
Paper
• 2502.14802
• Published
• 13
Is That Your Final Answer? Test-Time Scaling Improves Selective Question Answering
Paper
• 2502.13962
• Published
• 28
Small Models Struggle to Learn from Strong Reasoners
Paper
• 2502.12143
• Published
• 39
Revisiting the Test-Time Scaling of o1-like Models: Do they Truly Possess Test-Time Scaling Capabilities?
Paper
• 2502.12215
• Published
• 16
How Do LLMs Acquire New Knowledge? A Knowledge Circuits Perspective on Continual Pre-Training
Paper
• 2502.11196
• Published
• 23
Ask in Any Modality: A Comprehensive Survey on Multimodal Retrieval-Augmented Generation
Paper
• 2502.08826
• Published
• 17
Intuitive physics understanding emerges from self-supervised pretraining on natural videos
Paper
• 2502.11831
• Published
• 20
Show Me the Work: Fact-Checkers' Requirements for Explainable Automated Fact-Checking
Paper
• 2502.09083
• Published
• 4
MLGym: A New Framework and Benchmark for Advancing AI Research Agents
Paper
• 2502.14499
• Published
• 194
The Danger of Overthinking: Examining the Reasoning-Action Dilemma in Agentic Tasks
Paper
• 2502.08235
• Published
• 59
VLM^2-Bench: A Closer Look at How Well VLMs Implicitly Link Explicit Matching Visual Cues
Paper
• 2502.12084
• Published
• 33
MedHallu: A Comprehensive Benchmark for Detecting Medical Hallucinations in Large Language Models
Paper
• 2502.14302
• Published
• 9
Can Large Language Models Detect Errors in Long Chain-of-Thought Reasoning?
Paper
• 2502.19361
• Published
• 28
VEM: Environment-Free Exploration for Training GUI Agent with Value Environment Model
Paper
• 2502.18906
• Published
• 12
The Lottery LLM Hypothesis, Rethinking What Abilities Should LLM Compression Preserve?
Paper
• 2502.17535
• Published
• 8
MLLMs Know Where to Look: Training-free Perception of Small Visual Details with Multimodal LLMs
Paper
• 2502.17422
• Published
• 7
Chain of Draft: Thinking Faster by Writing Less
Paper
• 2502.18600
• Published
• 50
AppAgentX: Evolving GUI Agents as Proficient Smartphone Users
Paper
• 2503.02268
• Published
• 11
LINGOLY-TOO: Disentangling Memorisation from Reasoning with Linguistic Templatisation and Orthographic Obfuscation
Paper
• 2503.02972
• Published
• 25
How to Steer LLM Latents for Hallucination Detection?
Paper
• 2503.01917
• Published
• 11
On the Acquisition of Shared Grammatical Representations in Bilingual Language Models
Paper
• 2503.03962
• Published
• 4
Where do Large Vision-Language Models Look at when Answering Questions?
Paper
• 2503.13891
• Published
• 8
Can Large Vision Language Models Read Maps Like a Human?
Paper
• 2503.14607
• Published
• 10
UI-R1: Enhancing Action Prediction of GUI Agents by Reinforcement Learning
Paper
• 2503.21620
• Published
• 62
I Have Covered All the Bases Here: Interpreting Reasoning Features in Large Language Models via Sparse Autoencoders
Paper
• 2503.18878
• Published
• 119
Inference-Time Scaling for Complex Tasks: Where We Stand and What Lies Ahead
Paper
• 2504.00294
• Published
• 10
What Makes a Good Natural Language Prompt?
Paper
• 2506.06950
• Published
• 11
Hidden in plain sight: VLMs overlook their visual representations
Paper
• 2506.08008
• Published
• 7
LiveCodeBench Pro: How Do Olympiad Medalists Judge LLMs in Competitive Programming?
Paper
• 2506.11928
• Published
• 24
Reinforcement Learning with Verifiable Rewards Implicitly Incentivizes Correct Reasoning in Base LLMs
Paper
• 2506.14245
• Published
• 45
Does Math Reasoning Improve General LLM Capabilities? Understanding Transferability of LLM Reasoning
Paper
• 2507.00432
• Published
• 79
MMBench-GUI: Hierarchical Multi-Platform Evaluation Framework for GUI Agents
Paper
• 2507.19478
• Published
• 32
Attention Basin: Why Contextual Position Matters in Large Language Models
Paper
• 2508.05128
• Published
• 4
Don't Overthink It: A Survey of Efficient R1-style Large Reasoning Models
Paper
• 2508.02120
• Published
• 20
Paper
• 2508.11737
• Published
• 112
UItron: Foundational GUI Agent with Advanced Perception and Planning
Paper
• 2508.21767
• Published
• 12
What Characterizes Effective Reasoning? Revisiting Length, Review, and Structure of CoT
Paper
• 2509.19284
• Published
• 23
Video models are zero-shot learners and reasoners
Paper
• 2509.20328
• Published
• 100
OSWorld-MCP: Benchmarking MCP Tool Invocation In Computer-Use Agents
Paper
• 2510.24563
• Published
• 23
Mobile-Agent-v3.5: Multi-platform Fundamental GUI Agents
Paper
• 2602.16855
• Published
• 40
On the Mechanism and Dynamics of Modular Addition: Fourier Features, Lottery Ticket, and Grokking
Paper
• 2602.16849
• Published
• 6