arxiv:2512.09196

TritonForge: Profiling-Guided Framework for Automated Triton Kernel Optimization

Published on Dec 9

Authors:

Abstract

High-performance GPU kernel optimization remains a critical yet labor-intensive task in modern machine learning workloads. Although Triton, a domain-specific language for GPU programming, enables developers to write efficient kernels with concise code, achieving expert-level performance still requires deep understanding of GPU architectures and low-level performance trade-offs. We present TritonForge, a profiling-guided framework for automated Triton kernel optimization. TritonForge integrates kernel analysis, runtime profiling, and iterative code transformation to streamline the optimization process. By incorporating feedback from profiling results, the system identifies performance bottlenecks, proposes targeted code modifications, and evaluates their impact automatically. Across diverse kernel types, TritonForge achieves up to 5x performance improvement over baseline implementations and on average 1.76x of the cases are successful, providing a foundation for future research in automated GPU performance optimization.

View arXiv page View PDF Add to collection

Community

Upload images, audio, and videos by dragging in the text input, pasting, or clicking here.

Tap or paste here to upload images

· Sign up or log in to comment

Upvote

Models citing this paper 0

No model linking this paper

Cite arxiv.org/abs/2512.09196 in a model README.md to link it from this page.

Datasets citing this paper 0

No dataset linking this paper

Cite arxiv.org/abs/2512.09196 in a dataset README.md to link it from this page.

Spaces citing this paper 0

No Space linking this paper

Cite arxiv.org/abs/2512.09196 in a Space README.md to link it from this page.

Collections including this paper 0

No Collection including this paper

Add this paper to a collection to link it from this page.