the core of this paper is about moving from “guessing code” to a structured agentic RL system that actually uses developer tools to self-correct
RFT (Rejection Fine-Tuning) and the warm-up
the authors found that starting RL from scratch is unstable because the model does not initially know how to use the agentic tools like the profiler or shell
the warm-up:
- first perform single-turn PPO on the base model to stabilize it
- collect successful trajectories from this model
RFT:
- take these successful trajectories, the ones that actually solved the task, and use them to fine-tune both the actor and the critic
- this seeds the model with the knowledge of what a successful tool-use session looks like before the large PPO training begins
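a minimal sketch of the rejection step, assuming trajectories carry a terminal reward (the field names and the success threshold are my assumptions, not from the paper):

```python
def rejection_finetune_dataset(trajectories, success_threshold=1.0):
    """Keep only trajectories whose final reward says the task was
    actually solved; these become fine-tuning data for actor and critic."""
    kept = [t for t in trajectories if t["reward"] >= success_threshold]
    # flatten each surviving trajectory into (prompt, tool-use transcript) pairs
    return [(t["prompt"], t["transcript"]) for t in kept]

# toy usage: two trajectories, only the solved one survives the filter
trajs = [
    {"prompt": "fuse kernel", "transcript": "...profile, rewrite...", "reward": 1.0},
    {"prompt": "fuse kernel", "transcript": "...compile error...",    "reward": 0.0},
]
data = rejection_finetune_dataset(trajs)
print(len(data))  # 1
```

the point is that the SFT corpus contains only complete, successful tool-use sessions, so PPO starts from a policy that already knows what a good session looks like.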
performance and metrics
the paper evaluates on KernelBench, which splits tasks into three difficulty levels. they measure not only whether the code runs but whether it is faster than the industry-standard baselines
- faster rate vs. torch.compile: the percentage of generated kernels that beat torch.compile
- levels 1 and 2: 100% of their generated kernels beat the compiler
- level 3, the hardest: 92% of their kernels beat the compiler
- speed-up vs. eager: a 12.0x speed-up (geometric mean) compared to standard PyTorch eager mode on complex tasks
- on the toughest level-3 tasks, it beats Claude Opus 4.5 and Gemini 3 Pro by about 40% in optimization success rate
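the 12.0x figure is a geometric mean, which is the right average for speed-up ratios because they compose multiplicatively; a quick illustration (the speed-up values here are made up):

```python
import math

def geometric_mean(speedups):
    # nth root of the product, computed in log space for numerical stability
    return math.exp(sum(math.log(s) for s in speedups) / len(speedups))

print(geometric_mean([4.0, 9.0]))  # 6.0 — the arithmetic mean would report 6.5
```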

agentic workflow
instead of one-shot generation, the model follows a structured workflow. it is trained to:
- analyze input shapes and memory patterns
- profile code and find the real bottleneck, such as global memory versus shared memory
- verify results with automated tests so there are no silent numerical errors
- optimize iteratively until it reaches a performance target of at least 5% faster than the baseline
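the steps above can be sketched as a loop; the `profile`, `verify`, and `rewrite` helpers are hypothetical stand-ins for the paper's actual tools:

```python
TARGET = 1.05  # stop once the kernel is at least 5% faster than the baseline

def optimize(kernel, baseline_time, profile, rewrite, verify, max_turns=8):
    """Iteratively profile, rewrite, and verify until the speed-up target is met."""
    for _ in range(max_turns):
        report = profile(kernel)             # e.g. global vs. shared memory traffic
        candidate = rewrite(kernel, report)  # model proposes a targeted fix
        if not verify(candidate):            # automated tests: no silent numerics bugs
            continue                         # reject the incorrect rewrite, try again
        kernel = candidate
        if baseline_time / profile(kernel)["time"] >= TARGET:
            return kernel                    # target met
    return kernel                            # best verified version found so far
```

note that verification gates every accepted rewrite, so an incorrect but fast kernel can never replace a correct one.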
why it matters
most coding LLMs are stochastic parrots for syntax. this new approach is closer to a reasoning engine for hardware: it doesn't just know CUDA syntax, it learns how to navigate GPU architecture by observing the results of its own experiments in a sandbox
resources