the core of this paper is about moving from “guessing code” to a structured agentic RL system that actually uses developer tools to self-correct
RFT (Rejection Fine-Tuning) and the warm-up
the authors found that starting RL from scratch is unstable because the model does not initially know how to use the agentic tools like the profiler or shell
the warm-up:
- first perform single-turn PPO on the base model to stabilize it
- collect successful trajectories from this model
RFT:
- take these successful trajectories, the ones that actually solved the task, and use them to fine-tune both the actor and the critic
- this seeds the model with the knowledge of what a successful tool-use session looks like before the large PPO training begins
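a minimal sketch of the rejection step, assuming trajectories carry a terminal reward (the field names and the success threshold are my assumptions, not from the paper):

```python
def rejection_finetune_dataset(trajectories, success_threshold=1.0):
    """Keep only trajectories whose final reward says the task was
    actually solved; these become fine-tuning data for actor and critic."""
    kept = [t for t in trajectories if t["reward"] >= success_threshold]
    # flatten each surviving trajectory into (prompt, tool-use transcript) pairs
    return [(t["prompt"], t["transcript"]) for t in kept]

# toy usage: two trajectories, only the solved one survives the filter
trajs = [
    {"prompt": "fuse kernel", "transcript": "...profile, rewrite...", "reward": 1.0},
    {"prompt": "fuse kernel", "transcript": "...compile error...",    "reward": 0.0},
]
data = rejection_finetune_dataset(trajs)
print(len(data))  # 1
```

the point is that the SFT corpus contains only complete, successful tool-use sessions, so PPO starts from a policy that already knows what a good session looks like.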
performance and metrics
the paper evaluates on KernelBench, which splits tasks into three difficulty levels. they measure not only whether the code runs but whether it is faster than the industry-standard baselines
- faster rate vs. torch.compile: the percentage of generated kernels that beat torch.compile
- levels 1 and 2: 100% of their generated kernels beat the compiler
- level 3, the hardest: 92% of their kernels beat the compiler
- speed-up vs. eager: a 12.0x speed-up (geometric mean) compared to standard PyTorch eager mode on complex tasks
- on the toughest level-3 tasks, it beats Claude Opus 4.5 and Gemini 3 Pro by about 40% in optimization success rate
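the 12.0x figure is a geometric mean, which is the right average for speed-up ratios because they compose multiplicatively; a quick illustration (the speed-up values here are made up):

```python
import math

def geometric_mean(speedups):
    # nth root of the product, computed in log space for numerical stability
    return math.exp(sum(math.log(s) for s in speedups) / len(speedups))

print(geometric_mean([4.0, 9.0]))  # 6.0 — the arithmetic mean would report 6.5
```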

agentic workflow
instead of one-shot generation, the model follows a structured workflow. it is trained to:
- analyze input shapes and memory patterns
- profile code and find the real bottleneck, such as global memory versus shared memory
- verify results with automated tests so there are no silent numerical errors
- optimize iteratively until it reaches a performance target of at least 5% faster than the baseline
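the steps above can be sketched as a loop; the `profile`, `verify`, and `rewrite` helpers are hypothetical stand-ins for the paper's actual tools:

```python
TARGET = 1.05  # stop once the kernel is at least 5% faster than the baseline

def optimize(kernel, baseline_time, profile, rewrite, verify, max_turns=8):
    """Iteratively profile, rewrite, and verify until the speed-up target is met."""
    for _ in range(max_turns):
        report = profile(kernel)             # e.g. global vs. shared memory traffic
        candidate = rewrite(kernel, report)  # model proposes a targeted fix
        if not verify(candidate):            # automated tests: no silent numerics bugs
            continue                         # reject the incorrect rewrite, try again
        kernel = candidate
        if baseline_time / profile(kernel)["time"] >= TARGET:
            return kernel                    # target met
    return kernel                            # best verified version found so far
```

note that verification gates every accepted rewrite, so an incorrect but fast kernel can never replace a correct one.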
why it matters
most coding LLMs are stochastic parrots for syntax. this new approach is closer to a reasoning engine for hardware: it doesn't just know CUDA syntax, it learns how to navigate GPU architecture by observing the results of its own experiments in a sandbox
resources