Introducing Blaze — Phase 0
I recently started a new project called Blaze, a set of hand-written CUDA kernels targeting the NVIDIA B200 GPU (SM100 / Blackwell) for LLM inference. No frameworks, no wrappers — just CUDA C++ with inline PTX for Blackwell’s new tcgen05 instructions. The goal is end-to-end Llama-7B text generation using only custom kernels that exploit Blackwell’s new hardware primitives: tcgen05 tensor cores, Tensor Memory (TMEM), and the Tensor Memory Accelerator (TMA). This post covers Phase 0: hardware bringup on a B200 GPU. ...
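As a flavor of what bringup looks like, the very first sanity check is confirming the toolchain actually sees a Blackwell part. A minimal sketch (assuming the B200 reports compute capability 10.0, i.e. SM100, via the standard CUDA runtime API) might be:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    // Query device 0; fails if no CUDA-capable GPU is visible.
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) {
        fprintf(stderr, "no CUDA device found\n");
        return 1;
    }
    printf("device: %s, compute capability %d.%d\n",
           prop.name, prop.major, prop.minor);
    // Assumption: B200 (Blackwell) reports major == 10 (SM100).
    // tcgen05 / TMEM / TMA paths are gated on this architecture.
    if (prop.major != 10) {
        fprintf(stderr, "warning: not an SM100-class GPU\n");
    }
    return 0;
}
```

Compiled with something like `nvcc -arch=sm_100a`, this is just a smoke test; the real bringup work in the rest of the post builds on top of it.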