<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:googleplay="http://www.google.com/schemas/play-podcasts/1.0"><channel><title><![CDATA[Kernel Space]]></title><description><![CDATA[Substack about GPU and system programming]]></description><link>https://kernelspace.substack.com</link><image><url>https://substackcdn.com/image/fetch/$s_!sK0f!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F330a19d8-f204-4f07-b080-08941a3f97f0_1856x1856.png</url><title>Kernel Space</title><link>https://kernelspace.substack.com</link></image><generator>Substack</generator><lastBuildDate>Mon, 27 Apr 2026 04:04:44 GMT</lastBuildDate><atom:link href="https://kernelspace.substack.com/feed" rel="self" type="application/rss+xml"/><copyright><![CDATA[Dmitry Trifonov]]></copyright><language><![CDATA[en]]></language><webMaster><![CDATA[kernelspace@substack.com]]></webMaster><itunes:owner><itunes:email><![CDATA[kernelspace@substack.com]]></itunes:email><itunes:name><![CDATA[Dmitry Trifonov]]></itunes:name></itunes:owner><itunes:author><![CDATA[Dmitry Trifonov]]></itunes:author><googleplay:owner><![CDATA[kernelspace@substack.com]]></googleplay:owner><googleplay:email><![CDATA[kernelspace@substack.com]]></googleplay:email><googleplay:author><![CDATA[Dmitry Trifonov]]></googleplay:author><itunes:block><![CDATA[Yes]]></itunes:block><item><title><![CDATA[Surfacing a 60% performance bug in cuBLAS]]></title><description><![CDATA[Broken FP32 SGEMM dispatcher on RTX GPUs]]></description><link>https://kernelspace.substack.com/p/surfacing-a-60-performance-bug-in</link><guid isPermaLink="false">https://kernelspace.substack.com/p/surfacing-a-60-performance-bug-in</guid><dc:creator><![CDATA[Dmitry 
Trifonov]]></dc:creator><pubDate>Thu, 09 Apr 2026 22:11:35 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!swH7!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc024d326-769a-44e2-9fc2-690f61c7e0a1_1280x698.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!swH7!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc024d326-769a-44e2-9fc2-690f61c7e0a1_1280x698.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!swH7!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc024d326-769a-44e2-9fc2-690f61c7e0a1_1280x698.jpeg 424w, https://substackcdn.com/image/fetch/$s_!swH7!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc024d326-769a-44e2-9fc2-690f61c7e0a1_1280x698.jpeg 848w, https://substackcdn.com/image/fetch/$s_!swH7!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc024d326-769a-44e2-9fc2-690f61c7e0a1_1280x698.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!swH7!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc024d326-769a-44e2-9fc2-690f61c7e0a1_1280x698.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!swH7!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc024d326-769a-44e2-9fc2-690f61c7e0a1_1280x698.jpeg" width="1280" height="698" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c024d326-769a-44e2-9fc2-690f61c7e0a1_1280x698.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:698,&quot;width&quot;:1280,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!swH7!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc024d326-769a-44e2-9fc2-690f61c7e0a1_1280x698.jpeg 424w, https://substackcdn.com/image/fetch/$s_!swH7!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc024d326-769a-44e2-9fc2-690f61c7e0a1_1280x698.jpeg 848w, https://substackcdn.com/image/fetch/$s_!swH7!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc024d326-769a-44e2-9fc2-690f61c7e0a1_1280x698.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!swH7!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc024d326-769a-44e2-9fc2-690f61c7e0a1_1280x698.jpeg 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Every GPU programmer will tell you that you can&#8217;t beat cuBLAS at matrix multiplication. Matmul is the most popular operation by a large margin, and NVIDIA engineers have squeezed their GPUs dry. Of course, that doesn&#8217;t stop thousands of engineers, including myself, from playing this unfair sport.</p><p>I started this project as a learning exercise: write an FP32 SGEMM kernel for the RTX 5090 (Blackwell, sm_120) using the new TMA hardware introduced in Hopper and reach 80&#8211;90% of cuBLAS performance. That was the plan.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://kernelspace.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading Kernel Space! 
Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><p>While benchmarking, I found that the <strong>batched-mode numbers on the 5090 were 50&#8211;60% higher than cuBLAS</strong> at sizes from 1024 to 8192. That seemed suspiciously good for a learning exercise. So I profiled with <code>ncu</code> to see what was happening, and found that <strong>cuBLAS was dispatching a tiny </strong><code>simt_sgemm_128x32_8x5</code><strong> kernel</strong> for the entire range of batched workloads, running at only 41% FMA pipe utilization (essentially using only 41% of available compute throughput).</p><p>I double-checked it on other GPUs and found out that the same <code>libcublas.so</code> binary uses a larger <code>simt_sgemm_256x128_8x4</code> kernel at 73% FMA pipe utilization on the RTX PRO 6000, and an even better <code>xmma_gemm</code> family at 82% on the H200. RTX GPUs clearly receive less love from NVIDIA.</p><p>The reference kernel that I wrote using TMA (Tensor Memory Accelerator) &#8212; the hardware feature introduced in Hopper and carried forward to Blackwell &#8212; is still interesting. It achieves ~68% FMA pipe utilization with ~300 lines of generated C, whereas CUTLASS&#8217;s hand-tuned kernels require thousands of lines of templates to reach 73%. 
I will break down the TMA implementation to show how you can hit ~80&#8211;120% of cuBLAS performance with 10x less code than traditional templated approaches.</p><p>Data and implementation are available in the <a href="https://github.com/cloudrift-ai/deplodock">DeploDock</a> repository &#8212; GPU and LLM deployment, benchmarking, and optimization framework.</p><h2>The Headlines</h2><p><strong>Single matmul on RTX 5090</strong> &#8212; my kernel matches cuBLAS within 5 percentage points of FMA pipe utilization on every size (256 and 512 omitted; the per-call duration is too short and measurement variance is too high):</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ehvw!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F414ed090-2dc2-4ad9-92a2-2cd1c3492764_1400x491.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ehvw!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F414ed090-2dc2-4ad9-92a2-2cd1c3492764_1400x491.png 424w, https://substackcdn.com/image/fetch/$s_!ehvw!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F414ed090-2dc2-4ad9-92a2-2cd1c3492764_1400x491.png 848w, https://substackcdn.com/image/fetch/$s_!ehvw!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F414ed090-2dc2-4ad9-92a2-2cd1c3492764_1400x491.png 1272w, https://substackcdn.com/image/fetch/$s_!ehvw!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F414ed090-2dc2-4ad9-92a2-2cd1c3492764_1400x491.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!ehvw!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F414ed090-2dc2-4ad9-92a2-2cd1c3492764_1400x491.png" width="1400" height="491" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/414ed090-2dc2-4ad9-92a2-2cd1c3492764_1400x491.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:491,&quot;width&quot;:1400,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!ehvw!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F414ed090-2dc2-4ad9-92a2-2cd1c3492764_1400x491.png 424w, https://substackcdn.com/image/fetch/$s_!ehvw!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F414ed090-2dc2-4ad9-92a2-2cd1c3492764_1400x491.png 848w, https://substackcdn.com/image/fetch/$s_!ehvw!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F414ed090-2dc2-4ad9-92a2-2cd1c3492764_1400x491.png 1272w, https://substackcdn.com/image/fetch/$s_!ehvw!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F414ed090-2dc2-4ad9-92a2-2cd1c3492764_1400x491.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft 
pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><strong>Batched matmul on RTX 5090</strong>&#8211;1.4&#8211;1.7&#215; cuBLAS across the board:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!7out!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe450262a-1895-49c6-bf9b-66c73c3dd50b_1400x573.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!7out!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe450262a-1895-49c6-bf9b-66c73c3dd50b_1400x573.png 424w, 
https://substackcdn.com/image/fetch/$s_!7out!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe450262a-1895-49c6-bf9b-66c73c3dd50b_1400x573.png 848w, https://substackcdn.com/image/fetch/$s_!7out!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe450262a-1895-49c6-bf9b-66c73c3dd50b_1400x573.png 1272w, https://substackcdn.com/image/fetch/$s_!7out!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe450262a-1895-49c6-bf9b-66c73c3dd50b_1400x573.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!7out!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe450262a-1895-49c6-bf9b-66c73c3dd50b_1400x573.png" width="1400" height="573" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e450262a-1895-49c6-bf9b-66c73c3dd50b_1400x573.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:573,&quot;width&quot;:1400,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!7out!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe450262a-1895-49c6-bf9b-66c73c3dd50b_1400x573.png 424w, 
https://substackcdn.com/image/fetch/$s_!7out!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe450262a-1895-49c6-bf9b-66c73c3dd50b_1400x573.png 848w, https://substackcdn.com/image/fetch/$s_!7out!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe450262a-1895-49c6-bf9b-66c73c3dd50b_1400x573.png 1272w, https://substackcdn.com/image/fetch/$s_!7out!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe450262a-1895-49c6-bf9b-66c73c3dd50b_1400x573.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line 
x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>The batched table is where it gets weird. Here&#8217;s the same <code>cublasSgemmStridedBatched</code> call on three different sm_90/sm_120 GPUs at batch=8, captured by <code>ncu</code>, showing the dispatched kernel and its FMA pipe utilization:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!24B2!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feee3c159-68c8-4744-a733-f28e29caf2aa_1400x579.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!24B2!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feee3c159-68c8-4744-a733-f28e29caf2aa_1400x579.png 424w, https://substackcdn.com/image/fetch/$s_!24B2!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feee3c159-68c8-4744-a733-f28e29caf2aa_1400x579.png 848w, https://substackcdn.com/image/fetch/$s_!24B2!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feee3c159-68c8-4744-a733-f28e29caf2aa_1400x579.png 1272w, https://substackcdn.com/image/fetch/$s_!24B2!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feee3c159-68c8-4744-a733-f28e29caf2aa_1400x579.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!24B2!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feee3c159-68c8-4744-a733-f28e29caf2aa_1400x579.png" width="1400" height="579" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/eee3c159-68c8-4744-a733-f28e29caf2aa_1400x579.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:579,&quot;width&quot;:1400,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!24B2!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feee3c159-68c8-4744-a733-f28e29caf2aa_1400x579.png 424w, https://substackcdn.com/image/fetch/$s_!24B2!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feee3c159-68c8-4744-a733-f28e29caf2aa_1400x579.png 848w, https://substackcdn.com/image/fetch/$s_!24B2!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feee3c159-68c8-4744-a733-f28e29caf2aa_1400x579.png 1272w, https://substackcdn.com/image/fetch/$s_!24B2!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feee3c159-68c8-4744-a733-f28e29caf2aa_1400x579.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>It is no surprise that cuBLAS schedules a different kernel for different matrix sizes. Kernels might perform differently on different matrix sizes, so cuBLAS tries to choose the best one. However, the behavior on different GPUs is quite different.</p><ul><li><p><strong>H200 (Hopper, sm_90)</strong> mixes the open-source CUTLASS template family at 1024&#8211;2048 with NVIDIA&#8217;s closed-source <code>xmma_gemm</code> family at 4096+. Within <code>xmma_gemm</code> it picks three different tile sizes (32&#215;32 &#8594; 64&#215;128 &#8594; 128&#215;128) escalating with workload. Peak hits 82% FMA pipe utilization.</p></li><li><p><strong>RTX PRO 6000 Blackwell Max-Q (sm_120)</strong> escalates within the CUTLASS family across three tile sizes (128&#215;64 &#8594; 128&#215;128 &#8594; 256&#215;128), increasing FMA pipe utilization from 64% to 73%. Less sophisticated than H200, but still good. 
The one bug: at 256 / 512, it falls into a legacy <code>magma_sgemmEx_kernel</code> code path at 32% FMA pipe util. (MAGMA was NVIDIA&#8217;s pre-CUTLASS linear algebra library from the early 2010s, largely absorbed into cuBLAS &#8212; the fact that its kernels still appear in the dispatch path on a 2026 GPU is a window into how deep the legacy debt goes in NVIDIA&#8217;s stack.)</p></li><li><p><strong>RTX 5090 (sm_120)</strong> picks the same <code>simt_sgemm_128x32_8x5</code> kernel for <strong>every</strong> workload from 256&#215;256 (&#8776;microseconds per call) all the way to 8192&#215;8192&#215;8 batches (&#8776;0.5 seconds per call). FMA pipe utilization stuck in the 33&#8211;42% band across the entire 32&#215; range of linear dimensions.</p></li></ul><p>This isn&#8217;t a wrong threshold somewhere in the dispatcher. There&#8217;s no escalation at any threshold. The escalation logic for the 5090 sm_120 batched FP32 path is missing entirely.</p><blockquote><p><em>I haven&#8217;t tested kernels on other RTX GPUs like 5070 or 4090, but it is likely that the bug is present on all consumer GPUs.</em></p></blockquote><h2>What About cuBLASLt and Tensor Cores?</h2><p><a href="https://docs.nvidia.com/cuda/cublas/index.html#cublaslt-api">cuBLASLt</a> is NVIDIA&#8217;s &#8220;lightweight&#8221; BLAS API that exposes more control than plain cuBLAS &#8212; you can query available algorithms, set workspace sizes, and force specific compute modes. A natural question: can cuBLASLt&#8217;s algorithm heuristics route around the 5090 batched-dispatch bug? 
And what about hybrid approaches using tensor cores with FP32 accumulators?</p><p>I tested all cuBLASLt compute modes at 4096&#215;4096:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!YLOP!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4880c6c7-f6c8-4fa9-b93d-7cf4a5a6bace_1400x315.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!YLOP!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4880c6c7-f6c8-4fa9-b93d-7cf4a5a6bace_1400x315.png 424w, https://substackcdn.com/image/fetch/$s_!YLOP!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4880c6c7-f6c8-4fa9-b93d-7cf4a5a6bace_1400x315.png 848w, https://substackcdn.com/image/fetch/$s_!YLOP!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4880c6c7-f6c8-4fa9-b93d-7cf4a5a6bace_1400x315.png 1272w, https://substackcdn.com/image/fetch/$s_!YLOP!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4880c6c7-f6c8-4fa9-b93d-7cf4a5a6bace_1400x315.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!YLOP!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4880c6c7-f6c8-4fa9-b93d-7cf4a5a6bace_1400x315.png" width="1400" height="315" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4880c6c7-f6c8-4fa9-b93d-7cf4a5a6bace_1400x315.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:315,&quot;width&quot;:1400,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!YLOP!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4880c6c7-f6c8-4fa9-b93d-7cf4a5a6bace_1400x315.png 424w, https://substackcdn.com/image/fetch/$s_!YLOP!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4880c6c7-f6c8-4fa9-b93d-7cf4a5a6bace_1400x315.png 848w, https://substackcdn.com/image/fetch/$s_!YLOP!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4880c6c7-f6c8-4fa9-b93d-7cf4a5a6bace_1400x315.png 1272w, https://substackcdn.com/image/fetch/$s_!YLOP!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4880c6c7-f6c8-4fa9-b93d-7cf4a5a6bace_1400x315.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>The FP32 path is <strong>locked to SIMT</strong> regardless of heuristic settings &#8212; cuBLASLt picks a <code>simt_sgemm</code> with a different tile (128&#215;128 vs 256&#215;128) but still cooperative <code>cp.async</code> loading, no TMA, and still no path to the dispatcher&#8217;s broken heuristic. 
The FAST_TF32/BF16 modes switch to tensor cores (<code>tensorop_s1688gemm</code>) and are 36&#8211;48% faster &#8212; but with reduced input precision. Note all three are <code>cutlass_80</code>-prefixed Ampere-era kernels.</p><p>For strict FP32 accuracy, tensor cores aren&#8217;t an option. TF32 truncates to a 10-bit mantissa; BF16 to a 7-bit mantissa. If your workload tolerates ~0.1% relative error, FAST_TF32 is the pragmatic choice. For exact FP32, the cuBLAS dispatcher bug applies, and there&#8217;s no public API to work around it without writing your own kernel.</p><h2>Where My Kernel Fits In</h2><p>My TMA template hits ~68% FMA pipe utilization on every Blackwell SKU we tested and ~71% on Hopper, which means it is about 10% behind cuBLAS, which chooses an efficient kernel, and 40&#8211;60% ahead of underperforming kernels on RTX 5090.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ccjA!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb351068d-60f7-4191-9b1b-b45b2c6754c6_1400x756.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ccjA!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb351068d-60f7-4191-9b1b-b45b2c6754c6_1400x756.png 424w, https://substackcdn.com/image/fetch/$s_!ccjA!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb351068d-60f7-4191-9b1b-b45b2c6754c6_1400x756.png 848w, https://substackcdn.com/image/fetch/$s_!ccjA!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb351068d-60f7-4191-9b1b-b45b2c6754c6_1400x756.png 1272w, 
https://substackcdn.com/image/fetch/$s_!ccjA!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb351068d-60f7-4191-9b1b-b45b2c6754c6_1400x756.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!ccjA!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb351068d-60f7-4191-9b1b-b45b2c6754c6_1400x756.png" width="1400" height="756" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b351068d-60f7-4191-9b1b-b45b2c6754c6_1400x756.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:756,&quot;width&quot;:1400,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!ccjA!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb351068d-60f7-4191-9b1b-b45b2c6754c6_1400x756.png 424w, https://substackcdn.com/image/fetch/$s_!ccjA!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb351068d-60f7-4191-9b1b-b45b2c6754c6_1400x756.png 848w, https://substackcdn.com/image/fetch/$s_!ccjA!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb351068d-60f7-4191-9b1b-b45b2c6754c6_1400x756.png 1272w, 
https://substackcdn.com/image/fetch/$s_!ccjA!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb351068d-60f7-4191-9b1b-b45b2c6754c6_1400x756.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>I will introduce my kernel now and the technologies used to make it work. To enjoy the following section of this article, it is good to familiarize yourself with general techniques for optimizing the Matmul kernel on the GPU. 
The <a href="https://siboehm.com/articles/22/CUDA-MMM">How to Optimize a CUDA Matmul Kernel for cuBLAS-like Performance</a> by Simon Boehm is a good starting point.</p><h2>TMA vs LDGSTS: A Quick Primer</h2><p>If you&#8217;re not deep in GPU programming, here&#8217;s the key distinction.</p><p>High-performance matmul kernels tile the computation: load a block of A and B from global memory (slow, ~1 TB/s) into shared memory (fast, ~20 TB/s), compute the partial products from shared memory, repeat. The bottleneck is the efficient transfer of data from global to shared memory. NVIDIA provides two hardware mechanisms for this &#8212; both <a href="https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#asynchronous-data-copies">asynchronous memory copies</a> that bypass registers and transfer data directly.</p><p><strong>LDGSTS (</strong><code>cp.async</code><strong>)</strong> is the traditional way to load data from global memory into shared memory. Every thread in the block participates: thread 0 loads element 0, thread 1 loads element 1, and so on. It&#8217;s cooperative &#8212; 256 threads each issue their own load instruction, generating 256 individual memory transactions. The hardware coalesces these into efficient cache-line transfers, but each thread still spends instruction slots on address computation, load issuance, and shared memory stores. CUTLASS and cuBLAS have used this approach since Ampere.</p><p><strong>TMA (Tensor Memory Accelerator)</strong> is new hardware introduced in Hopper (sm_90) and refined in Blackwell (sm_120). Instead of 256 threads each loading one element, a single thread issues one <code>cp.async.bulk.tensor.2d</code> command that describes the entire 2D tile &#8212; <em>&#8220;load a 32&#215;224 block of floats starting at row R, column C.&#8221;</em> The TMA hardware unit, separate from the CUDA cores, handles the entire transfer via DMA. 
The other 255 threads contribute zero instructions &#8212; they can compute while the load happens in the background. Blackwell&#8217;s TMA unit shares the same PTX interface as Hopper&#8217;s; the practical difference I observed is that Blackwell favors larger per-thread tiles (TM=28 is optimal on the 5090 vs TM=8 on the H200), suggesting the sm_120 TMA unit has lower per-issue overhead or better pipelining of concurrent descriptors.</p><p>In principle, TMA should be faster than LDGSTS because it removes per-thread loading overhead. In practice, on the workloads I measured, <strong>TMA and well-tuned LDGSTS land within 5% of each other</strong> at the FMA pipe utilization level. What TMA actually buys you is <strong>kernel implementation simplicity</strong> &#8212; you can write a fully-pipelined SGEMM kernel in ~300 lines of generated C, vs the thousands of lines of templated C++ that CUTLASS needs.</p><h2>The TMA Double-Buffer Architecture</h2><p>On Blackwell, the TMA hardware unit can load a 2D tile from global memory to shared memory with a single PTX instruction. One thread issues <code>cp.async.bulk.tensor.2d</code>, the hardware does the rest.</p><p>The kernel uses this in a double-buffer pipeline:</p><pre><code>Tile 0: [TMA loads buf0] [wait] [compute buf0 + TMA loads buf1]
Tile 1:                         [wait buf1] [compute buf1 + TMA loads buf0]
Tile 2:                                     [wait buf0] [compute buf0 + TMA loads buf1]
...</code></pre><p>While all 256 threads compute FMAs from the current buffer, thread 0 issues a TMA command to fill the other buffer. The TMA hardware runs on a separate path from the CUDA cores &#8212; true parallel execution.</p><p>Here&#8217;s the simplified kernel structure (TM=8 variant, 32 accumulators):</p><pre><code>__global__ __launch_bounds__(256)
void fused_matmul(
    const __grid_constant__ CUtensorMap A_tma,
    const __grid_constant__ CUtensorMap B_tma,
    float* C)
{
    extern __shared__ __align__(128) char dsmem[];
    float* smem = (float*)dsmem;
    // Two mbarriers for double-buffer synchronization
    uint64_t* mbar = (uint64_t*)(dsmem + 2 * STAGE * 4);

    // Shared memory addresses for TMA targets
    const int as0 = __cvta_generic_to_shared(&amp;smem[0]);
    const int bs0 = __cvta_generic_to_shared(&amp;smem[A_SIZE]);
    const int as1 = __cvta_generic_to_shared(&amp;smem[STAGE]);
    const int bs1 = __cvta_generic_to_shared(&amp;smem[STAGE + A_SIZE]);

    // Thread identity
    int tid = threadIdx.y * 32 + threadIdx.x;
    int tr = threadIdx.y * TM, tc = threadIdx.x * 4;
    int bm = blockIdx.y * BM, bn = blockIdx.x * BN;

    // Initialize mbarriers (thread 0 only)
    if (tid == 0) {
        mbarrier_init(mbar[0]); mbarrier_init(mbar[1]);
    }
    __syncthreads();

    float c[TM][4] = {};  // Accumulators
    // Pre-load first tile
    if (tid == 0) {
        mbarrier_expect_tx(mbar[0], BYTES);
        tma_load_2d(as0, &amp;A_tma, /*k=*/0, bm, mbar[0]);
        tma_load_2d(bs0, &amp;B_tma, bn, /*k=*/0, mbar[0]);
    }

    for (int t = 0; t &lt; K/BK; t++) {
        int s = t % 2;  // Current buffer

        // Wait for current tile&#8217;s TMA to complete
        mbarrier_wait(mbar[s], phase[s]);

        // Start loading NEXT tile (overlaps with compute)
        if (tid == 0 &amp;&amp; t + 1 &lt; nt) {
            tma_load_2d(next_buf_a, &amp;A_tma, next_k, bm, next_mbar);
            tma_load_2d(next_buf_b, &amp;B_tma, bn, next_k, next_mbar);
        }

        // Compute: all 256 threads do FMA from shared memory
        float* As = &amp;smem[s * STAGE];
        float* Bs = &amp;smem[s * STAGE + A_SIZE];
        #pragma unroll
        for (int kk = 0; kk &lt; BK; kk++) {
            float b0 = Bs[kk*BN+tc], b1 = Bs[kk*BN+tc+1], ...;
            for (int i = 0; i &lt; TM; i++) {
                float a = As[(tr+i)*BK+kk];
                c[i][0] += a * b0;
                c[i][1] += a * b1;
                // ... 4 FMAs per row
            }
        }
        __syncthreads();
    }

    // Write results to global memory
    for (int i = 0; i &lt; TM; i++)
        store_row(C, bm+tr+i, bn+tc, c[i]);
}</code></pre><p>The generated kernel is denser (with inline PTX for mbarrier and TMA operations), but this captures the architecture. You can find the kernel generator in <code>deplodock/compiler/cuda/lower.py</code> (see <code>_lower_matmul_tma_db</code>).</p><h2>Compile-Time Specialization</h2><p>A useful optimization is to make <strong>M, N, K </strong><code>#define</code><strong> constants</strong>, and not kernel parameters. The runner generates a fresh <code>.cu</code> file for each benchmark invocation with the actual dimensions baked in:</p><pre><code>#define M 8192
#define N 8192
#define K 8192

__global__ void fused_matmul(...) {
    // nt = K/32 becomes nt = 256, a literal constant
    // Bounds checks become dead code for aligned sizes
}</code></pre><p>This lets nvcc optimize the tile loop bound, eliminate unreachable branches, and make better register allocation decisions. Moving M/N/K from runtime parameters to compile-time constants improved 1024 from 98% to 101% and 4096 from 100% to 101%.</p><p>Similarly, the write-back bounds checks (<code>if (gr &lt; M)</code>, <code>if (gc &lt; N)</code>) are eliminated via <code>#if</code> when the matrix dimensions are multiples of the tile size:</p><pre><code>#if (M % 224 == 0 &amp;&amp; N % 128 == 0)
#define W(r, v0, v1, v2, v3) { /* no bounds checks */ }
#else
#define W(r, v0, v1, v2, v3) { /* with bounds checks */ }
#endif</code></pre><p>In practice, this compile-time specialization makes sense for a set of common sizes (powers of 2, standard model dimensions) rather than generating a kernel for every possible M/N/K.</p><h2>Size-Adaptive Tile Selection</h2><p>Different matrix sizes have different optimal tile shapes. An adaptive strategy map selects the best configuration per size. These were found empirically by benchmarking various configurations on each GPU. For RTX 5090:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!RTKy!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1c82c666-3741-4dad-9c01-48f150458031_1400x491.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!RTKy!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1c82c666-3741-4dad-9c01-48f150458031_1400x491.png 424w, https://substackcdn.com/image/fetch/$s_!RTKy!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1c82c666-3741-4dad-9c01-48f150458031_1400x491.png 848w, https://substackcdn.com/image/fetch/$s_!RTKy!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1c82c666-3741-4dad-9c01-48f150458031_1400x491.png 1272w, https://substackcdn.com/image/fetch/$s_!RTKy!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1c82c666-3741-4dad-9c01-48f150458031_1400x491.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!RTKy!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1c82c666-3741-4dad-9c01-48f150458031_1400x491.png" width="1400" height="491" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/1c82c666-3741-4dad-9c01-48f150458031_1400x491.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:491,&quot;width&quot;:1400,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!RTKy!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1c82c666-3741-4dad-9c01-48f150458031_1400x491.png 424w, https://substackcdn.com/image/fetch/$s_!RTKy!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1c82c666-3741-4dad-9c01-48f150458031_1400x491.png 848w, https://substackcdn.com/image/fetch/$s_!RTKy!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1c82c666-3741-4dad-9c01-48f150458031_1400x491.png 1272w, https://substackcdn.com/image/fetch/$s_!RTKy!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1c82c666-3741-4dad-9c01-48f150458031_1400x491.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft 
pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>At TM=28, each thread computes 28&#215;4 = 112 output elements, requiring 112 accumulator registers. The inner loop is fully unrolled (BK=32 iterations), and the compiler uses 241 registers total &#8212; close to the sm_120 limit of 255.</p><h2>CTA Swizzle for L2 Cache Reuse</h2><p>At 16384&#215;16384, the working set is 3 GB &#8212; far beyond the 72 MB L2 cache. Without careful block scheduling, different blocks evict each other&#8217;s L2 lines. I linearize the grid and rasterize in groups of 8 row-tiles:</p><pre><code>const int SWIZ = 8;
int pid = blockIdx.x;
int grp = pid / (gridDim.x * SWIZ);
int rem = pid % (gridDim.x * SWIZ);
int by = grp * SWIZ + rem % SWIZ;   // Row block
int bx = rem / SWIZ;                // Column block</code></pre><p>This ensures 8 adjacent row-blocks run together, maximizing reuse of A-tiles in L2. ncu profiling confirmed this reduced DRAM throughput from 32% to 8.5% &#8212; matching cuBLAS&#8217;s 8.2%.</p><h2>Batched mode</h2><p>Batched mode is a one-line change: <code>blockIdx.z</code> selects the batch element, each batch gets its own TMA descriptor:</p><pre><code>int batch = blockIdx.z;
const CUtensorMap&amp; A_tma = A_tma_array[batch];
// ... same kernel, different data
float* C_batch = C + batch * M * N;</code></pre><h2>NCU Comparison with cuBLAS</h2><p>All measurements: CUDA 13.2.51, cuBLAS 13.3.0, driver 595.58.03, captured by <code>scripts/diagnostics/ncu_compare.sh</code>.</p><p>A quick glossary for the metrics: <strong>IPC</strong> (instructions per cycle) measures how many instructions the SM issues per clock &#8212; higher is better, max ~4.0 on sm_120. <strong>FMA pipe</strong> is the percentage of cycles the fused multiply-add units are active &#8212; this is the actual compute throughput. <strong>Issue active</strong> is the percentage of cycles where at least one warp scheduler successfully issues an instruction &#8212; gaps here mean all warps are stalled.</p><p>For single matmul at 8192:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!LdTP!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdfbcfb81-c4e3-44ee-b7ba-b1f508dbf838_1400x1014.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!LdTP!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdfbcfb81-c4e3-44ee-b7ba-b1f508dbf838_1400x1014.png 424w, https://substackcdn.com/image/fetch/$s_!LdTP!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdfbcfb81-c4e3-44ee-b7ba-b1f508dbf838_1400x1014.png 848w, https://substackcdn.com/image/fetch/$s_!LdTP!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdfbcfb81-c4e3-44ee-b7ba-b1f508dbf838_1400x1014.png 1272w, 
https://substackcdn.com/image/fetch/$s_!LdTP!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdfbcfb81-c4e3-44ee-b7ba-b1f508dbf838_1400x1014.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!LdTP!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdfbcfb81-c4e3-44ee-b7ba-b1f508dbf838_1400x1014.png" width="1400" height="1014" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/dfbcfb81-c4e3-44ee-b7ba-b1f508dbf838_1400x1014.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1014,&quot;width&quot;:1400,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!LdTP!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdfbcfb81-c4e3-44ee-b7ba-b1f508dbf838_1400x1014.png 424w, https://substackcdn.com/image/fetch/$s_!LdTP!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdfbcfb81-c4e3-44ee-b7ba-b1f508dbf838_1400x1014.png 848w, https://substackcdn.com/image/fetch/$s_!LdTP!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdfbcfb81-c4e3-44ee-b7ba-b1f508dbf838_1400x1014.png 1272w, 
https://substackcdn.com/image/fetch/$s_!LdTP!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdfbcfb81-c4e3-44ee-b7ba-b1f508dbf838_1400x1014.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><strong>The performance gap maps directly to the FMA pipe gap.</strong> On the 5090 single mode, cuBLAS hits 72.9% FMA pipe utilization vs my 68.0% &#8212; a ~5% gap, which matches the ~5% efficiency gap in the headline tables (cycles_active: 34.8 M vs 37.6 M = 92.6% ratio). 
On the H200 single mode, cuBLAS hits 79.2% vs my 71.3% &#8212; a ~8% gap, matching the ~10% efficiency gap (cycles_active: 41.1 M vs 45.7 M = 89.9% ratio). It&#8217;s not bandwidth (DRAM throughput is 5&#8211;8%, nowhere near the limit). It&#8217;s not issue-active (both kernels are at 100%). It&#8217;s purely how many of the issued instructions actually land in the FMA pipe.</p><h2>Cross-checking Against the Headline Efficiencies</h2><p>The batched mode comparison is already covered in the headline section. The findings are the same: the performance gap maps directly to the FMA pipe gap. Putting the per-arch FMA pipe numbers next to the observed efficiencies:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!DYqB!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd3ab253-abe9-4fc6-9c86-fb7cf0a7d598_1400x849.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!DYqB!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd3ab253-abe9-4fc6-9c86-fb7cf0a7d598_1400x849.png 424w, https://substackcdn.com/image/fetch/$s_!DYqB!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd3ab253-abe9-4fc6-9c86-fb7cf0a7d598_1400x849.png 848w, https://substackcdn.com/image/fetch/$s_!DYqB!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd3ab253-abe9-4fc6-9c86-fb7cf0a7d598_1400x849.png 1272w, 
https://substackcdn.com/image/fetch/$s_!DYqB!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd3ab253-abe9-4fc6-9c86-fb7cf0a7d598_1400x849.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!DYqB!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd3ab253-abe9-4fc6-9c86-fb7cf0a7d598_1400x849.png" width="1400" height="849" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/cd3ab253-abe9-4fc6-9c86-fb7cf0a7d598_1400x849.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:849,&quot;width&quot;:1400,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!DYqB!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd3ab253-abe9-4fc6-9c86-fb7cf0a7d598_1400x849.png 424w, https://substackcdn.com/image/fetch/$s_!DYqB!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd3ab253-abe9-4fc6-9c86-fb7cf0a7d598_1400x849.png 848w, https://substackcdn.com/image/fetch/$s_!DYqB!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd3ab253-abe9-4fc6-9c86-fb7cf0a7d598_1400x849.png 1272w, 
https://substackcdn.com/image/fetch/$s_!DYqB!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd3ab253-abe9-4fc6-9c86-fb7cf0a7d598_1400x849.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>The full ncu sweep with raw per-row data is committed in the recipe directory at <code>recipes/sgemm_cublas_vs_tma/ncu/batched_dispatch_finding.md</code>, reproducible with <code>scripts/diagnostics/ncu_compare.sh</code>.</p><h2>We Need to Go Deeper: Beyond PTX</h2><p>The article so far has the headline numbers and the dispatcher-bug evidence. 
Both are about <em>which</em> kernel runs and how the GPU dispatches it. Neither answers the more refined question: when cuBLAS <em>does</em> dispatch the right kernel &#8212; <code>cutlass_80_simt_sgemm_256x128_8x4_nn_align1</code> on the 5090 single-mode path &#8212; why does it consistently hit ~73% FMA pipe utilization while my generated TMA template gets ~68%? This section is the SASS-level investigation that produced a measured answer, and walks through the reproducible scripts that surface it.</p><p>All numbers below are from running on an RTX 5090, CUDA 13.2.51, cuBLAS 13.3.0, single mode, 8192&#215;8192. The full per-instruction <code>ncu</code> source view, the per-kernel stall histograms, and the inner-loop SASS excerpts are committed in <code>recipes/sgemm_cublas_vs_tma/ncu/scheduling/</code> so you can re-derive them yourself.</p><h2>Static Instruction Histograms</h2><p><code>scripts/diagnostics/sass_analysis.py</code> compiles a fresh <code>tma_db</code> bench binary, runs <code>cuobjdump --dump-sass</code>, and counts opcodes by family. For my <code>fused_matmul</code> (TM=28, BK=32) at 8192&#215;8192:</p><pre><code>3584 FFMA               &#8212; fused multiply-add (the actual compute)
  256 LDS.128           &#8212; 128-bit shared memory loads (float4)
  112 STG.E             &#8212; 32-bit predicated global stores (4 per row &#215; 28 rows,
                          bounds-checked because 8192 % 224 &#8800; 0; an aligned size
                          collapses to 28 STG.E.128 via nvcc auto-vectorization)
   48 CS2R / S2R        &#8212; clock + special-register reads
    4 UTMALDG.2D        &#8212; TMA load commands (the entire loading!)
  143 ISETP.*           &#8212; integer set-predicate (bounds + loop control)
   60 BAR/BSYNC/BSSY    &#8212; block barriers and reconvergence
   30 LDC.*             &#8212; constant loads (kernel params, TMA descriptors)
  169 MOV/IMAD/IADD/LEA &#8212; address arithmetic and reg copies</code></pre><p>cuBLAS&#8217;s <code>cutlass_80_simt_sgemm_256x128_8x4_nn_align1</code> ships as a PTX template inside <code>libcublasLt.so</code> and JIT-compiles at runtime. To see its actual SASS, capture the cubin via <code>ncu</code> while the kernel is running (<code>ncu --set full --print-units base -o profile.ncu-rep ./cublas_probe</code>, then <code>ncu --import profile.ncu-rep --page source --print-source sass</code>). The result has <strong>1152 FFMAs and 256 LDS</strong> in the kernel body &#8212; a third of my FFMA count, because cuBLAS uses a smaller per-thread tile (more CTAs, fewer FMAs per thread). The notable structural fact: <strong>0 shared-memory store instructions in mine, 256 in cuBLAS</strong> (cuBLAS pipelines through smem with <code>cp.async</code> + <code>st.shared</code>, my TMA hardware writes smem directly via DMA). That&#8217;s the one place TMA buys a real instruction-count saving.</p><h2>LDS-to-consumer Scheduling</h2><p>The first hypothesis I tested was the obvious one: maybe ptxas places LDS loads too close to their consumer FFMAs, and the warp scheduler stalls waiting for the load to complete. To check, I extended the diagnostic with <code>scripts/diagnostics/scheduling_analysis.py</code> &#8212; it parses the disassembly, walks each <code>LDS.*</code> forward through the instruction stream, and finds the first downstream <code>FFMA</code> that uses any of the loaded registers. The distance between the load and its consumer is your latency-hiding budget.</p><p>For my <code>fused_matmul</code> at 8192:</p><table><thead><tr><th>FFMAs between LDS and first consumer</th><th>Count</th></tr></thead><tbody><tr><td>[0, 5)</td><td>3</td></tr><tr><td>[5, 10)</td><td>1</td></tr><tr><td>[10, 20)</td><td>13</td></tr><tr><td>[20, 40)</td><td>110</td></tr><tr><td>[40, 80)</td><td>117</td></tr><tr><td>[80, 160)</td><td>11</td></tr><tr><td>[160, &#8734;)</td><td>1</td></tr></tbody></table><p><strong>Median: 40 FFMAs; mean: 44.6. Only 4 of 256 LDS have a consumer within 10 FFMAs.</strong> Blackwell LDS latency is on the order of 30 cycles, and each FFMA is one cycle on the FMA pipe, so ptxas essentially hides LDS latency perfectly. 
The &#8220;ptxas places LDS too close to consumers&#8221; hypothesis was wrong. <em>That&#8217;s not where the gap is.</em></p><p>Running the same analysis on the cuBLAS kernel (extracted from the <code>ncu</code> source view) gives a completely different shape:</p><table><thead><tr><th></th><th>mine</th><th>cuBLAS</th></tr></thead><tbody><tr><td>FFMAs in kernel body</td><td>3584</td><td>1152</td></tr><tr><td>LDS instructions</td><td>256</td><td>256</td></tr><tr><td>LDS / FFMA ratio</td><td>1 per 14</td><td><strong>1 per 4.5</strong></td></tr><tr><td>Median LDS &#8594; first consumer</td><td>40 FFMAs</td><td><strong>158 FFMAs</strong></td></tr><tr><td>Median LDS &#8594; next LDS spacing</td><td>5 FFMAs</td><td><strong>0 FFMAs</strong></td></tr></tbody></table><p>cuBLAS&#8217;s median LDS-to-next-LDS spacing is <strong>zero</strong> &#8212; its LDS instructions are clustered into back-to-back groups. My kernel evenly distributes LDS across the FFMA cluster, with a median of 5 FFMAs between consecutive loads. Both schedules hide LDS latency well (40 vs. 158 FFMAs), but they produce <em>fundamentally different</em> warp behaviors at runtime.</p><p>The difference matters because of how it affects <em>warp staggering</em> across the SM&#8217;s warp schedulers:</p><pre><code>cuBLAS schedule (clustered LDS):
  warp 0: [LDS LDS LDS LDS LDS LDS] [FFMA FFMA FFMA FFMA FFMA FFMA FFMA ...]
  warp 1:        [LDS LDS LDS LDS LDS LDS] [FFMA FFMA FFMA FFMA FFMA ...]
  warp 2:               [LDS LDS LDS LDS LDS LDS] [FFMA FFMA FFMA FFMA ...]
           &#8592; warps naturally stagger: while warp 0 does FFMAs, warp 1 is in
             its LDS cluster, so the FMA pipe sees steady demand from ONE warp
             at a time &#8594; low dispatch_stall

My schedule (interleaved LDS):
  warp 0: [FFMA FFMA FFMA FFMA LDS FFMA FFMA FFMA FFMA LDS FFMA FFMA ...]
  warp 1: [FFMA FFMA FFMA FFMA LDS FFMA FFMA FFMA FFMA LDS FFMA FFMA ...]
  warp 2: [FFMA FFMA FFMA FFMA LDS FFMA FFMA FFMA FFMA LDS FFMA FFMA ...]
           &#8592; all warps are in the SAME phase: they all want the FMA pipe on
             the same cycles &#8594; high dispatch_stall (44% vs 22%)</code></pre><p>You can see the difference in the actual inner-loop SASS excerpts. Here&#8217;s a 30-line slice from cuBLAS&#8217;s <code>cutlass_80_simt_sgemm_256x128</code> (<code>recipes/sgemm_cublas_vs_tma/ncu/scheduling/cublas_inner_loop_excerpt.txt</code>):</p><pre><code>**LDS.128 R132, [R130]**            &#8592; 6 LDS in a row (clustered)
  LDCU.64 UR16, c[0x0][0x3c0]
  SHF.R.U32.HI R30, RZ, 0x1, R131
  IADD.64 R200, R200, UR4
**LDS.128 R140, [R130+0x40]**
  LDCU.64 UR14, c[0x0][0x3e0]
  LOP3.LUT R185, R185, 0xffc, R30, 0xc8, !PT
  IADD.64 R204, R204, UR10
**LDS.128 R144, [R130+0x80]**
  ISETP.NE.AND P0, PT, R189, RZ, PT
  MOV R186, RZ
**LDS.128 R148, [R130+0xc0]**
  ...                              &#8592; then the FFMA burst
**LDS.128 R156, [R130+0x200]**
  FFMA R127, R132, R136, R127      &#8592; consumer arrives 158 FFMAs later in steady state
  FFMA R128, R133, R136, R128
  FFMA R126, R133, R137, R126
  FFMA R125, R132, R137, R125
  FFMA R123, R134, R136, R123
  ... (long FFMA run with occasional single LDS interleaved)</code></pre><p>And here&#8217;s mine (<code>recipes/sgemm_cublas_vs_tma/ncu/scheduling/fused_matmul_inner_loop_excerpt.txt</code>):</p><pre><code>FFMA R140, R36, R144, R159       &#8592; FFMAs running
  FFMA R158, R37, R144, R158
  FFMA R161, R38, R144, R161
  FFMA R160, R39, R144, R160
  FFMA R144, R36, R148, R163
  FFMA R162, R37, R148, R162
  FFMA R165, R38, R148, R165
  FFMA R164, R39, R148, R164
  FFMA R3,   R41, R152, R166
**LDS.128 R36, [R15+0x8400]**    &#8592; single LDS in the middle of the cluster
  FFMA R168, R41, R153, R168
  FFMA R167, R41, R154, R167
  FFMA R40,  R41, R155, R40
  FFMA R5,   R45, R152, R172
  ... (continues with one LDS every ~5 FFMAs)</code></pre><p>These are two valid SGEMM schedules. Both feed the FMA pipe. They differ in how the warps stagger.</p><h2>Per-warp stall reasons</h2><p>I captured the per-warp stall reasons from <code>ncu</code>. The script is <code>scripts/diagnostics/ncu_stall_compare.sh</code>; it builds a small probe binary for both kernels at the same shape and extracts the <code>smsp__average_warps_issue_stalled_*_per_issue_active</code> metrics. Each value is &#8220;warps stalled on this reason per issue-active cycle&#8221; &#8212; sums can exceed 100% when multiple warps stall in parallel.</p><p>For 5090 single-mode 8192:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!GR1A!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F959e98fe-c92e-4dab-ad78-270285ef050b_1400x929.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!GR1A!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F959e98fe-c92e-4dab-ad78-270285ef050b_1400x929.png 424w, https://substackcdn.com/image/fetch/$s_!GR1A!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F959e98fe-c92e-4dab-ad78-270285ef050b_1400x929.png 848w, https://substackcdn.com/image/fetch/$s_!GR1A!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F959e98fe-c92e-4dab-ad78-270285ef050b_1400x929.png 1272w, https://substackcdn.com/image/fetch/$s_!GR1A!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F959e98fe-c92e-4dab-ad78-270285ef050b_1400x929.png 
1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!GR1A!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F959e98fe-c92e-4dab-ad78-270285ef050b_1400x929.png" width="1400" height="929" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/959e98fe-c92e-4dab-ad78-270285ef050b_1400x929.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:929,&quot;width&quot;:1400,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!GR1A!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F959e98fe-c92e-4dab-ad78-270285ef050b_1400x929.png 424w, https://substackcdn.com/image/fetch/$s_!GR1A!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F959e98fe-c92e-4dab-ad78-270285ef050b_1400x929.png 848w, https://substackcdn.com/image/fetch/$s_!GR1A!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F959e98fe-c92e-4dab-ad78-270285ef050b_1400x929.png 1272w, https://substackcdn.com/image/fetch/$s_!GR1A!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F959e98fe-c92e-4dab-ad78-270285ef050b_1400x929.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" 
type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>There are two large deltas:</p><p><code>dispatch_stall = 44 % vs 22 %.</code> Dispatch stall happens when the warp scheduler has picked a ready warp, but the dispatch unit can&#8217;t accept another instruction this cycle &#8212; typically because some other warp&#8217;s in-flight FFMA has the FMA pipe back-pressured.<strong> My kernel has twice as many dispatch stalls as cuBLAS does, and that&#8217;s the dominant cause of the FMA pipe utilization gap.</strong></p><p><code>short_scoreboard = 20 % vs 12 %.</code> Short scoreboard stalls depend on short-latency operations (LDS reads), during which the scheduler waits for the scoreboard bit to clear. 
Even though my static LDS-to-consumer distance is 40 FFMAs (more than enough to hide the latency in isolation), the consumers are tightly interleaved into a long FFMA run, so the scoreboard&#8217;s <em>temporal</em> hiding is shorter than the static count suggests.</p><p>Both deltas point to the same root cause: warp-phase alignment. With my spread-LDS pattern, all warps are in roughly the same execution phase at the same time &#8212; they all want to execute FFMA instructions on the same cycles. With cuBLAS&#8217;s clustered-LDS pattern, warps stagger naturally: while warp A is draining a long FFMA run, warp B is in its LDS cluster, warp C is finishing a previous FFMA cluster. The warp scheduler always has a <em>different</em> warp to switch to, rather than contending for the FMA pipe.</p><p>The performance gap between my kernel and cuBLAS is caused by <em>the temporal distribution of LDS instructions across warps</em>, which determines whether warps stagger or align and, in turn, how much dispatch-stall pressure piles up on the FMA pipe.</p><h2>A note on the <code>mbarrier.try_wait</code> spin loop</h2><p>A common concern with TMA double-buffer schemes: don&#8217;t threads waste cycles spinning in <code>mbarrier.try_wait</code>? Empirically, no. The TMA transfer for the 45 KB double-buffer slot (<code>bytes=45056</code> in the kernel source) completes well within the FFMA compute phase, so the try-wait spin loop almost always exits on its first read. The <code>dispatch_stall</code> and <code>short_scoreboard</code> numbers above don&#8217;t include any meaningful contribution from try-wait spinning &#8212; both <code>wait = 3.9%</code> and <code>barrier = 7.3%</code> are small compared to the dispatch-side gap.</p><h2>Where is the Limit?</h2><p>The constant 5&#8211;11% gap below cuBLAS&#8217;s best-available kernel (when the dispatcher does its job) shows up on every architecture I tested. I systematically tried to close it. 
None of these worked:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!hp5t!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd2258c72-a21b-4d88-98d6-7995fb03d967_1400x855.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!hp5t!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd2258c72-a21b-4d88-98d6-7995fb03d967_1400x855.png 424w, https://substackcdn.com/image/fetch/$s_!hp5t!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd2258c72-a21b-4d88-98d6-7995fb03d967_1400x855.png 848w, https://substackcdn.com/image/fetch/$s_!hp5t!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd2258c72-a21b-4d88-98d6-7995fb03d967_1400x855.png 1272w, https://substackcdn.com/image/fetch/$s_!hp5t!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd2258c72-a21b-4d88-98d6-7995fb03d967_1400x855.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!hp5t!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd2258c72-a21b-4d88-98d6-7995fb03d967_1400x855.png" width="1400" height="855" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d2258c72-a21b-4d88-98d6-7995fb03d967_1400x855.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:855,&quot;width&quot;:1400,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!hp5t!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd2258c72-a21b-4d88-98d6-7995fb03d967_1400x855.png 424w, https://substackcdn.com/image/fetch/$s_!hp5t!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd2258c72-a21b-4d88-98d6-7995fb03d967_1400x855.png 848w, https://substackcdn.com/image/fetch/$s_!hp5t!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd2258c72-a21b-4d88-98d6-7995fb03d967_1400x855.png 1272w, https://substackcdn.com/image/fetch/$s_!hp5t!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd2258c72-a21b-4d88-98d6-7995fb03d967_1400x855.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>The irreducible gap lies in ptxas&#8217;s instruction-scheduling heuristics &#8212; specifically, in how it distributes LDS instructions across the FFMA cluster. As measured in the SASS Deep Dive section above, my generated kernel ends up with <strong>dispatch_stall = 44%</strong> versus cuBLAS&#8217;s <strong>22%</strong>, because my LDS pattern is spread evenly through the FFMA cluster (median spacing 5 FFMAs) while cuBLAS&#8217;s CUTLASS template clusters them into back-to-back groups (median spacing 0).</p><h2>A Note on FP16 / BF16 &#8212; the Mainstream Path</h2><p>The mainstream compute path on modern GPUs is FP16/BF16, possibly with FP32 accumulators for training. That&#8217;s where NVIDIA puts the optimization effort. 
Pure FP32 SGEMM is no longer the priority &#8212; even though it remains important for scientific computing, numerical simulation, and other use cases that cannot tolerate reduced precision.</p><p>I confirmed that the FP16 path on the 5090 <em>is</em> tensor-core accelerated: an <code>ncu</code> profile of <code>cublasHgemm</code> and <code>cublasGemmEx</code> at 4096&#215;4096 dispatches <code>cutlass_80_tensorop_h16816gemm_...</code> and <code>cutlass_80_tensorop_f16_s16816gemm_...</code> respectively &#8212; both <code>tensorop</code> (HMMA <code>16&#215;8&#215;16</code>), with the SIMT FFMA pipe sitting at &lt;0.2% utilization. So the FP16/BF16 effort is real and visible. The catch: those kernels are <em>also</em> <code>cutlass_80_*</code>-prefixed Ampere forward-ports &#8212; Blackwell&#8217;s incremental tensor-core tuning effort goes into the new low-precision formats (FP8, MXFP4) used by frontier-model training, not into the basic FP16 path. The Ampere kernel reuse on sm_120 isn&#8217;t unique to the unloved FP32 SIMT path; it&#8217;s the dominant pattern across most of cuBLAS&#8217;s compute paths on Blackwell.</p><h2>Benchmark Methodology and Other Results</h2><p>All measurements: 30 iterations (single) or 20 iterations (batched), interleaved with cuBLAS for thermal fairness; the first iteration of every loop is discarded as warmup; median reported. Compiled with <code>nvcc -O3 --fmad=true</code> (no <code>--use_fast_math</code> &#8212; FFMA fusion is preserved, but FTZ/relaxed-div are off, so the comparison is IEEE-clean FP32). RTX 5090 (170 SMs, 32 GB GDDR7), driver 595.58.03, CUDA 13.2.51, cuBLAS 13.3.0. The reproducible recipe is in <code>recipes/sgemm_cublas_vs_tma/</code> &#8212; <code>deplodock bench recipes/sgemm_cublas_vs_tma --local</code>.</p><p>The 256/512 sizes are removed from the single-matmul table. At sub-millisecond per-call durations, the GPU&#8217;s boost clock never engages, and the SM clock bounces around for the duration of the run. 
The bench runner samples <code>nvidia-smi --query-gpu=clocks.sm</code> around each measurement; one full single-batch sweep looks like:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!MZ49!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff00772d9-114d-473d-a5eb-6290874ec5e0_1400x652.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!MZ49!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff00772d9-114d-473d-a5eb-6290874ec5e0_1400x652.png 424w, https://substackcdn.com/image/fetch/$s_!MZ49!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff00772d9-114d-473d-a5eb-6290874ec5e0_1400x652.png 848w, https://substackcdn.com/image/fetch/$s_!MZ49!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff00772d9-114d-473d-a5eb-6290874ec5e0_1400x652.png 1272w, https://substackcdn.com/image/fetch/$s_!MZ49!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff00772d9-114d-473d-a5eb-6290874ec5e0_1400x652.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!MZ49!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff00772d9-114d-473d-a5eb-6290874ec5e0_1400x652.png" width="1400" height="652" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f00772d9-114d-473d-a5eb-6290874ec5e0_1400x652.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:652,&quot;width&quot;:1400,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!MZ49!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff00772d9-114d-473d-a5eb-6290874ec5e0_1400x652.png 424w, https://substackcdn.com/image/fetch/$s_!MZ49!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff00772d9-114d-473d-a5eb-6290874ec5e0_1400x652.png 848w, https://substackcdn.com/image/fetch/$s_!MZ49!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff00772d9-114d-473d-a5eb-6290874ec5e0_1400x652.png 1272w, https://substackcdn.com/image/fetch/$s_!MZ49!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff00772d9-114d-473d-a5eb-6290874ec5e0_1400x652.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>The clock is not locked to emulate real-world performance. Since my kernel and cuBLAS are interleaved iteration-by-iteration, the <strong>ratio</strong> stays meaningful at whatever clock the governor picks at that instant.</p><p>The headline tables at the top of this article show the full RTX 5090 sweep; the rest of this section covers other hardware.</p><h2>RTX PRO 6000 Blackwell (Max-Q)</h2><p>Same architecture as the 5090 (sm_120), but 188 SMs vs 170, and a lower power budget. 
Provisioned on CloudRift with the same toolchain (driver 595.58.03 / CUDA 13.2.51), so the comparison is apples-to-apples.</p><p><strong>Pro 6000 single matmul:</strong></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!tDbC!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F198425dc-a43e-4252-88a6-2fce0f8575cc_1400x491.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!tDbC!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F198425dc-a43e-4252-88a6-2fce0f8575cc_1400x491.png 424w, https://substackcdn.com/image/fetch/$s_!tDbC!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F198425dc-a43e-4252-88a6-2fce0f8575cc_1400x491.png 848w, https://substackcdn.com/image/fetch/$s_!tDbC!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F198425dc-a43e-4252-88a6-2fce0f8575cc_1400x491.png 1272w, https://substackcdn.com/image/fetch/$s_!tDbC!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F198425dc-a43e-4252-88a6-2fce0f8575cc_1400x491.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!tDbC!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F198425dc-a43e-4252-88a6-2fce0f8575cc_1400x491.png" width="1400" height="491" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/198425dc-a43e-4252-88a6-2fce0f8575cc_1400x491.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:491,&quot;width&quot;:1400,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!tDbC!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F198425dc-a43e-4252-88a6-2fce0f8575cc_1400x491.png 424w, https://substackcdn.com/image/fetch/$s_!tDbC!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F198425dc-a43e-4252-88a6-2fce0f8575cc_1400x491.png 848w, https://substackcdn.com/image/fetch/$s_!tDbC!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F198425dc-a43e-4252-88a6-2fce0f8575cc_1400x491.png 1272w, https://substackcdn.com/image/fetch/$s_!tDbC!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F198425dc-a43e-4252-88a6-2fce0f8575cc_1400x491.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><strong>Pro 6000 batched matmul</strong> &#8212; note that B=4/8/16 land at ~93&#8211;95% (not 150&#8211;170% like the 5090) because the Pro 6000 dispatcher actually escalates correctly:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!vL6N!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9667bc22-0305-4208-9367-748b96e7562a_1400x583.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!vL6N!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9667bc22-0305-4208-9367-748b96e7562a_1400x583.png 424w, 
https://substackcdn.com/image/fetch/$s_!vL6N!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9667bc22-0305-4208-9367-748b96e7562a_1400x583.png 848w, https://substackcdn.com/image/fetch/$s_!vL6N!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9667bc22-0305-4208-9367-748b96e7562a_1400x583.png 1272w, https://substackcdn.com/image/fetch/$s_!vL6N!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9667bc22-0305-4208-9367-748b96e7562a_1400x583.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!vL6N!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9667bc22-0305-4208-9367-748b96e7562a_1400x583.png" width="1400" height="583" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/9667bc22-0305-4208-9367-748b96e7562a_1400x583.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:583,&quot;width&quot;:1400,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!vL6N!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9667bc22-0305-4208-9367-748b96e7562a_1400x583.png 424w, 
https://substackcdn.com/image/fetch/$s_!vL6N!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9667bc22-0305-4208-9367-748b96e7562a_1400x583.png 848w, https://substackcdn.com/image/fetch/$s_!vL6N!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9667bc22-0305-4208-9367-748b96e7562a_1400x583.png 1272w, https://substackcdn.com/image/fetch/$s_!vL6N!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9667bc22-0305-4208-9367-748b96e7562a_1400x583.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line 
x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>The 512/1024/2048 cells, where I still beat cuBLAS, are the Pro 6000&#8217;s small-size MAGMA fallback bug from the dispatcher table. The 4K+ cells are the constant 5&#8211;7% generator gap.</p><h2>H200</h2><p>TM=8 is optimal at every size on Hopper &#8212; larger thread tiles regress, likely because the first-gen Hopper TMA has more issue pressure than Blackwell&#8217;s refined unit. The H200 is the cleanest control case: when cuBLAS dispatches the right kernel (and on Hopper it always does &#8212; see the dispatcher table), my generator loses by the same constant 8&#8211;11% as everywhere else.</p><p><strong>H200 single matmul:</strong></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ITRY!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8bb45112-ec98-4f88-8ada-d1cddfdf0b7d_1400x501.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ITRY!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8bb45112-ec98-4f88-8ada-d1cddfdf0b7d_1400x501.png 424w, https://substackcdn.com/image/fetch/$s_!ITRY!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8bb45112-ec98-4f88-8ada-d1cddfdf0b7d_1400x501.png 848w, https://substackcdn.com/image/fetch/$s_!ITRY!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8bb45112-ec98-4f88-8ada-d1cddfdf0b7d_1400x501.png 1272w, 
https://substackcdn.com/image/fetch/$s_!ITRY!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8bb45112-ec98-4f88-8ada-d1cddfdf0b7d_1400x501.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!ITRY!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8bb45112-ec98-4f88-8ada-d1cddfdf0b7d_1400x501.png" width="1400" height="501" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8bb45112-ec98-4f88-8ada-d1cddfdf0b7d_1400x501.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:501,&quot;width&quot;:1400,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!ITRY!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8bb45112-ec98-4f88-8ada-d1cddfdf0b7d_1400x501.png 424w, https://substackcdn.com/image/fetch/$s_!ITRY!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8bb45112-ec98-4f88-8ada-d1cddfdf0b7d_1400x501.png 848w, https://substackcdn.com/image/fetch/$s_!ITRY!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8bb45112-ec98-4f88-8ada-d1cddfdf0b7d_1400x501.png 1272w, 
https://substackcdn.com/image/fetch/$s_!ITRY!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8bb45112-ec98-4f88-8ada-d1cddfdf0b7d_1400x501.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><strong>H200 batched matmul:</strong></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!1Vil!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F685500f8-2830-453c-b10c-a2d31e502f67_1400x580.png" 
data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!1Vil!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F685500f8-2830-453c-b10c-a2d31e502f67_1400x580.png 424w, https://substackcdn.com/image/fetch/$s_!1Vil!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F685500f8-2830-453c-b10c-a2d31e502f67_1400x580.png 848w, https://substackcdn.com/image/fetch/$s_!1Vil!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F685500f8-2830-453c-b10c-a2d31e502f67_1400x580.png 1272w, https://substackcdn.com/image/fetch/$s_!1Vil!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F685500f8-2830-453c-b10c-a2d31e502f67_1400x580.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!1Vil!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F685500f8-2830-453c-b10c-a2d31e502f67_1400x580.png" width="1400" height="580" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/685500f8-2830-453c-b10c-a2d31e502f67_1400x580.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:580,&quot;width&quot;:1400,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" 
srcset="https://substackcdn.com/image/fetch/$s_!1Vil!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F685500f8-2830-453c-b10c-a2d31e502f67_1400x580.png 424w, https://substackcdn.com/image/fetch/$s_!1Vil!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F685500f8-2830-453c-b10c-a2d31e502f67_1400x580.png 848w, https://substackcdn.com/image/fetch/$s_!1Vil!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F685500f8-2830-453c-b10c-a2d31e502f67_1400x580.png 1272w, https://substackcdn.com/image/fetch/$s_!1Vil!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F685500f8-2830-453c-b10c-a2d31e502f67_1400x580.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" 
stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>cuBLAS on Hopper hits ~50 TFLOPS across the size range (the H200&#8217;s HBM3e bandwidth easily feeds the SIMT cores), with FMA pipe utilization climbing from 69% at 512 to 82% at 4K+. My TMA template hits ~47 TFLOPS at 71% FMA pipe util across the range &#8212; uniform 8&#8211;11% gap to the well-dispatched cuBLAS baseline.</p><h2>Related Bug Reports</h2><p>After finding all of this, I went looking for prior reports of cuBLAS picking suboptimal kernels for SGEMM on consumer NVIDIA GPUs. <strong>It turns out this isn&#8217;t a new bug type.</strong> Substantially similar reports have surfaced at least twice before, both times reported on NVIDIA&#8217;s developer forums and acknowledged by NVIDIA engineers.</p><p><em><strong><a href="https://forums.developer.nvidia.com/t/is-it-correct-that-my-pascal-card-is-calling-maxwell-gemm-kernels-through-cublas-and-if-so-why-is-cublas-unusably-slow-for-me/63970">Pascal card is calling Maxwell kernels through cublas. It is unusably slow.</a></strong></em> (NVIDIA Developer Forums, 2018) &#8212; closest match. A user with a GTX 1080 Ti (Pascal sm_61) reported <code>cublasSgemmStridedBatched</code> running <code>maxwell_sgemm_128_64_nn</code> kernels, <strong>about 2&#215; slower</strong> than a na&#239;ve hand-rolled kernel for batched workloads.</p><p><em><strong><a href="https://forums.developer.nvidia.com/t/cublas-sgemm-is-slow/51077">cuBLAS sgemm is slow</a></strong></em> (NVIDIA Developer Forums, 2017). 
Different shape &#8212; extreme aspect ratio, <code>2&#215;23880 &#215; 23880&#215;32</code> &#8212; but the same root cause: cuBLAS&#8217;s dispatcher picked a tiny grid <code>[1,1,1]</code> &#215; block <code>[8,8,1]</code>, leaving the GPU idle.</p><p><strong><a href="https://github.com/vllm-project/vllm/issues/35467">vLLM #35467</a></strong> (2025). vLLM developers report that on certain matmul shapes, <em>&#8220;cuBLAS auto-selects the 7th-best tile (128x136) instead of optimal options, with the heuristic leaving 16% performance on the table.&#8221;</em></p><p><strong>Simon Boehm&#8217;s <a href="https://siboehm.com/articles/22/CUDA-MMM">CUDA matmul worklog</a></strong> (cited indirectly in many places). Notes the structural fact that <em>&#8220;cuBLAS contains not one single implementation of SGEMM, but hundreds of them, and at runtime, based on the dimensions, cuBLAS will pick which kernel to run&#8221;</em> &#8212; and <em>&#8220;cuBLAS may set a too small grid size, which can be identified through profiling tools.&#8221;</em> Published acknowledgment that the heuristic has known holes.</p><h2>Conclusion</h2><p>The headline TMA win on the 5090 turned out to be a <strong>cuBLAS dispatcher bug</strong>, not a hardware advantage &#8212; and, as the <em>Related bug reports</em> section above shows, it&#8217;s the latest instance of a recurring pattern that NVIDIA has acknowledged on its own developer forums since at least 2018. NVIDIA ships a release version of cuBLAS that, on the most popular consumer Blackwell SKU, selects an FP32 SGEMM kernel that runs at ~40% of peak FMA pipe utilization across the entire range of batched workloads. The exact same library binary escalates correctly to a 73% kernel on the RTX PRO 6000 and to an 82% kernel on the H200. 
It&#8217;s not subtle: the 5090 path picks the wrong kernel 100% of the time across the entire 32&#215; workload range I measured, and the same library has better kernels sitting right there.</p><p>I only verified this on the RTX 5090. The dispatch logic is clearly arch-specific, so it would not surprise me if other consumer RTX cards (5070, 5080, 4090, &#8230;) hit similar bugs in their respective dispatch paths. If you have one of those cards, the diagnostic script that surfaced it lives in the repo at <code>scripts/diagnostics/ncu_compare.sh</code>. Three minutes of <code>ncu</code> will tell you whether your batched FP32 workloads are leaving 60% on the floor.</p><p><strong>Don&#8217;t blindly trust cuBLAS on new architectures and/or RTX cards</strong>. Check the kernel name in <code>ncu</code>. If you see something like <code>cutlass_80_simt_sgemm_128x32_8x5</code> running for a workload that should clearly be on a 256&#215;128 kernel, you&#8217;re hitting the bug.</p><p>Separately, the TMA + compile-time specialization technique is worth knowing for its own sake. It produces a fully pipelined SGEMM kernel template in ~300 lines of generated C that hits <strong>~93% of CUTLASS&#8217;s hand-tuned peak FMA pipe utilization</strong> on every Blackwell SKU I tested. 
TMA might be useful in many other workloads that leverage conventional CUDA cores.</p><h2>Links</h2><ul><li><p>Source code: <a href="https://github.com/cloudrift-ai/deplodock/tree/main/deplodock/compiler/cuda">deplodock/compiler/cuda/</a> &#8212; the compiler that generates the kernels</p></li><li><p>Benchmark script: <a href="https://github.com/cloudrift-ai/deplodock/tree/main/scripts/bench_matmul.py">scripts/bench_matmul.py</a> &#8212; run your own benchmarks</p></li><li><p>Reproducible recipe: <a href="https://github.com/cloudrift-ai/deplodock/tree/main/recipes/sgemm_cublas_vs_tma">recipes/sgemm_cublas_vs_tma/</a> &#8212; <code>deplodock bench recipes/sgemm_cublas_vs_tma --local</code></p></li><li><p>Full per-arch dispatch sweep with raw ncu output: <a href="https://github.com/cloudrift-ai/deplodock/tree/main/recipes/sgemm_cublas_vs_tma/ncu/batched_dispatch_finding.md">recipes/sgemm_cublas_vs_tma/ncu/batched_dispatch_finding.md</a></p></li><li><p>Diagnostic scripts: <a href="https://github.com/cloudrift-ai/deplodock/tree/main/scripts/diagnostics">scripts/diagnostics/</a> &#8212; <code>dump_cublas_kernels.sh</code>, <code>ncu_compare.sh</code>, <code>cublas_loop_vs_strided.cu</code>, <code>sass_analysis.py</code></p></li><li><p>Hardware: RTX 5090 (GB202, sm_120, 32 GB GDDR7, 170 SMs); RTX PRO 6000 Blackwell Max-Q (sm_120, 188 SMs); NVIDIA H200 (GH100, sm_90, 141 GB HBM3e)</p></li><li><p>Software: CUDA 13.2.51, nvcc, cuBLAS 13.3.0, Ubuntu 24.04</p></li></ul><h2>References</h2><ol><li><p>NVIDIA CUDA Programming Guide &#8212; <a href="https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#tensor-memory-access">Tensor Memory Access (TMA)</a></p></li><li><p>NVIDIA CUDA Programming Guide &#8212; <a href="https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#asynchronous-data-copies">Asynchronous Data Copies</a></p></li><li><p>NVIDIA CUTLASS &#8212; <a href="https://github.com/NVIDIA/cutlass">SIMT SGEMM reference</a></p></li><li><p>Simon Boehm 
&#8212; <a href="https://siboehm.com/articles/22/CUDA-MMM">How to Optimize a CUDA Matmul Kernel for cuBLAS-like Performance</a></p></li><li><p>Modular &#8212; <a href="https://www.modular.com/blog/matrix-multiplication-on-nvidias-blackwell-part-2-using-hardware-features-to-optimize-matmul">Matrix Multiplication on NVIDIA&#8217;s Blackwell</a></p></li><li><p>Lei Mao &#8212; <a href="https://leimao.github.io/blog/CUDA-Shared-Memory-Swizzling/">CUDA Shared Memory Swizzling</a></p></li><li><p>CuAsmRL &#8212; <a href="https://arxiv.org/abs/2501.08071">SASS Optimization via Reinforcement Learning</a></p></li><li><p>Colfax Research &#8212; <a href="https://research.colfax-intl.com/cutlass-tutorial-design-of-a-gemm-kernel/">Efficient GEMM with Pipelining</a></p></li></ol><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://kernelspace.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading Kernel Space! 
Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[GPU Virtualization with VFIO, NVAI Enterprise, and AMD SR-IOV]]></title><description><![CDATA[There is no single GPU virtualization stack that works across all GPUs.]]></description><link>https://kernelspace.substack.com/p/gpu-virtualization-with-vfio-nvai</link><guid isPermaLink="false">https://kernelspace.substack.com/p/gpu-virtualization-with-vfio-nvai</guid><dc:creator><![CDATA[Dmitry Trifonov]]></dc:creator><pubDate>Thu, 09 Apr 2026 15:05:13 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!F3S8!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe9c5b294-a841-43e9-9e16-fb79a6c97a8c_1200x670.webp" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!F3S8!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe9c5b294-a841-43e9-9e16-fb79a6c97a8c_1200x670.webp" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!F3S8!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe9c5b294-a841-43e9-9e16-fb79a6c97a8c_1200x670.webp 424w, 
https://substackcdn.com/image/fetch/$s_!F3S8!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe9c5b294-a841-43e9-9e16-fb79a6c97a8c_1200x670.webp 848w, https://substackcdn.com/image/fetch/$s_!F3S8!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe9c5b294-a841-43e9-9e16-fb79a6c97a8c_1200x670.webp 1272w, https://substackcdn.com/image/fetch/$s_!F3S8!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe9c5b294-a841-43e9-9e16-fb79a6c97a8c_1200x670.webp 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!F3S8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe9c5b294-a841-43e9-9e16-fb79a6c97a8c_1200x670.webp" width="1200" height="670" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e9c5b294-a841-43e9-9e16-fb79a6c97a8c_1200x670.webp&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:670,&quot;width&quot;:1200,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:82384,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/webp&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://kernelspace.substack.com/i/193697767?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe9c5b294-a841-43e9-9e16-fb79a6c97a8c_1200x670.webp&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!F3S8!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe9c5b294-a841-43e9-9e16-fb79a6c97a8c_1200x670.webp 424w, https://substackcdn.com/image/fetch/$s_!F3S8!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe9c5b294-a841-43e9-9e16-fb79a6c97a8c_1200x670.webp 848w, https://substackcdn.com/image/fetch/$s_!F3S8!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe9c5b294-a841-43e9-9e16-fb79a6c97a8c_1200x670.webp 1272w, https://substackcdn.com/image/fetch/$s_!F3S8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe9c5b294-a841-43e9-9e16-fb79a6c97a8c_1200x670.webp 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" 
stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>There is no single GPU virtualization stack that works across all GPUs. Datacenter and consumer GPUs require different approaches. There are differences between the virtualization of NVIDIA and AMD cards. Even within NVIDIA&#8217;s own enterprise AI ecosystem, there are multiple virtualization paths. If you&#8217;re building a cloud GPU platform that supports hardware from multiple vendors, you&#8217;re going to end up implementing and maintaining several distinct virtualization strategies.</p><p>At <a href="https://cloudrift.ai/">CloudRift</a>, we support all modern virtualization paths: VFIO passthrough for whole-GPU allocation, NVIDIA MIG with AI Enterprise vGPU for fractional NVIDIA GPUs, and AMD SR-IOV for AMD Instinct cards. This article explains the mechanics behind each one &#8212; the host-side driver lifecycle, the domain XML configuration, and the trade-offs.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://kernelspace.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading Kernel Space! 
Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><p>I&#8217;ll assume you&#8217;re familiar with basic Linux virtualization concepts (KVM, QEMU, libvirt). If you need a refresher on host-side IOMMU and VFIO setup, see our earlier guide: <a href="https://medium.com/itnext/host-setup-for-qemu-kvm-gpu-passthrough-with-vfio-on-linux-c65bacf2d96b">Host Setup for QEMU/KVM GPU Passthrough with VFIO on Linux</a>.</p><h2>The foundation: QEMU/KVM with libvirt</h2><p>Regardless of GPU vendor, all our VMs run on the same hypervisor stack: QEMU/KVM managed through libvirt. A few configuration choices matter a lot when GPUs are involved.</p><h2>Machine type and PCI topology</h2><p>We use the <code>q35</code> chipset with the <code>pc-q35</code> machine type. Unlike the older <code>i440fx</code>, q35 provides native PCIe support, which is essential for GPU passthrough &#8212; GPUs are PCIe devices that expect a PCIe bus topology.</p><pre><code>&lt;type arch="x86_64" machine="q35"&gt;hvm&lt;/type&gt;</code></pre><p>Everything described here works with any Linux distribution, but most of our host providers run Ubuntu, and that&#8217;s what we test against. The examples below assume Ubuntu 22.04 or 24.04.</p><p>Use a modern QEMU and OVMF. The versions shipped with Ubuntu 22.04 LTS (QEMU 6.2, OVMF 2022.02) are too old for reliable GPU passthrough with current hardware. We target QEMU 9.0+, OVMF 2024.02, and libvirt 10.6+, installed from the Canonical Server Backports PPA and Ubuntu Noble packages, respectively. 
The specific bugs that forced the upgrades: OVMF 2022.02 hangs during boot when initializing RTX GPUs on some platforms, and QEMU 6.2 hits a <code>pci_irq_handler</code> assertion failure during GPU passthrough with AMD MI350X.</p><p>One critical setting is the PCI hole size. Modern data center GPUs (H100, B200) have BARs that can exceed 128 GB. QEMU has internal logic to estimate the PCIe hole size, but we found that it doesn&#8217;t always allocate enough space. The root cause is that PCI BAR allocation involves three parties &#8212; the host kernel (which assigns physical BAR addresses), the guest UEFI firmware (OVMF, which allocates the guest-side PCI address space), and QEMU (which maps between them). Because QEMU doesn&#8217;t have full visibility into how OVMF will lay out the address space, its built-in heuristic can underestimate the required hole, especially with multiple large GPUs.</p><p>We work around this by computing the hole size ourselves based on actual BAR sizes reported by the hardware. In the domain XML, this appears as a <code>&lt;pcihole64&gt;</code> element on the PCIe root controller:</p><pre><code>&lt;controller type="pci" model="pcie-root"&gt;
  &lt;pcihole64 unit="G"&gt;1024&lt;/pcihole64&gt;
&lt;/controller&gt;</code></pre><p>Getting this wrong results in the guest failing to map GPU BARs &#8212; you&#8217;ll see &#8220;BAR X: can&#8217;t assign mem&#8221; errors in <code>dmesg</code> inside the VM.</p><h2>CPU mode</h2><p>We run all VMs in <code>host-passthrough</code> mode with <code>migratable=on</code>:</p><pre><code>&lt;cpu mode="host-passthrough" check="none" migratable="on"/&gt;</code></pre><p>This exposes the host CPU&#8217;s full instruction set to the guest, which is important for GPU workloads that rely on specific CPU features (such as AVX-512 for preprocessing). The <code>migratable=on</code> flag filters out features that would break live migration, though in practice we rarely migrate GPU VMs since the GPU itself isn&#8217;t migratable.</p><h2>PCIe root ports</h2><p>Each GPU (or GPU group, in the case of multi-function devices with audio) gets its own PCIe root port controller. This keeps GPU IOMMU groups clean inside the guest and avoids conflicts:</p><pre><code>&lt;controller type="pci" model="pcie-root-port" index="10" id="pcie.10"&gt;
  &lt;target chassis="10" port="0xa"/&gt;
&lt;/controller&gt;</code></pre><p>On consumer GPU rigs (RTX 4090, 5090), we use a <strong>flat topology</strong>: all GPUs share the same bus, and each GPU gets a different slot. On multi-GPU servers (8&#215;H100, 8&#215;B200), we use a <strong>deep topology</strong> &#8212; each GPU sits behind its own PCIe root port on a separate bus.</p><p><strong>Flat topology</strong> (consumer GPUs &#8212; faster to allocate, smaller PCIe hole):</p><pre><code>&lt;!-- All GPUs on bus 0x00, varying slot --&gt;
&lt;hostdev mode="subsystem" type="pci" managed="no" multifunction="on"&gt;
  &lt;source&gt;&lt;address domain="0x0000" bus="0x82" slot="0x00" function="0x00"/&gt;&lt;/source&gt;
  &lt;address type="pci" bus="0x00" slot="0x10" function="0"/&gt;  &lt;!-- GPU 0 --&gt;
&lt;/hostdev&gt;
&lt;hostdev mode="subsystem" type="pci" managed="no" multifunction="on"&gt;
  &lt;source&gt;&lt;address domain="0x0000" bus="0xa2" slot="0x00" function="0x00"/&gt;&lt;/source&gt;
  &lt;address type="pci" bus="0x00" slot="0x11" function="0"/&gt;  &lt;!-- GPU 1 --&gt;
&lt;/hostdev&gt;</code></pre><p><strong>Deep topology</strong> (data center GPUs &#8212; one root port per GPU):</p><pre><code>&lt;!-- Each GPU behind its own root port, on its own bus --&gt;
&lt;controller type="pci" model="pcie-root-port" index="1" id="pcie.1"/&gt;
&lt;hostdev mode="subsystem" type="pci" managed="no"&gt;
  &lt;source&gt;&lt;address domain="0x0000" bus="0x03" slot="0x00" function="0x00"/&gt;&lt;/source&gt;
  &lt;address type="pci" bus="0x01" slot="0x00" function="0"/&gt;  &lt;!-- GPU 0, bus 1 --&gt;
&lt;/hostdev&gt;
&lt;controller type="pci" model="pcie-root-port" index="2" id="pcie.2"/&gt;
&lt;hostdev mode="subsystem" type="pci" managed="no"&gt;
  &lt;source&gt;&lt;address domain="0x0000" bus="0x13" slot="0x00" function="0x00"/&gt;&lt;/source&gt;
  &lt;address type="pci" bus="0x02" slot="0x00" function="0"/&gt;  &lt;!-- GPU 1, bus 2 --&gt;
&lt;/hostdev&gt;</code></pre><p>The flat layout is preferable when possible &#8212; it requires fewer PCIe root port controllers, reduces the PCIe hole requirement, and is faster for QEMU to set up. But data center GPUs work more reliably behind dedicated root ports.</p><h2>ROM BAR</h2><p>We always disable ROM BAR for passthrough GPUs:</p><pre><code>&lt;rom bar="off"/&gt;</code></pre><p>Without this, OVMF (the UEFI firmware) tries to load the GPU&#8217;s option ROM during boot. With newer GPU firmware, this can cause hangs or extremely slow boot times. Since we&#8217;re not using the GPU for console output (cloud VMs use serial/VNC), there&#8217;s no reason to load the ROM.</p><h2>NVIDIA: full GPU passthrough via VFIO</h2><p>This is the straightforward mode. One physical GPU goes to one VM. It&#8217;s what we use for consumer GPUs (RTX 4090, RTX 5090, RTX PRO 6000) and for data center GPUs when the tenant needs the whole card.</p><p>Fair warning: NVIDIA&#8217;s documentation around GPU virtualization is a maze. There&#8217;s VFIO passthrough, mdev (mediated devices), vGPU, SR-IOV, MIG &#8212; and the terminology overlaps in confusing ways across different driver generations and product lines. If you&#8217;re making changes to your virtualization stack, expect to speak with NVIDIA&#8217;s support team. The docs alone won&#8217;t get you there.</p><h2>Using host and guest GPUs simultaneously</h2><p>One scenario worth mentioning: what if you want to use some GPUs on the host (e.g., for LLM inference) while simultaneously passing others through to VMs? This is possible, but only with the open-source NVIDIA kernel driver stack. You can leave the host GPUs bound to the open <code>nvidia</code> driver and bind the passthrough GPUs to <code>vfio-pci</code> independently. NVIDIA AI Enterprise doesn&#8217;t support this mixed mode, at least according to the response to our support inquiry in February 2026.</p><p>The tricky part is the driver lifecycle on the host. 
NVIDIA&#8217;s kernel driver is notoriously sticky, and you have to fully release the GPU before VFIO can claim it. Here&#8217;s the sequence:</p><p><strong>Preparation (host &#8594; VFIO):</strong></p><ol><li><p>Load the <code>vfio-pci</code> kernel module</p></li><li><p>Stop the processes holding the GPU, like the DCGM exporter (if running &#8212; it holds a handle on the GPU)</p></li><li><p>Disable persistence mode and stop <code>nvidia-persistenced</code></p></li><li><p>Unload NVIDIA kernel modules (<code>nvidia-uvm</code>, <code>nvidia-drm</code>, <code>nvidia-modeset</code>, <code>nvidia</code>)</p></li><li><p>Unbind each GPU (and its audio function) from the <code>nvidia</code> driver</p></li><li><p>Bind each device to <code>vfio-pci</code></p></li><li><p>Wait for VFIO initialization (the device files under <code>/dev/vfio/</code> need a moment to appear)</p></li></ol><p>If any step fails &#8212; especially the module unload &#8212; something is still holding a reference to the GPU. Common culprits: an orphaned <code>nvidia-smi</code> process, a monitoring daemon, or a zombie compute process.</p><p><strong>Return (VFIO &#8594; host):</strong></p><p>When the VM shuts down, we reverse the process:</p><ol><li><p>Rebind the GPU to the <code>nvidia</code> driver</p></li><li><p>Rebind the audio function to <code>snd_hda_intel</code></p></li><li><p>Start <code>nvidia-persistenced</code></p></li><li><p>Re-enable persistence mode</p></li><li><p>Verify CUDA readiness with a quick device query. After rebinding, the GPU may appear in <code>nvidia-smi</code> but still fail when accessed from a Docker container (you will get &#8220;no CUDA-capable device&#8221;). 
Running a short CUDA initialization on the host &#8220;warms up&#8221; the device and ensures the driver state is fully consistent.</p></li></ol><h2>Domain XML for VFIO passthrough</h2><p>The GPU appears as a <code>&lt;hostdev&gt;</code> element with <code>managed="no"</code> &#8212; we handle driver binding ourselves rather than letting libvirt do it, because the multi-step NVIDIA teardown requires more control than libvirt&#8217;s managed mode provides:</p><pre><code>&lt;hostdev mode="subsystem" type="pci" managed="no" multifunction="on"&gt;
  &lt;driver name="vfio"/&gt;
  &lt;source&gt;
    &lt;address domain="0x0000" bus="0xc2" slot="0x00" function="0x00"/&gt;
  &lt;/source&gt;
  &lt;address type="pci" domain="0x0000" bus="0x00" slot="0x13" function="0"/&gt;
  &lt;rom bar="off"/&gt;
&lt;/hostdev&gt;</code></pre><p>The <code>multifunction="on"</code> attribute is relevant for consumer GPUs (RTX series), which have a companion HDA audio device at function 0x1. Both functions need to be in the same multifunction group for the guest to enumerate them correctly. Data center GPUs (H100, B200) don&#8217;t have an audio function, so this attribute isn&#8217;t needed for them.</p><h2>NVIDIA: fractional GPUs with MIG and vGPU</h2><p>NVIDIA&#8217;s Multi-Instance GPU (MIG) technology lets you partition a single GPU into isolated instances, each with its own compute units, memory, and memory bandwidth.</p><h2>How MIG works</h2><p>MIG is available on data center GPUs (A100, H100, H200, B200) and creates hardware-level partitions. Unlike time-sharing (MPS), MIG instances receive dedicated compute slices.</p><p>Each GPU supports a set of profiles. For example, an H100 with 80 GB HBM3 memory can be split into:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!AI-H!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc4a8feaf-7428-4fdc-adf3-4b199b601e35_1400x484.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!AI-H!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc4a8feaf-7428-4fdc-adf3-4b199b601e35_1400x484.png 424w, https://substackcdn.com/image/fetch/$s_!AI-H!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc4a8feaf-7428-4fdc-adf3-4b199b601e35_1400x484.png 848w, 
https://substackcdn.com/image/fetch/$s_!AI-H!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc4a8feaf-7428-4fdc-adf3-4b199b601e35_1400x484.png 1272w, https://substackcdn.com/image/fetch/$s_!AI-H!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc4a8feaf-7428-4fdc-adf3-4b199b601e35_1400x484.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!AI-H!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc4a8feaf-7428-4fdc-adf3-4b199b601e35_1400x484.png" width="1400" height="484" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c4a8feaf-7428-4fdc-adf3-4b199b601e35_1400x484.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:484,&quot;width&quot;:1400,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!AI-H!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc4a8feaf-7428-4fdc-adf3-4b199b601e35_1400x484.png 424w, https://substackcdn.com/image/fetch/$s_!AI-H!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc4a8feaf-7428-4fdc-adf3-4b199b601e35_1400x484.png 848w, 
https://substackcdn.com/image/fetch/$s_!AI-H!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc4a8feaf-7428-4fdc-adf3-4b199b601e35_1400x484.png 1272w, https://substackcdn.com/image/fetch/$s_!AI-H!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc4a8feaf-7428-4fdc-adf3-4b199b601e35_1400x484.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>The <code>7g.80gb</code> profile gives you the entire GPU but through the MIG/vGPU pathway instead of VFIO passthrough. 
We use this when the node is configured for NVIDIA AI Enterprise licensing and all VMs need to go through the vGPU stack, even single-tenant ones.</p><p>An important constraint: you can&#8217;t arbitrarily slice GPUs. Only specific profile combinations are allowed per GPU. The H100 has 7 GPC (Graphics Processing Cluster) slices, and profiles must tile across them without overlap. Here are the <a href="https://docs.nvidia.com/datacenter/tesla/mig-user-guide/supported-mig-profiles.html">valid placement configurations</a> for the H100 80GB:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!6J57!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff696a767-67eb-43cf-991e-a0f5d9a62f41_1400x508.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!6J57!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff696a767-67eb-43cf-991e-a0f5d9a62f41_1400x508.png 424w, https://substackcdn.com/image/fetch/$s_!6J57!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff696a767-67eb-43cf-991e-a0f5d9a62f41_1400x508.png 848w, https://substackcdn.com/image/fetch/$s_!6J57!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff696a767-67eb-43cf-991e-a0f5d9a62f41_1400x508.png 1272w, https://substackcdn.com/image/fetch/$s_!6J57!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff696a767-67eb-43cf-991e-a0f5d9a62f41_1400x508.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!6J57!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff696a767-67eb-43cf-991e-a0f5d9a62f41_1400x508.png" width="1400" height="508" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f696a767-67eb-43cf-991e-a0f5d9a62f41_1400x508.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:508,&quot;width&quot;:1400,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!6J57!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff696a767-67eb-43cf-991e-a0f5d9a62f41_1400x508.png 424w, https://substackcdn.com/image/fetch/$s_!6J57!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff696a767-67eb-43cf-991e-a0f5d9a62f41_1400x508.png 848w, https://substackcdn.com/image/fetch/$s_!6J57!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff696a767-67eb-43cf-991e-a0f5d9a62f41_1400x508.png 1272w, https://substackcdn.com/image/fetch/$s_!6J57!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff696a767-67eb-43cf-991e-a0f5d9a62f41_1400x508.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft 
pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Configs 1&#8211;11 use <code>1g.10gb</code> profiles, while configs 12-19 mix in <code>1g.20gb</code> &#8212; double the memory, same compute, but consuming memory from 2 GPC slices, so fewer fit. 
There&#8217;s also a <code>1g.10gb+me</code> variant that includes media engines (NVDEC/NVJPG).</p><p>Profile names and memory sizes vary by GPU model &#8212; H200 uses <code>1g.18gb</code>, <code>2g.35gb</code>, <code>3g.71gb</code>, <code>7g.141gb</code>, etc., but the slice tiling rules are the same.</p><h2>MIG device creation</h2><p>Creating a MIG instance requires several steps:</p><ol><li><p><strong>Ensure the NVIDIA driver is loaded</strong> on the host (MIG is managed by the host driver, not VFIO)</p></li><li><p><strong>Enable MIG mode</strong> on the target GPU via <code>nvidia-smi</code></p></li><li><p><strong>Query available profiles</strong> to get the GPU Instance Profile ID</p></li><li><p><strong>Create a GPU Instance (GI)</strong> using the profile ID: <code>nvidia-smi mig -i 0 -cgi 14</code></p></li><li><p><strong>Create a Compute Instance (CI)</strong> within the GI (or use <code>-C</code> with <code>-cgi</code> to do both in one step): <code>nvidia-smi mig -i 0 -cgi 14 -C</code></p></li><li><p><strong>Retrieve the vGPU BDF</strong> &#8212; each MIG instance has an associated SR-IOV Virtual Function. You can find the VF&#8217;s BDF by reading the symlinks under <code>/sys/bus/pci/devices/&lt;GPU_BDF&gt;/virtfnN</code>, then writing the appropriate vGPU type to the VF&#8217;s <code>nvidia/current_vgpu_type</code> sysfs file. The resulting VF BDF is what gets passed to the VM via libvirt.</p></li></ol><p>The key difference from full passthrough is that the host NVIDIA driver stays loaded and manages the GPU. The MIG instance appears as a virtual device that gets passed to the guest via libvirt&#8217;s <code>&lt;hostdev&gt;</code> mechanism. This, in turn, allows using the host to access GPUs for specific purposes, for example, tracking utilization using NVIDIA DCGM.</p><h2>Domain XML for MIG vGPU passthrough</h2><p>The MIG vGPU device is attached to the VM the same way as a full GPU &#8212; via a <code>&lt;hostdev&gt;</code> element. 
The difference is the source address: instead of the physical GPU&#8217;s BDF, you use the VF BDF that was assigned to the MIG instance:</p><pre><code>&lt;hostdev mode="subsystem" type="pci" managed="no"&gt;
  &lt;driver name="vfio"/&gt;
  &lt;source&gt;
    &lt;address domain="0x0000" bus="0xdb" slot="0x00" function="0x04"/&gt;
  &lt;/source&gt;
  &lt;address type="pci" domain="0x0000" bus="0x01" slot="0x00" function="0"/&gt;
  &lt;rom bar="off"/&gt;
&lt;/hostdev&gt;</code></pre><p>Note there&#8217;s no <code>multifunction="on"</code> here &#8212; MIG vGPU devices don&#8217;t have a companion audio function. Otherwise, the XML is identical to VFIO passthrough: <code>managed="no"</code>, explicit <code>&lt;driver name="vfio"/&gt;</code>, and ROM BAR disabled.</p><h2>NVIDIA AI Enterprise guest stack</h2><p>Inside the VM, we install the NVIDIA AI Enterprise (NVAI) driver package rather than the standard driver. It can be downloaded from the NVIDIA licensing portal after purchasing a license. The NVAI guest driver is specifically designed for vGPU environments:</p><ul><li><p><strong>Proprietary DKMS modules</strong> &#8212; the open-source NVIDIA driver doesn&#8217;t support vGPU. We explicitly remove any <code>nvidia/x.y.z-open</code> DKMS modules and install the proprietary <code>nvidia/x.y.z</code> modules</p></li><li><p><code>nvidia-gridd</code> &#8212; the grid daemon that handles vGPU license checkout from a license server</p></li><li><p><code>nvidia-persistenced</code> &#8212; keeps the driver loaded even when no GPU applications are running</p></li><li><p><strong>CUDA toolkit</strong> &#8212; installed in the guest for compute workloads</p></li></ul><p>The guest image is built with these components baked in, so the VM boots with full GPU support ready. No driver installation on first boot.</p><h2>MIG cleanup</h2><p>When a VM is destroyed, we track its MIG instance IDs (stored in the domain&#8217;s libvirt metadata) and destroy the corresponding GPU and Compute instances:</p><pre><code>nvidia-smi mig -i 0 -dgi -gi 5
# Successfully destroyed GPU instance ID 5 from GPU 0</code></pre><p>This releases the GPU slices back to the pool for the next tenant.</p><h2>Licensing costs</h2><p>NVIDIA AI Enterprise licensing is not cheap. At the time of writing, it costs roughly $4,500 per GPU per year (<a href="https://docs.nvidia.com/ai-enterprise/planning-resource/licensing-guide/latest/pricing.html">source</a>). A whopping $36,000/year for the 8xGPU server. Over the H100 server&#8217;s 4-year lifetime, it will increase your cost to own by 50% (you can save some by using a 3-year or lifetime subscription, at the cost of a higher upfront). This is one reason we primarily use VFIO passthrough for GPUs and only route through the vGPU stack when fractional allocation is needed. If a tenant rents a whole GPU, VFIO passthrough avoids the licensing overhead entirely.</p><h2>AMD: SR-IOV with GIM and ROCm</h2><p>AMD takes a slightly different approach to GPU virtualization. Interestingly, NVIDIA&#8217;s vGPU stack also uses SR-IOV under the hood on supported data center GPUs &#8212; the host driver calls <code>/usr/lib/nvidia/sriov-manage</code> to enable SR-IOV Virtual Functions, and MIG instances are mapped onto those VFs. But NVIDIA layers its own abstractions on top: you create MIG instances first, then associate them with VFs and set a vGPU type via sysfs. AMD&#8217;s approach is more direct &#8212; PCIe SR-IOV (Single Root I/O Virtualization) is the primary interface, the same technology network cards have used for years. The GIM driver creates VFs, and you pass them through with <code>managed="yes"</code>. No intermediate abstraction.</p><blockquote><p><em>Of the three virtualization modes we support, AMD SR-IOV was by far the easiest to implement. Standard PCIe mechanism, managed passthrough, no nvidia-smi invocations, no licensing hurdles, all images are easy to download and install, and it supports fractional GPU allocation out of the box. 
Good job, AMD.</em></p></blockquote><h2>How AMD SR-IOV differs</h2><p>With SR-IOV, the GPU&#8217;s Physical Function (PF) stays bound to the <code>amdgpu</code> driver on the host at all times. The GIM (GPU Instance Manager) driver creates Virtual Functions (VFs), each representing a slice of the GPU. These VFs are what get passed to VMs.</p><p>This means:</p><ul><li><p><strong>No runtime driver switching</strong> &#8212; the <code>amdgpu</code> kernel module must be loaded when the host boots and stays loaded.</p></li><li><p><strong>The host driver manages partitioning</strong> &#8212; similar in spirit to MIG, but implemented at the PCIe level rather than via a vendor-specific API.</p></li><li><p><strong>VFs are standard PCI devices</strong> &#8212; they show up in <code>lspci</code>, have their own BDF addresses, and can be managed with standard PCI tooling.</p></li></ul><p>In SPX (Single-GPU Instance) mode, each Physical Function exposes exactly one Virtual Function. This is equivalent to passing the entire GPU to a single VM, but via the SR-IOV path rather than full VFIO passthrough.</p><h2>Managed passthrough</h2><p>Because AMD&#8217;s SR-IOV VFs are standard PCI virtual functions, we can let libvirt handle the VFIO binding automatically (<code>managed="yes"</code>):</p><pre><code>&lt;hostdev mode="subsystem" type="pci" managed="yes"&gt;
  &lt;source&gt;
    &lt;address domain="0x0000" bus="0x09" slot="0x01" function="0x0"/&gt;
  &lt;/source&gt;
  &lt;address type="pci" domain="0x0000" bus="0x00" slot="0x10" function="0"/&gt;
  &lt;rom bar="off"/&gt;
&lt;/hostdev&gt;</code></pre><p>Notice two differences from the NVIDIA XML:</p><ol><li><p><code>managed="yes"</code> &#8212; libvirt handles binding the VF to <code>vfio-pci</code> and unbinding it when the VM stops</p></li><li><p><strong>No </strong><code>&lt;driver name="vfio"/&gt;</code><strong> element</strong> &#8212; not needed when managed mode is active</p></li></ol><h2>ROCm guest stack</h2><p>The guest-side setup for AMD involves the ROCm (Radeon Open Compute) software stack:</p><ul><li><p><strong>HWE kernel</strong> &#8212; the default Ubuntu 24.04 kernel (6.8) is too old for ROCm&#8217;s <code>amdgpu</code> DKMS module. We install the Hardware Enablement kernel (6.11+) so DKMS can build against a compatible kernel.</p></li><li><p><code>amdgpu-install</code> &#8212; AMD&#8217;s bootstrapper that sets up signed APT repositories. All packages are verified via AMD&#8217;s GPG key.</p></li><li><p><strong>ROCm toolkit with DKMS</strong> &#8212; the compute stack and kernel module for GPU access inside the guest.</p></li><li><p><strong>AMD SMI</strong> &#8212; AMD&#8217;s equivalent of <code>nvidia-smi</code> for GPU monitoring and management.</p></li><li><p><strong>Group configuration</strong> &#8212; we ensure users are automatically added to the <code>render</code> and <code>video</code> groups so they can access the GPU device files without root.</p></li></ul><p>Like our NVIDIA images, the ROCm stack is pre-installed in the guest image.</p><h2>Fractional GPU allocation</h2><p>AMD&#8217;s SR-IOV implementation supports fractional GPU allocation natively &#8212; unlike NVIDIA, which requires a separate MIG mechanism. The GIM driver on the host can create multiple Virtual Functions per Physical Function, each representing a partition of the GPU&#8217;s compute and memory resources.</p><p>The number and size of VFs are configured at the driver level through partition modes. 
AMD Instinct GPUs (MI300X, MI350X) support several:</p><ul><li><p><strong>SPX</strong> (Single Partition eXtension) &#8212; 1 VF per PF. Whole GPU to one VM. This is what we currently use.</p></li><li><p><strong>DPX</strong> (Dual Partition) &#8212; 2 VFs per PF. Each VF gets half the GPU&#8217;s compute and memory.</p></li><li><p><strong>QPX</strong> (Quad Partition) &#8212; 4 VFs per PF.</p></li><li><p><strong>CPX</strong> (Core Partition) &#8212; up to 8 VFs per PF on MI300X (one per XCD die).</p></li></ul><p>Compared to NVIDIA MIG, AMD&#8217;s partitioning is simpler to reason about: the mode is set at the driver/firmware level, the VFs appear as standard PCIe devices, and there are no profile compatibility constraints to worry about. The trade-off is less granularity &#8212; you can&#8217;t mix partition sizes on the same GPU the way MIG allows (e.g., one 3/7 instance and one 4/7 instance).</p><p>From the libvirt perspective, there&#8217;s no difference between a whole-GPU VF and a fractional VF. 
Both are standard PCIe virtual functions, and both use <code>managed="yes"</code>and both use the same XML structure.</p><h2>Comparison</h2><p>Here&#8217;s how the three approaches stack up:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!AUDJ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ea91b72-5384-4e77-a81b-d7592b5edff9_1400x693.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!AUDJ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ea91b72-5384-4e77-a81b-d7592b5edff9_1400x693.png 424w, https://substackcdn.com/image/fetch/$s_!AUDJ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ea91b72-5384-4e77-a81b-d7592b5edff9_1400x693.png 848w, https://substackcdn.com/image/fetch/$s_!AUDJ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ea91b72-5384-4e77-a81b-d7592b5edff9_1400x693.png 1272w, https://substackcdn.com/image/fetch/$s_!AUDJ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ea91b72-5384-4e77-a81b-d7592b5edff9_1400x693.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!AUDJ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ea91b72-5384-4e77-a81b-d7592b5edff9_1400x693.png" width="1400" height="693" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2ea91b72-5384-4e77-a81b-d7592b5edff9_1400x693.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:693,&quot;width&quot;:1400,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!AUDJ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ea91b72-5384-4e77-a81b-d7592b5edff9_1400x693.png 424w, https://substackcdn.com/image/fetch/$s_!AUDJ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ea91b72-5384-4e77-a81b-d7592b5edff9_1400x693.png 848w, https://substackcdn.com/image/fetch/$s_!AUDJ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ea91b72-5384-4e77-a81b-d7592b5edff9_1400x693.png 1272w, https://substackcdn.com/image/fetch/$s_!AUDJ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ea91b72-5384-4e77-a81b-d7592b5edff9_1400x693.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h2>Closing thoughts</h2><p>GPU virtualization is harder than CPU or network virtualization because GPUs were designed for bare-metal performance, not sharing. 
Both NVIDIA and AMD have made significant progress, but have taken slightly different approaches: NVIDIA offers more flexibility (flexible MIG profile combinations), while AMD offers greater simplicity (standard SR-IOV; virtual functions are managed by the driver).</p><p>If you want to try it out, you can rent GPU VMs on <a href="https://cloudrift.ai/">CloudRift</a> &#8212; we have NVIDIA RTX, H100, H200, and B200, as well as AMD Instinct machines available.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://kernelspace.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading Kernel Space! Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[Host Setup for QEMU KVM GPU Passthrough with VFIO on Linux]]></title><description><![CDATA[From &#8220;black magic&#8221; to reproducible results]]></description><link>https://kernelspace.substack.com/p/host-setup-for-qemu-kvm-gpu-passthrough</link><guid isPermaLink="false">https://kernelspace.substack.com/p/host-setup-for-qemu-kvm-gpu-passthrough</guid><dc:creator><![CDATA[Dmitry Trifonov]]></dc:creator><pubDate>Thu, 09 Apr 2026 04:51:40 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!6J0Y!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdcc96e15-6365-42df-9156-2d3c8fea7161_1400x1050.jpeg" length="0" 
type="image/jpeg"/><content:encoded><![CDATA[<p>GPU passthrough shouldn&#8217;t feel like sorcery. If you&#8217;ve ever lost a weekend to half-working configs, random resets, or a guest that only boots when the moon is right, this guide is for you. I have pulled out a lot of hair while hardening the <a href="https://cloudrift.ai">CloudRift</a> VM service for a variety of consumer (RTX 4090, 5090, PRO 6000) and data center (H100, B200) GPUs, so I&#8217;m writing this guide<br>to help you avoid common pitfalls.</p><p>I&#8217;ll focus specifically on the host node configuration for GPU passthrough. Thus, this guide is relevant regardless of whether you&#8217;re using Proxmox or plain libvirt/QEMU. The provided instructions have been tested on Ubuntu 22.04 and 24.04 with various NVIDIA GPUs.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://kernelspace.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading Kernel Space! Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><p>To keep this guide manageable, I won&#8217;t delve into lower-level details, such as specific domain XML tricks, Linux kernel builds, or GPU firmware flashing. 
In most cases, you don&#8217;t need to fiddle with those.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!6J0Y!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdcc96e15-6365-42df-9156-2d3c8fea7161_1400x1050.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!6J0Y!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdcc96e15-6365-42df-9156-2d3c8fea7161_1400x1050.jpeg 424w, https://substackcdn.com/image/fetch/$s_!6J0Y!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdcc96e15-6365-42df-9156-2d3c8fea7161_1400x1050.jpeg 848w, https://substackcdn.com/image/fetch/$s_!6J0Y!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdcc96e15-6365-42df-9156-2d3c8fea7161_1400x1050.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!6J0Y!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdcc96e15-6365-42df-9156-2d3c8fea7161_1400x1050.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!6J0Y!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdcc96e15-6365-42df-9156-2d3c8fea7161_1400x1050.jpeg" width="1400" height="1050" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/dcc96e15-6365-42df-9156-2d3c8fea7161_1400x1050.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1050,&quot;width&quot;:1400,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!6J0Y!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdcc96e15-6365-42df-9156-2d3c8fea7161_1400x1050.jpeg 424w, https://substackcdn.com/image/fetch/$s_!6J0Y!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdcc96e15-6365-42df-9156-2d3c8fea7161_1400x1050.jpeg 848w, https://substackcdn.com/image/fetch/$s_!6J0Y!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdcc96e15-6365-42df-9156-2d3c8fea7161_1400x1050.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!6J0Y!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdcc96e15-6365-42df-9156-2d3c8fea7161_1400x1050.jpeg 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>An 8-GPU rig assembled by <a href="https://www.neuralrack.ai/">NeuralRack</a> &#8212; a typical VM GPU-rental rig.</p><h2>1. Remove NVIDIA drivers</h2><p>The first step is to remove the NVIDIA drivers. It is not required, but NVIDIA drivers tend to cause issues with passthrough in one way or another, so it&#8217;s better to remove them altogether.</p><blockquote><p>If you&#8217;re configuring your own work PC with multiple GPUs, skip this step as without NVIDIA drivers you won&#8217;t be able to run UI applications. In this case, the passthrough robustness is likely not a priority for you. However, I strongly recommend removing NVIDIA drivers on headless servers.</p></blockquote><p>If the NVIDIA driver is installed from the repository, you can remove it using the following commands:</p><pre><code>sudo apt-get remove --purge '^nvidia-.*'
sudo apt autoremove</code></pre><p>If you&#8217;ve installed the driver using the RUN file, remove it using:</p><pre><code>sudo /usr/bin/nvidia-uninstall</code></pre><p>Remove configs if any.</p><pre><code>sudo rm -rf /etc/X11/xorg.conf
sudo rm -rf /etc/modprobe.d/nvidia*.conf
sudo rm -rf /lib/modprobe.d/nvidia*.conf</code></pre><p>Reboot the system after driver removal.</p><pre><code>sudo reboot</code></pre><h2>2. Check BIOS, IOMMU Support, and IOMMU Group Assignment</h2><p>The next step is to check virtualization and IOMMU support. We need to check four things:</p><ol><li><p>Virtualization is enabled (AMD-Vi / Intel VT-D options are enabled in BIOS). If present, enable &#8220;Above 4G decoding&#8221; and &#8220;Resizable BAR (ReBAR)&#8221; options in BIOS as well.</p></li><li><p>IOMMU is active (groups exist).</p></li><li><p>Each GPU and its audio function are isolated in their own IOMMU group.</p></li><li><p>GPU groups contain only GPU/video-audio functions and PCI bridges &#8212; no NICs, NVMe, SATA, etc.</p></li></ol><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!KK04!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff49b0b2f-abb6-47b3-96f0-dec203ab2a0a_797x594.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!KK04!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff49b0b2f-abb6-47b3-96f0-dec203ab2a0a_797x594.png 424w, https://substackcdn.com/image/fetch/$s_!KK04!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff49b0b2f-abb6-47b3-96f0-dec203ab2a0a_797x594.png 848w, https://substackcdn.com/image/fetch/$s_!KK04!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff49b0b2f-abb6-47b3-96f0-dec203ab2a0a_797x594.png 1272w, 
https://substackcdn.com/image/fetch/$s_!KK04!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff49b0b2f-abb6-47b3-96f0-dec203ab2a0a_797x594.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!KK04!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff49b0b2f-abb6-47b3-96f0-dec203ab2a0a_797x594.png" width="797" height="594" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f49b0b2f-abb6-47b3-96f0-dec203ab2a0a_797x594.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:594,&quot;width&quot;:797,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!KK04!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff49b0b2f-abb6-47b3-96f0-dec203ab2a0a_797x594.png 424w, https://substackcdn.com/image/fetch/$s_!KK04!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff49b0b2f-abb6-47b3-96f0-dec203ab2a0a_797x594.png 848w, https://substackcdn.com/image/fetch/$s_!KK04!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff49b0b2f-abb6-47b3-96f0-dec203ab2a0a_797x594.png 1272w, 
https://substackcdn.com/image/fetch/$s_!KK04!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff49b0b2f-abb6-47b3-96f0-dec203ab2a0a_797x594.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Enable IOMMU in Bios</p><p>You can use the following handy-dandy script to check those preconditions.</p><blockquote><p>AI goes overboard when generating helper scripts, doesn&#8217;t it? I can&#8217;t complain, though. It provides a lot of useful information.</p></blockquote><pre><code>#!/usr/bin/env bash
# VFIO host sanity check: IOMMU support + GPU-containing groups

set -u  # don&#8217;t use -e so greps that find nothing don&#8217;t abort

# --- helpers ---------------------------------------------------------------
have() { command -v &#8220;$1&#8221; &gt;/dev/null 2&gt;&amp;1; }

read_klog() {
  if have journalctl; then journalctl -k -b 0 2&gt;/dev/null
  else dmesg 2&gt;/dev/null
  fi
}

trim() { sed -e &#8216;s/^[[:space:]]*//&#8217; -e &#8216;s/[[:space:]]*$//&#8217;; }

# --- 1) CPU vendor + boot flags -------------------------------------------
CPU_VENDOR=&#8221;$(
  (lscpu 2&gt;/dev/null | awk -F: &#8216;/Vendor ID/{print $2}&#8217; | trim) ||
  (grep -m1 &#8216;vendor_id&#8217; /proc/cpuinfo 2&gt;/dev/null | awk &#8216;{print $3}&#8217;)
)&#8221;
[ -z &#8220;${CPU_VENDOR}&#8221; ] &amp;&amp; CPU_VENDOR=&#8221;(unknown)&#8221;

CMDLINE=&#8221;$(cat /proc/cmdline 2&gt;/dev/null || echo &#8216;&#8217;)&#8221;
HAS_INTEL_FLAG=$(echo &#8220;$CMDLINE&#8221; | grep -q &#8216;intel_iommu=on&#8217; &amp;&amp; echo yes || echo no)
HAS_AMD_FLAG=$(echo &#8220;$CMDLINE&#8221; | grep -q &#8216;amd_iommu=on&#8217; &amp;&amp; echo yes || echo no)
HAS_PT_FLAG=$(echo &#8220;$CMDLINE&#8221; | grep -q &#8216;iommu=pt&#8217; &amp;&amp; echo yes || echo no)

# --- 2) Kernel log signals ------------------------------------------------
KLOG=&#8221;$(read_klog)&#8221;

DISABLED_MSG=$(echo &#8220;$KLOG&#8221; | egrep -i &#8216;IOMMU.*disabled by BIOS|DMAR:.*disabled|AMD-Vi:.*disabled&#8217; || true)
ENABLED_MSG=$(echo &#8220;$KLOG&#8221; | egrep -i &#8216;DMAR: IOMMU enabled|AMD-Vi:.*IOMMU.*enabled|IOMMU: .*enabled&#8217; || true)
IR_MSG=$(echo &#8220;$KLOG&#8221; | egrep -i &#8216;Interrupt remapping enabled&#8217; || true)

# --- 3) IOMMU groups presence --------------------------------------------
GROUPS_DIR=&#8221;/sys/kernel/iommu_groups&#8221;
GROUP_COUNT=0
if [ -d &#8220;$GROUPS_DIR&#8221; ]; then
  GROUP_COUNT=$(find &#8220;$GROUPS_DIR&#8221; -mindepth 1 -maxdepth 1 -type d 2&gt;/dev/null | wc -l | awk &#8216;{print $1}&#8217;)
fi

# Heuristic: active if groups exist (&gt;0). Logs help explain state.
IOMMU_ACTIVE=&#8221;no&#8221;
[ &#8220;$GROUP_COUNT&#8221; -gt 0 ] &amp;&amp; IOMMU_ACTIVE=&#8221;yes&#8221;

# --- 4) Report summary ----------------------------------------------------
echo &#8220;=== IOMMU Summary ===&#8221;
echo &#8220;CPU vendor           : $CPU_VENDOR&#8221;
echo &#8220;Kernel cmdline       : $CMDLINE&#8221;
echo &#8220;Boot flags           : intel_iommu=$HAS_INTEL_FLAG  amd_iommu=$HAS_AMD_FLAG  iommu=pt=$HAS_PT_FLAG&#8221;
echo &#8220;Groups directory     : $GROUPS_DIR  (exists: $([ -d &#8220;$GROUPS_DIR&#8221; ] &amp;&amp; echo yes || echo no))&#8221;
echo &#8220;IOMMU group count    : $GROUP_COUNT&#8221;
echo &#8220;Kernel says enabled  : $([ -n &#8220;$ENABLED_MSG&#8221; ] &amp;&amp; echo yes || echo no)&#8221;
echo &#8220;Interrupt remapping  : $([ -n &#8220;$IR_MSG&#8221; ] &amp;&amp; echo yes || echo no)&#8221;
echo &#8220;Kernel says disabled : $([ -n &#8220;$DISABLED_MSG&#8221; ] &amp;&amp; echo yes || echo no)&#8221;
echo &#8220;IOMMU ACTIVE?        : $IOMMU_ACTIVE&#8221;
echo

if [ -n &#8220;$ENABLED_MSG&#8221; ]; then
  echo &#8220;--- Kernel enable lines ---&#8221;
  echo &#8220;$ENABLED_MSG&#8221;
  echo
fi
if [ -n &#8220;$DISABLED_MSG&#8221; ]; then
  echo &#8220;--- Kernel disable lines ---&#8221;
  echo &#8220;$DISABLED_MSG&#8221;
  echo
fi

# --- 5) Original: list only GPU-containing groups -------------------------
echo &#8220;=== GPU-Containing IOMMU Groups ===&#8221;
if [ ! -d &#8220;$GROUPS_DIR&#8221; ] || [ &#8220;$GROUP_COUNT&#8221; -eq 0 ]; then
  echo &#8220;(no IOMMU groups found)&#8221;
else
  declare -A GPU_COUNT_BY_GROUP=()
  group_warnings=()

  for g in &#8220;$GROUPS_DIR&#8221;/*; do
    [ -d &#8220;$g&#8221; ] || continue
    group_num=$(basename &#8220;$g&#8221;)
    gpu_found=false
    device_lines=&#8221;&#8220;
    non_gpu_non_bridge=false
    gpu_count_in_this_group=0

    for d in &#8220;$g&#8221;/devices/*; do
      [ -e &#8220;$d&#8221; ] || continue
      pci_addr=$(basename &#8220;$d&#8221;)
      # -nns prints class code [XXXX] and vendor:device [vvvv:dddd]
      line=$(lspci -nns &#8220;$pci_addr&#8221; 2&gt;/dev/null || echo &#8220;$pci_addr (unlisted)&#8221;)
      device_lines+=&#8221;$line&#8221;$&#8217;\n&#8217;

      # Extract first [...] which is the class code, e.g. 0300, 0302, 0403, 0604, 0600
      class_code=$(echo &#8220;$line&#8221; | awk -F&#8217;[][]&#8217; &#8216;{print $2}&#8217;)

      # Detect GPUs / 3D controllers and their HDA audio functions
      if echo &#8220;$line&#8221; | grep -qE &#8216;VGA compatible controller|3D controller&#8217;; then
        gpu_found=true
        gpu_count_in_this_group=$((gpu_count_in_this_group+1))
      fi

      # Allowlist: 0300(VGA), 0302(3D), 0403(HDA audio), 0600(host bridge), 0604(PCI bridge)
      case &#8220;$class_code&#8221; in
        0300|0302|0403|0600|0604) : ;;
        *) non_gpu_non_bridge=true ;;
      esac
    done

    if $gpu_found; then
      echo &#8220;IOMMU Group $group_num:&#8221;
      echo &#8220;$device_lines&#8221;

      # Track GPUs per group
      GPU_COUNT_BY_GROUP[&#8221;$group_num&#8221;]=$gpu_count_in_this_group

      # Warn if unexpected devices share the group with the GPU
      if $non_gpu_non_bridge; then
        group_warnings+=(&#8221;WARN: Group $group_num contains non-GPU, non-audio, non-bridge devices (consider different slot/CPU root complex or ACS).&#8221;)
      fi
    fi
  done

  # Post-checks
  # 1) Each GPU should be alone (one GPU per group)
  shared_groups=()
  for gnum in &#8220;${!GPU_COUNT_BY_GROUP[@]}&#8221;; do
    if [ &#8220;${GPU_COUNT_BY_GROUP[$gnum]}&#8221; -gt 1 ]; then
      shared_groups+=(&#8221;$gnum&#8221;)
    fi
  done

  if [ &#8220;${#shared_groups[@]}&#8221; -gt 0 ]; then
    echo
    echo &#8220;WARN: Multiple GPUs share these IOMMU groups: ${shared_groups[*]} (prefer one GPU per group for VFIO).&#8221;
  fi

  # 2) Any non-bridge co-residents?
  if [ &#8220;${#group_warnings[@]}&#8221; -gt 0 ]; then
    echo
    printf &#8220;%s\n&#8221; &#8220;${group_warnings[@]}&#8221;
  fi
fi</code></pre><p>Here is what a good summary should look like:</p><pre><code>=== IOMMU Summary ===
CPU vendor           : AuthenticAMD
Kernel cmdline       : BOOT_IMAGE=/boot/vmlinuz-6.8.0-71-generic root=/dev/mapper/vgroot-lvroot ro systemd.unified_cgroup_hierarchy=false default_hugepagesz=1G hugepages=576 hugepagesz=1G nomodeset video=efifb:off iommu=pt pci=realloc pcie_aspm=off amd_iommu=on vfio-pci.ids=10de:0000,10de:204b,10de:22e8,10de:2bb1 modprobe.blacklist=nouveau,nvidia,nvidiafb,snd_hda_intel
Boot flags           : intel_iommu=no  amd_iommu=yes  iommu=pt=yes
Groups directory     : /sys/kernel/iommu_groups  (exists: yes)
IOMMU group count    : 57
Kernel says enabled  : no
Interrupt remapping  : no
Kernel says disabled : no
IOMMU ACTIVE?        : yes

=== GPU-Containing IOMMU Groups ===
IOMMU Group 13:
c1:00.0 VGA compatible controller [0300]: NVIDIA Corporation Device [10de:2bb1] (rev a1)
c1:00.1 Audio device [0403]: NVIDIA Corporation Device [10de:22e8] (rev a1)

IOMMU Group 16:
c6:00.0 PCI bridge [0604]: ASPEED Technology, Inc. AST1150 PCI-to-PCI Bridge [1a03:1150] (rev 06)
c7:00.0 VGA compatible controller [0300]: ASPEED Technology, Inc. ASPEED Graphics Family [1a03:2000] (rev 52)

IOMMU Group 27:
81:00.0 VGA compatible controller [0300]: NVIDIA Corporation Device [10de:2bb1] (rev a1)
81:00.1 Audio device [0403]: NVIDIA Corporation Device [10de:22e8] (rev a1)

IOMMU Group 42:
01:00.0 VGA compatible controller [0300]: NVIDIA Corporation Device [10de:2bb1] (rev a1)
01:00.1 Audio device [0403]: NVIDIA Corporation Device [10de:22e8] (rev a1)

IOMMU Group 54:
41:00.0 VGA compatible controller [0300]: NVIDIA Corporation Device [10de:2bb1] (rev a1)
41:00.1 Audio device [0403]: NVIDIA Corporation Device [10de:22e8] (rev a1)</code></pre><p>As we can see, IOMMU support is enabled, and all GPUs and their corresponding audio devices are in separate IOMMU groups.</p><p>Sometimes you may see PCI bridges in the GPU IOMMU group. This is normal.</p><pre><code>=== GPU-Containing IOMMU Groups ===
IOMMU Group 13:
40:01.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Starship/Matisse PCIe Dummy Host Bridge [1022:1482]
40:01.1 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Starship/Matisse GPP Bridge [1022:1483]
41:00.0 VGA compatible controller [0300]: NVIDIA Corporation Device [10de:2b85] (rev a1)
41:00.1 Audio device [0403]: NVIDIA Corporation Device [10de:22e8] (rev a1)

IOMMU Group 32:
20:03.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Starship/Matisse PCIe Dummy Host Bridge [1022:1482]
20:03.1 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Starship/Matisse GPP Bridge [1022:1483]
25:00.0 VGA compatible controller [0300]: NVIDIA Corporation Device [10de:2b85] (rev a1)
25:00.1 Audio device [0403]: NVIDIA Corporation Device [10de:22e8] (rev a1)</code></pre><h2>3. Leverage 1G Huge Pages</h2><p>This step is optional. However, if you have more than 512GB of RAM on your system, it is highly encouraged. From experience, aside from providing performance benefit, 1 GB huge pages make the VM startup much more reliable on high-memory systems.</p><p><strong>Rule of thumb</strong></p><ul><li><p><strong>&lt; 128 GB RAM</strong>: usually skip (benefit is small).</p></li><li><p><strong>128&#8211;512 GB</strong>: optional; can reduce latency jitter.</p></li><li><p><strong>&gt; 512 GB</strong>: recommended for reliability and predictable performance.</p></li></ul><p><strong>Why 1 GiB pages help</strong></p><ul><li><p>Fewer page-table walks &#8594; fewer TLB misses.</p></li><li><p>Lower page management overhead.</p></li><li><p>More predictable VM start times on large RAM allocations.</p></li></ul><h3>3.1 Check Huge Page Support</h3><p>To confirm 1G huge page support on your system, check the pdpe1gb CPU flag.</p><pre><code>grep -m1 pdpe1gb /proc/cpuinfo &gt;/dev/null &amp;&amp; echo &#8220;&#10003; CPU supports 1GiB pages&#8221; || echo &#8220;&#10007; No 1GiB page support&#8221;</code></pre><h3>3.2 Allocate Huge Pages</h3><p>Determine how much memory you want to reserve for the VMs. 
You need to reserve that much memory for huge pages plus a buffer.</p><blockquote><p>Note that the memory reserved for huge pages will not be usable on the host system.</p></blockquote><p>For example, if you want to dedicate 2000GB to virtual machines with an 80GB buffer, you would need 2080 huge pages.</p><p>I use the following empirically validated table to determine the huge page configuration on a high-memory multi-GPU system.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!eeNG!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb3bbccfb-5314-43bb-99bc-d23925e693f1_1150x500.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!eeNG!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb3bbccfb-5314-43bb-99bc-d23925e693f1_1150x500.png 424w, https://substackcdn.com/image/fetch/$s_!eeNG!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb3bbccfb-5314-43bb-99bc-d23925e693f1_1150x500.png 848w, https://substackcdn.com/image/fetch/$s_!eeNG!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb3bbccfb-5314-43bb-99bc-d23925e693f1_1150x500.png 1272w, https://substackcdn.com/image/fetch/$s_!eeNG!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb3bbccfb-5314-43bb-99bc-d23925e693f1_1150x500.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!eeNG!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb3bbccfb-5314-43bb-99bc-d23925e693f1_1150x500.png" width="1150" height="500" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b3bbccfb-5314-43bb-99bc-d23925e693f1_1150x500.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:500,&quot;width&quot;:1150,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!eeNG!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb3bbccfb-5314-43bb-99bc-d23925e693f1_1150x500.png 424w, https://substackcdn.com/image/fetch/$s_!eeNG!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb3bbccfb-5314-43bb-99bc-d23925e693f1_1150x500.png 848w, https://substackcdn.com/image/fetch/$s_!eeNG!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb3bbccfb-5314-43bb-99bc-d23925e693f1_1150x500.png 1272w, https://substackcdn.com/image/fetch/$s_!eeNG!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb3bbccfb-5314-43bb-99bc-d23925e693f1_1150x500.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft 
pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><blockquote><p>Is there a reliable formula to determine the huge page buffer size? Good question. If you know one, let me know in the comments. It makes sense that we need to leave some memory for the system, but it feels that the gap between memory dedicated for VM allocation and the number of huge pages is not necessary. After VM startup we&#8217;ll see that the system has allocated the exact number of requested huge pages, so why do we need a buffer and how big should it be? Is it because of the fragmentation? Empirically, I&#8217;ve confirmed that it is needed. 
Without a buffer I was occasionally running into OOM errors.</p></blockquote><p>Run the following command to allocate 2080 pages (it will take a while):</p><pre><code>echo 2080 | sudo tee /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages</code></pre><p>To check that huge pages were allocated, run</p><pre><code>grep -i huge /proc/meminfo</code></pre><p>Look at Hugepagesize and Hugetlb. They tell the huge page size and the total amount of RAM allocated for huge pages. You should see output like this:</p><pre><code>AnonHugePages:     79872 kB
ShmemHugePages:        0 kB
FileHugePages:         0 kB
HugePages_Total:    2080
HugePages_Free:     1580
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:    1048576 kB
Hugetlb:        2181038080 kB</code></pre><p>To deallocate, invoke:</p><pre><code>echo 0 | sudo tee /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages</code></pre><h3>3.3 Make Huge Pages Persistent</h3><p>Edit the <em>/etc/default/grub</em> file and modify the line containing <em>GRUB_CMDLINE_LINUX</em>. <br><br>Add <em>default_hugepagesz=1G hugepagesz=1G hugepages=&lt;num&gt;</em> to the <em>GRUB_CMDLINE_LINUX</em> options.</p><p>The <em>&lt;num&gt;</em> is the number of huge pages to allocate. For example:</p><pre><code>GRUB_CMDLINE_LINUX="... default_hugepagesz=1G hugepagesz=1G hugepages=200"</code></pre><blockquote><p><strong>Be careful. If you specify more huge pages than the system can allocate, the machine will not boot.</strong></p></blockquote><p>Update the GRUB changes, reboot, and verify that huge pages are allocated (or do this in the end).</p><pre><code>sudo update-grub
sudo reboot</code></pre><h3>3.4 (Optional) Mount Huge Page Table</h3><p>Many systems already have <em>/dev/hugepages</em>. If not, or if you want a dedicated mount:</p><pre><code>sudo mkdir -p /mnt/hugepages-1G
sudo mount -t hugetlbfs -o pagesize=1G none /mnt/hugepages-1G</code></pre><p>Check that the mount point is present by running <em>grep hugetlbfs /proc/mounts</em>. You should see output like this:</p><pre><code>hugetlbfs /dev/hugepages hugetlbfs rw,nosuid,nodev,relatime,pagesize=1024M 0 0
hugetlbfs /mnt/hugepages-1G hugetlbfs rw,relatime,pagesize=1024M 0 0</code></pre><p>To persist the mount across reboots, add it to <em>/etc/fstab</em>:</p><pre><code>echo "none /mnt/hugepages-1G hugetlbfs pagesize=1G 0 0" | sudo tee -a /etc/fstab</code></pre><h3>3.5 Configure your Virtualization Software to use Huge Pages</h3><p>Neither Proxmox nor libvirt uses huge pages by default. <br><br>To use them in libvirt, you need to add the following section to the domain XML:</p><pre><code>&lt;memoryBacking&gt;
  &lt;hugepages&gt;
    &lt;page size='1048576' unit='KiB'/&gt;
  &lt;/hugepages&gt;
  &lt;locked/&gt;
&lt;/memoryBacking&gt;</code></pre><p>In Proxmox CLI, you do it as follows:</p><pre><code>qm set &lt;vmid&gt; --hugepages 1024   # use 1GiB pages
qm set &lt;vmid&gt; --keephugepages 1  # optional: keep reserved after shutdown</code></pre><h2>4. Bind to VFIO Early</h2><p>For maximum stability, have VFIO claim the GPU at boot so no runtime driver swaps occur (Proxmox/libvirt will otherwise bind/unbind around VM start/stop).</p><blockquote><p>I strongly recommend this step on headless servers. However, for your local PC setup you might need to resort to dynamic binding as VFIO driver will claim your main GPU.</p></blockquote><h3>4.1 Identify the PCI IDs to bind</h3><p>First, you need to determine the PCI vendor ID and device ID for your GPUs.</p><p>List all NVIDIA functions (display + audio, and any auxiliary functions):</p><pre><code>lspci -nn | grep -i nvidia</code></pre><p>Example (RTX 5090):</p><pre><code>01:00.0 VGA compatible controller [0300]: NVIDIA Corporation Device [10de:2b85] (rev a1)
01:00.1 Audio device [0403]: NVIDIA Corporation Device [10de:22e8] (rev a1)</code></pre><h3>4.2 Give VFIO first claim</h3><p>Add the following lines to <em>GRUB_CMDLINE_LINUX_DEFAULT</em> in<em> /etc/default/grub</em>, replacing the PCI vendor ID and device ID with the appropriate values. Keep other options if needed.</p><pre><code>GRUB_CMDLINE_LINUX_DEFAULT="modprobe.blacklist=nouveau,nvidia,nvidiafb,snd_hda_intel vfio-pci.ids=10de:2b85,10de:22e8 ..."</code></pre><p>Proxmox is likely using systemd-boot by default instead of GRUB. <a href="https://pve.proxmox.com/pve-docs/pve-admin-guide.html#sysboot">Check the bootloader</a> you&#8217;re using and adjust the kernel command line accordingly.</p><blockquote><p>Many online manuals suggest adding VFIO modules to /etc/modprobe.d/vfio.conf, but this approach has not always worked for me. I recommend early binding via the kernel command line.</p></blockquote><h3>4.3 Ensure VFIO is in the initramfs</h3><p>We need to make sure that VFIO modules are loaded early in the boot process. To achieve this, we include them in the <em>initramfs</em>.</p><pre><code>sudo tee -a /etc/initramfs-tools/modules &gt;/dev/null &lt;&lt;'EOF'
vfio
vfio_iommu_type1
vfio_pci
vfio_virqfd
EOF</code></pre><h3>4.4 Reboot and verify</h3><p>Update GRUB, initramfs, and reboot.</p><pre><code>sudo update-initramfs -u -k all
sudo update-grub
sudo reboot</code></pre><p>After reboot, check that VFIO drivers are in use. You can use</p><pre><code>lspci -k | grep -A 2 -i nvidia</code></pre><p>You should see VFIO drivers in use.</p><pre><code>81:00.0 VGA compatible controller: NVIDIA Corporation Device 2b85 (rev a1)
    Subsystem: Gigabyte Technology Co., Ltd Device 416f
    Kernel driver in use: vfio-pci
    Kernel modules: nvidiafb, nouveau
81:00.1 Audio device: NVIDIA Corporation Device 22e8 (rev a1)
    Subsystem: NVIDIA Corporation Device 0000
    Kernel driver in use: vfio-pci
    Kernel modules: snd_hda_intel</code></pre><blockquote><p>To be fair, there was one machine where this technique to bind VFIO failed. The system was aggressively binding snd_hda_intel driver to the GPU audio function. However, this method worked for me in all other cases.</p></blockquote><h2>5. Other GRUB Options</h2><p>Here is a summary of other kernel command line options that you may want to consider, along with my thoughts on each.</p><ul><li><p><strong>pci=realloc</strong>: Reallocate PCI resources forces the kernel to reassign PCI bus resources (MMIO/IOBARs) from scratch, ignoring what the firmware/BIOS assigned. It helps avoid issues when the BIOS didn&#8217;t allocate enough space for devices (common with large GPUs or multiple devices). Fixes &#8220;BAR can&#8217;t be assigned&#8221; or &#8220;resource busy&#8221; errors. <em><strong>This option is helpful. I also like to include it in the guest OS kernel parameters. It occasionally helps to work around BAR allocation issues. However, there is no need to list it unless the system has PCI device enumeration issues.</strong></em></p></li><li><p><strong>iommu=pt</strong>: IOMMU passthrough mode tells the kernel to enable the IOMMU but use pass-through mode for DMA mappings by default.<br>For VFIO GPU passthrough &#8212; allows the device to access physical memory directly with minimal performance penalty. <em><strong>I haven&#8217;t had a chance to test the performance gains, so I can only say that this option didn&#8217;t cause any issues.</strong></em></p></li><li><p><strong>pcie_aspm=off</strong>: Disable PCIe Active State Power Management, which is a power-saving feature that reduces PCIe link power in idle states. Some PCIe devices (especially GPUs) have trouble retraining links or waking from ASPM low-power states, leading to hangs or device-inaccessible errors. 
This option was introduced to my configurations after I lost a lot of time on the <a href="https://medium.com/@dmitrytrifonov/bug-bounty-nvidia-reset-bug-fd3c6c99d860">Reset Bug</a>. <em><strong>It didn&#8217;t help. I don&#8217;t consider this option helpful at the moment, but I am still evaluating it.</strong></em></p></li><li><p><strong>nomodeset</strong>: Disable kernel mode setting (KMS) for all GPUs; prevents DRM drivers from taking over the console. <strong>This option is intended for use with headless servers only. It can break desktop/console output. I typically use it since we&#8217;re working with headless servers only.</strong></p></li><li><p><strong>video=efifb:off</strong>: Disables the firmware EFI framebuffer, so simpledrm/efifb won&#8217;t grab the boot GPU before VFIO claims it. <strong>This option is outdated and has no effect on systems with modern kernels. I list it for completeness.</strong></p></li><li><p><strong>intel_iommu=on</strong> / <strong>amd_iommu=on</strong>: Enable IOMMU support for Intel and AMD. <strong>These are enabled by default, so there is no need to add them to kernel parameters.</strong></p></li></ul><p>Here is how the typical kernel command line should look on a headless server with over 500GB of RAM.</p><pre><code>nomodeset
modprobe.blacklist=nouveau,nvidia,nvidiafb,snd_hda_intel
vfio-pci.ids=10de:2b85,10de:22e8
default_hugepagesz=1G hugepagesz=1G hugepages=400</code></pre><h2>Conclusion</h2><p>The VFIO GPU passthrough is a finicky process. It is sensitive to host hardware and software configuration. However, with enough diligence, you can make it robust and reliable. <strong>I strongly believe in this approach and rely on VFIO GPU passthrough as the primary tool for our GPU rental service at <a href="https://cloudrift.ai/">cloudrift.ai</a>.</strong></p><p>I hope this guide helped you to improve your homelab or data center setup. If you notice inaccuracies or have suggestions, please don&#8217;t hesitate to let me know, so we can work together to improve the workflow.</p><p>Final host checklist:</p><ul><li><p>Enable IOMMU, Above 4G, and (where applicable) ReBAR in the BIOS.</p></li><li><p>Verify clean IOMMU groups; each GPU (+ audio) isolated.</p></li><li><p>Bind to vfio-pci early.</p></li><li><p>Size huge pages (1 GiB on high-RAM hosts) and confirm in <em>/proc/meminfo</em>.</p></li><li><p>Configure other kernel command-line options as needed.</p></li></ul><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://kernelspace.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading Kernel Space! 
Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[Evolution of GPU Programming]]></title><description><![CDATA[From Smart Pixels to the Backbone of an AI-driven World]]></description><link>https://kernelspace.substack.com/p/evolution-of-gpu-programming</link><guid isPermaLink="false">https://kernelspace.substack.com/p/evolution-of-gpu-programming</guid><dc:creator><![CDATA[Dmitry Trifonov]]></dc:creator><pubDate>Tue, 07 Apr 2026 02:12:15 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!5e_y!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2a2772cc-d048-48d1-96c5-be08183ed672_519x480.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Every decade GPUs reinvented themselves &#8212; from drawing triangles to generating worlds, and now, reasoning with language. I have realized that throughout my entire programming journey, I have been working closely with GPUs and tried countless ways to program them. From writing pixel shaders in GLSL to implementing real-time 3D scanning algorithms in OpenCL to optimizing deep learning models in PyTorch and Tensorflow. So what can be a better way to share my experience than to write a blog post about the evolution of GPU programming, full of <strong>nostalgia and memes</strong>?</p><p>A lot has changed in the GPU programming landscape over the years. New programming models, new frameworks, and new hardware architectures have emerged. There is no point in studying them nowadays; however, the evolutionary path of GPU programming is quite interesting. 
If you&#8217;re an AI expert or a developer in another field, it can help you broaden your expertise or provide the necessary inspiration to dive into the world of GPU programming. It can give you new ideas to address current problems, especially given that some of the issues we face today in AI were already faced by graphics programmers 25 years ago.</p><p>Here is a mildly entertaining, nostalgia-induced journey through the history of GPU programming from making brick walls that look bumpy in 2000 to optimizing attention mechanisms in LLM models in 2025. Feel free to skip the code snippets if you&#8217;re not interested in programming or are already familiar with the material and would rather enjoy the story.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!5e_y!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2a2772cc-d048-48d1-96c5-be08183ed672_519x480.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!5e_y!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2a2772cc-d048-48d1-96c5-be08183ed672_519x480.jpeg 424w, https://substackcdn.com/image/fetch/$s_!5e_y!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2a2772cc-d048-48d1-96c5-be08183ed672_519x480.jpeg 848w, https://substackcdn.com/image/fetch/$s_!5e_y!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2a2772cc-d048-48d1-96c5-be08183ed672_519x480.jpeg 1272w, 
https://substackcdn.com/image/fetch/$s_!5e_y!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2a2772cc-d048-48d1-96c5-be08183ed672_519x480.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!5e_y!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2a2772cc-d048-48d1-96c5-be08183ed672_519x480.jpeg" width="519" height="480" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2a2772cc-d048-48d1-96c5-be08183ed672_519x480.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:480,&quot;width&quot;:519,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!5e_y!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2a2772cc-d048-48d1-96c5-be08183ed672_519x480.jpeg 424w, https://substackcdn.com/image/fetch/$s_!5e_y!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2a2772cc-d048-48d1-96c5-be08183ed672_519x480.jpeg 848w, https://substackcdn.com/image/fetch/$s_!5e_y!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2a2772cc-d048-48d1-96c5-be08183ed672_519x480.jpeg 1272w, 
https://substackcdn.com/image/fetch/$s_!5e_y!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2a2772cc-d048-48d1-96c5-be08183ed672_519x480.jpeg 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h2>Smart Pixels</h2><p>In the early 2000s, the GPUs were used exclusively for visualization, and the rendering pipeline was completely fixed-function. It was akin to HTML, where you would predefine your scene: geometry, textures, position of lights and camera, and the GPU would take care of rendering it. 
You could, of course, customize it on the fly, but only in a limited way, by changing parameters of the predefined functions, and this customization happened entirely on the CPU side.</p><p>Here is a simple example of rendering a triangle using old-school OpenGL, taken from <a href="https://cs.lmu.edu/~ray/notes/openglexamples/">here</a>.</p><pre><code>// Set every pixel in the frame buffer to the current clear color.
glClear(GL_COLOR_BUFFER_BIT);

// Drawing is done by specifying a sequence of vertices. The way these
// vertices are connected is determined by the mode passed to glBegin.
// GL_POLYGON constructs a filled polygon.
glBegin(GL_POLYGON);
  glColor3f(1, 0, 0); glVertex3f(-0.6, -0.75, 0.5);
  glColor3f(0, 1, 0); glVertex3f(0.6, -0.75, 0);
  glColor3f(0, 0, 1); glVertex3f(0, 0.75, 0);
glEnd();

// Flush drawing command buffer to make drawing happen as soon as possible.
glFlush();</code></pre><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!aUcV!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc269935d-8f26-400f-9d18-ae38ac3a7029_400x282.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!aUcV!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc269935d-8f26-400f-9d18-ae38ac3a7029_400x282.png 424w, https://substackcdn.com/image/fetch/$s_!aUcV!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc269935d-8f26-400f-9d18-ae38ac3a7029_400x282.png 848w, https://substackcdn.com/image/fetch/$s_!aUcV!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc269935d-8f26-400f-9d18-ae38ac3a7029_400x282.png 1272w, https://substackcdn.com/image/fetch/$s_!aUcV!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc269935d-8f26-400f-9d18-ae38ac3a7029_400x282.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!aUcV!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc269935d-8f26-400f-9d18-ae38ac3a7029_400x282.png" width="400" height="282" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c269935d-8f26-400f-9d18-ae38ac3a7029_400x282.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:282,&quot;width&quot;:400,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!aUcV!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc269935d-8f26-400f-9d18-ae38ac3a7029_400x282.png 424w, https://substackcdn.com/image/fetch/$s_!aUcV!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc269935d-8f26-400f-9d18-ae38ac3a7029_400x282.png 848w, https://substackcdn.com/image/fetch/$s_!aUcV!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc269935d-8f26-400f-9d18-ae38ac3a7029_400x282.png 1272w, https://substackcdn.com/image/fetch/$s_!aUcV!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc269935d-8f26-400f-9d18-ae38ac3a7029_400x282.png 1456w" sizes="100vw"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path 
d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Rendering a triangle with OpenGL</figcaption></figure></div><p>The idea that you can actually program how pixels are rendered on the screen was quite revolutionary in the early 2000s.</p><p>My first interaction with these ideas was through <a href="https://gamedev-ru.translate.goog/code/articles/?id=4155&amp;_x_tr_sl=ru&amp;_x_tr_tl=en&amp;_x_tr_hl=en&amp;_x_tr_pto=wapp">an article</a> from 2001 on a popular Russian game-development website about the <a href="https://registry.khronos.org/OpenGL/extensions/NV/NV_register_combiners.txt">NV_register_combiners</a> extension for OpenGL. Surprisingly, the article is still available online.</p><p>This extension enabled you to program how the final color of a pixel is computed from various inputs, such as texture colors and lighting, allowing you to create more complex visual effects. This computation is performed on the GPU, enabling real-time performance. 
It was akin to running a small assembly program on the GPU for each pixel being rendered.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Ilyt!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd9d4cc8a-a2a9-4469-a3f2-386d4213448d_1400x556.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Ilyt!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd9d4cc8a-a2a9-4469-a3f2-386d4213448d_1400x556.png 424w, https://substackcdn.com/image/fetch/$s_!Ilyt!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd9d4cc8a-a2a9-4469-a3f2-386d4213448d_1400x556.png 848w, https://substackcdn.com/image/fetch/$s_!Ilyt!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd9d4cc8a-a2a9-4469-a3f2-386d4213448d_1400x556.png 1272w, https://substackcdn.com/image/fetch/$s_!Ilyt!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd9d4cc8a-a2a9-4469-a3f2-386d4213448d_1400x556.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Ilyt!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd9d4cc8a-a2a9-4469-a3f2-386d4213448d_1400x556.png" width="1400" height="556" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d9d4cc8a-a2a9-4469-a3f2-386d4213448d_1400x556.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:556,&quot;width&quot;:1400,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!Ilyt!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd9d4cc8a-a2a9-4469-a3f2-386d4213448d_1400x556.png 424w, https://substackcdn.com/image/fetch/$s_!Ilyt!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd9d4cc8a-a2a9-4469-a3f2-386d4213448d_1400x556.png 848w, https://substackcdn.com/image/fetch/$s_!Ilyt!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd9d4cc8a-a2a9-4469-a3f2-386d4213448d_1400x556.png 1272w, https://substackcdn.com/image/fetch/$s_!Ilyt!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd9d4cc8a-a2a9-4469-a3f2-386d4213448d_1400x556.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Rendering small surface details with NV register combiners.</figcaption></figure></div><p>Graphics developers were fascinated by this idea, as it enabled them to increase the visual fidelity of the scenes dramatically. Shortly after, the <a href="https://www.khronos.org/opengl/wiki/Core_Language_(GLSL)">GLSL</a> was conceptualized and formally introduced in 2004, allowing the writing of more complex shaders (small programs that define how to manipulate geometry or pixels) in a C-like language.</p><blockquote><p>Are you feeling GPU poor? Imagine that it was even worse back then! Every new generation of GPUs introduced new features and capabilities, dramatically increasing the visual fidelity of games. Having a new GPU was a prerequisite for playing the latest and greatest games. For those into computer graphics, the frustration of the wait and the excitement of getting the new card were doubled! 
Luckily, I could trick my parents into buying me a new card, because it supported <strong>SHADERS</strong>! Which, of course, were essential to advance my computer science education. Having the ability to play Oblivion on high settings was just a nice bonus.</p></blockquote><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!kupP!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F682f45b1-2648-450d-b5b9-bf37b0291866_1080x1230.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!kupP!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F682f45b1-2648-450d-b5b9-bf37b0291866_1080x1230.jpeg 424w, https://substackcdn.com/image/fetch/$s_!kupP!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F682f45b1-2648-450d-b5b9-bf37b0291866_1080x1230.jpeg 848w, https://substackcdn.com/image/fetch/$s_!kupP!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F682f45b1-2648-450d-b5b9-bf37b0291866_1080x1230.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!kupP!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F682f45b1-2648-450d-b5b9-bf37b0291866_1080x1230.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!kupP!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F682f45b1-2648-450d-b5b9-bf37b0291866_1080x1230.jpeg" width="1080" height="1230" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/682f45b1-2648-450d-b5b9-bf37b0291866_1080x1230.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1230,&quot;width&quot;:1080,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!kupP!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F682f45b1-2648-450d-b5b9-bf37b0291866_1080x1230.jpeg 424w, https://substackcdn.com/image/fetch/$s_!kupP!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F682f45b1-2648-450d-b5b9-bf37b0291866_1080x1230.jpeg 848w, https://substackcdn.com/image/fetch/$s_!kupP!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F682f45b1-2648-450d-b5b9-bf37b0291866_1080x1230.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!kupP!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F682f45b1-2648-450d-b5b9-bf37b0291866_1080x1230.jpeg 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">The &#8220;ugly&#8221; by today&#8217;s standards non-remastered Oblivion looked gorgeous in 2006. It has won the &#8220;Best Graphics&#8221; and &#8220;Best Technology&#8221; awards from several gaming websites. Notice how the bump mapping algorithm was leveraged to create the illusion of bumps on the wall and wrinkles on the emperor's face. In the original version, the wall is entirely flat, and wrinkles are not modelled in the geometry &#8212; image from <a href="https://www.reddit.com/r/videogames/comments/1k5e1ig/early_screenshot_comparison_of_tes_iv_oblivion/">Reddit</a>.</figcaption></figure></div><p>Here is an example of a simple GLSL program from <a href="https://www.rastertek.com/gl4linuxtut20.html">rastertek.com</a> to perform bump mapping, the effect achieved by perturbing the surface normals of a texture to simulate small-scale bumps and wrinkles on the surface of an object.</p><pre><code>in vec2 texCoord;
in vec3 normal;
in vec3 tangent;
in vec3 binormal;

void main(void)
{
    // Sample the pixel color from the texture using the sampler at this texture coordinate location.
    vec4 textureColor = texture(shaderTexture1, texCoord);

    // Sample the pixel from the normal map.
    vec4 bumpMap = texture(shaderTexture2, texCoord);

    // Expand the range of the normal value from (0, +1) to (-1, +1).
    bumpMap = (bumpMap * 2.0f) - 1.0f;

    // Calculate the normal from the data in the normal map.
    vec3 bumpNormal = (bumpMap.x * tangent) + (bumpMap.y * binormal) + (bumpMap.z * normal);

    // Normalize the resulting bump normal.
    bumpNormal = normalize(bumpNormal);

    // Calculate the amount of light on this pixel based on the normal map value.
    float lightIntensity = clamp(dot(bumpNormal, -lightDirection), 0.0f, 1.0f);

    // Determine the final amount of diffuse color based on the diffuse color combined with the light intensity.
    outputColor =  clamp((diffuseLightColor * lightIntensity), 0.0f, 1.0f);

    // Combine the final light color with the texture color.
    outputColor = outputColor * textureColor;
}</code></pre><blockquote><p>What do all these <code>in vec3</code> variables mean? These are the inputs to the shader program. Those are specified per vertex and interpolated across the surface of the triangle being rendered. The interpolation is done by GPU hardware and fed into the shader program for each pixel being rendered. This way, you can have different values for each pixel, allowing for more complex effects. This allows for parallelization of the computation across all pixels being rendered, as each pixel can be processed independently.</p></blockquote><p>Shaders quickly progressed from simple pixel color manipulation to complex effects simulating shadows, reflections, and refractions. Graphics programmers were especially obsessed with simulating complex surface details without increasing the geometric complexity of the scene. The deepest point of this rabbit hole was a <a href="https://learnopengl.com/Advanced-Lighting/Parallax-Mapping">Parallax Occlusion Mapping</a> technique, which performs a type of ray-marching in a pixel shader, i.e., traversing space to determine the intersection of a ray with a surface defined by a heightmap texture. 
This way, a completely flat surface can appear to have complex 3D details.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!JhyO!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc936ecf6-7856-42aa-bdab-ffcab40db060_418x417.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!JhyO!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc936ecf6-7856-42aa-bdab-ffcab40db060_418x417.jpeg 424w, https://substackcdn.com/image/fetch/$s_!JhyO!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc936ecf6-7856-42aa-bdab-ffcab40db060_418x417.jpeg 848w, https://substackcdn.com/image/fetch/$s_!JhyO!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc936ecf6-7856-42aa-bdab-ffcab40db060_418x417.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!JhyO!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc936ecf6-7856-42aa-bdab-ffcab40db060_418x417.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!JhyO!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc936ecf6-7856-42aa-bdab-ffcab40db060_418x417.jpeg" width="418" height="417" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c936ecf6-7856-42aa-bdab-ffcab40db060_418x417.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:417,&quot;width&quot;:418,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!JhyO!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc936ecf6-7856-42aa-bdab-ffcab40db060_418x417.jpeg 424w, https://substackcdn.com/image/fetch/$s_!JhyO!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc936ecf6-7856-42aa-bdab-ffcab40db060_418x417.jpeg 848w, https://substackcdn.com/image/fetch/$s_!JhyO!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc936ecf6-7856-42aa-bdab-ffcab40db060_418x417.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!JhyO!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc936ecf6-7856-42aa-bdab-ffcab40db060_418x417.jpeg 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Parallax Occlusion Mapping technique. The cube&#8217;s surface is entirely flat, but it appears to have details &#8212; image from <a href="https://doc.babylonjs.com/features/featuresDeepDive/materials/using/parallaxMapping">babylon.js</a>.</figcaption></figure></div><h2>GPUs as General Purpose Computers</h2><p>At this point, you may wonder about LLMs, deep learning, and the ability to perform general-purpose computations on GPUs. However, take a look at the shader program above. It is just like a piece of C code. Why can&#8217;t we use that to perform arbitrary computations on the GPU? Indeed, we can, and people have been doing so since the early 2000s. However, we need to address one problem first. How do we get data in and out of the GPU?</p><p>Getting data in is pretty straightforward. We can encode our data as a texture or geometry and upload it to the GPU. But how do we get data out? 
To help with that, we can use techniques like <a href="http://www.opengl-tutorial.org/intermediate-tutorials/tutorial-14-render-to-texture/">render to texture</a>. It allows us to render the output of our shader program to a texture instead of the screen. Then we can read that texture back to the CPU.</p><blockquote><p>For those not familiar with computer graphics terms. Texture is just an image. In computer graphics, textures are used to store image data that can be applied to the surface of 3D models to give them color and detail. A texture is typically a 2D array of pixels, where each pixel contains color information (e.g., RGB values) and sometimes additional information like alpha (transparency) or normal vectors for bump mapping.</p></blockquote><p>This technique is actually even older than shaders themselves, as it was used in the pre-shader era to create effects like dynamic reflections and shadows. For example, to create a reflection effect, you can render the scene from the point of view of a reflected camera (e.g., below the water surface) to a texture, and then use that texture to render the water surface. 
You can use a pixel shader to distort the texture coordinates, simulating water ripples.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!E4qA!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56529d37-e894-44e2-ad1b-58692027ef02_1400x434.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!E4qA!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56529d37-e894-44e2-ad1b-58692027ef02_1400x434.png 424w, https://substackcdn.com/image/fetch/$s_!E4qA!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56529d37-e894-44e2-ad1b-58692027ef02_1400x434.png 848w, https://substackcdn.com/image/fetch/$s_!E4qA!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56529d37-e894-44e2-ad1b-58692027ef02_1400x434.png 1272w, https://substackcdn.com/image/fetch/$s_!E4qA!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56529d37-e894-44e2-ad1b-58692027ef02_1400x434.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!E4qA!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56529d37-e894-44e2-ad1b-58692027ef02_1400x434.png" width="1400" height="434" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/56529d37-e894-44e2-ad1b-58692027ef02_1400x434.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:434,&quot;width&quot;:1400,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!E4qA!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56529d37-e894-44e2-ad1b-58692027ef02_1400x434.png 424w, https://substackcdn.com/image/fetch/$s_!E4qA!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56529d37-e894-44e2-ad1b-58692027ef02_1400x434.png 848w, https://substackcdn.com/image/fetch/$s_!E4qA!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56529d37-e894-44e2-ad1b-58692027ef02_1400x434.png 1272w, https://substackcdn.com/image/fetch/$s_!E4qA!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56529d37-e894-44e2-ad1b-58692027ef02_1400x434.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">A schematic visualization of how the render-to-texture technique can be used to simulate reflections</figcaption></figure></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!YQN3!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14cb8320-c082-46ca-a591-0674f131d598_640x513.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!YQN3!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14cb8320-c082-46ca-a591-0674f131d598_640x513.jpeg 424w, 
https://substackcdn.com/image/fetch/$s_!YQN3!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14cb8320-c082-46ca-a591-0674f131d598_640x513.jpeg 848w, https://substackcdn.com/image/fetch/$s_!YQN3!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14cb8320-c082-46ca-a591-0674f131d598_640x513.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!YQN3!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14cb8320-c082-46ca-a591-0674f131d598_640x513.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!YQN3!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14cb8320-c082-46ca-a591-0674f131d598_640x513.jpeg" width="640" height="513" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/14cb8320-c082-46ca-a591-0674f131d598_640x513.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:513,&quot;width&quot;:640,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!YQN3!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14cb8320-c082-46ca-a591-0674f131d598_640x513.jpeg 424w, 
https://substackcdn.com/image/fetch/$s_!YQN3!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14cb8320-c082-46ca-a591-0674f131d598_640x513.jpeg 848w, https://substackcdn.com/image/fetch/$s_!YQN3!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14cb8320-c082-46ca-a591-0674f131d598_640x513.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!YQN3!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14cb8320-c082-46ca-a591-0674f131d598_640x513.jpeg 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line 
x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">An example of a water reflection effect achieved via the render-to-texture technique. Apparently, I was too lazy to fix the face orientation on the yacht model at the time of making that demo.</figcaption></figure></div><p>Some ingenious people figured out that you can use this technique to perform arbitrary computations on the GPU by encoding your input data as a texture, writing a shader program to perform the calculation, rendering the output to a texture, and then reading that texture back to the CPU.</p><p>What can you achieve with this technique? Everything you can with CUDA today. A popular technique in early GPGPU was to use ping-pong rendering, where two textures are alternated for reading and writing. This way, you can compute, take your input texture, compute some function on it, write the result to the output texture, then use that output texture as input for the following computation, and so on. This way, you can build complex computations by chaining together multiple shader programs. And you don&#8217;t have to work with images specifically. You can encode any data as a texture, e.g., a 2D array of floats, a 3D volume of voxels, a graph, and so on.</p><p>For example, the Fast Fourier Transform (FFT) algorithm can be implemented using shaders and the render-to-texture technique. Here is an example of a GPU-based FFT implementation from <a href="https://developer.nvidia.com/gpugems/gpugems2/part-vi-simulation-and-numerical-algorithms/chapter-48-medical-image-reconstruction">GPU Gems 2</a>, along with its medical image reconstruction.</p><p>Here is how a fragment shader for a single FFT pass looks. It is similar to the CUDA kernel you would write today, as shown below. It is essentially a function that is invoked for each pixel of the output texture. 
It reads data from the input textures, performs some computation, and writes the result as the color of the pixel, which is then stored in the output texture.</p><pre><code>void FragmentProgram(in float4 TexCoordRect
                     : TEXCOORD0, out float4 sColor0
                     : COLOR0, out float4 sColor1
                     : COLOR1, out float4 sColor2
                     : COLOR2, out float4 sColor3
                     : COLOR3, uniform samplerRECT Real1,
                       uniform samplerRECT Imag1, uniform samplerRECT Real2,
                       uniform samplerRECT Imag2,
                       uniform samplerRECT ButterflyLookupI,
                       uniform samplerRECT ButterflyLookupWR,
                       uniform samplerRECT ButterflyLookupWI)
{
  // Read in the butterfly indices (scrambling coordinates)
  float4 i = texRECT(ButterflyLookupI, TexCoordRect.xy);
  // Read in the real components of the butterfly weights
  float4 WR = texRECT(ButterflyLookupWR, TexCoordRect.xy);
  // Read in the imaginary components of the butterfly weights
  float4 WI = texRECT(ButterflyLookupWI, TexCoordRect.xy);
  
  // Perform the butterfly operation, storing results in the output colors
  float2 Res;
  float2 r1 = float2(i.x, TexCoordRect.y);
  float2 r2 = float2(i.w, TexCoordRect.y);
  float4 InputX1 = texRECT(Real1, r1);
  float4 InputY1 = texRECT(Imag1, r1);
  float4 InputX2 = texRECT(Real1, r2);
  float4 InputY2 = texRECT(Imag1, r2);
  Res.x = WR.x * InputX2.x - WI.x * InputY2.x;
  Res.y = WI.x * InputX2.x + WR.x * InputY2.x;
  sColor0.x = InputX1.x + Res.x;
  sColor1.x = InputY1.x + Res.y;
  float4 InputX1_ = texRECT(Real2, r1);
  float4 InputY1_ = texRECT(Imag2, r1);
  float4 InputX2_ = texRECT(Real2, r2);
  float4 InputY2_ = texRECT(Imag2, r2);
  Res.x = WR.x * InputX2_.x - WI.x * InputY2_.x;
  Res.y = WI.x * InputX2_.x + WR.x * InputY2_.x;
  sColor2.x = InputX1_.x + Res.x;
  sColor3.x = InputY1_.x + Res.y;
}</code></pre><blockquote><p>The code above is written in <a href="https://www.khronos.org/opengl/wiki/cg">Cg</a> language. It is an early attempt by Nvidia to dominate the graphics computing market&#8230;<em> </em>I meant to say,<em> </em>simplify shader programming. Luckily, nobody cared about it, and the market relied on a more universally supported GLSL and HLSL languages.</p></blockquote><p>I was fascinated by these developments! This technique unlocked a remarkable number of new applications in computer graphics, science, and the medical field, among others. Personally, I&#8217;ve used it to implement advanced graphics effects. Here is an example of <a href="http://www.uraldev.ru/articles/35/page/2">using FFT to generate a complex water surface</a>. This technique was used in the movie <a href="https://en.wikipedia.org/wiki/Titanic_(1997_film)">Titanic</a> and in some advanced games, such as <a href="https://en.wikipedia.org/wiki/Assassin%27s_Creed">Assassin&#8217;s Creed</a>.</p><p>Are any of those articles worth reading? Of course, not. 
I want to demonstrate how I used Web-Archive to recover some old articles that are no longer available online and add a meme image to the post.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!x7M0!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e643112-406f-4c33-a491-fafbb8e1bbf0_600x450.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!x7M0!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e643112-406f-4c33-a491-fafbb8e1bbf0_600x450.jpeg 424w, https://substackcdn.com/image/fetch/$s_!x7M0!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e643112-406f-4c33-a491-fafbb8e1bbf0_600x450.jpeg 848w, https://substackcdn.com/image/fetch/$s_!x7M0!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e643112-406f-4c33-a491-fafbb8e1bbf0_600x450.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!x7M0!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e643112-406f-4c33-a491-fafbb8e1bbf0_600x450.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!x7M0!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e643112-406f-4c33-a491-fafbb8e1bbf0_600x450.jpeg" width="600" height="450" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4e643112-406f-4c33-a491-fafbb8e1bbf0_600x450.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:450,&quot;width&quot;:600,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!x7M0!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e643112-406f-4c33-a491-fafbb8e1bbf0_600x450.jpeg 424w, https://substackcdn.com/image/fetch/$s_!x7M0!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e643112-406f-4c33-a491-fafbb8e1bbf0_600x450.jpeg 848w, https://substackcdn.com/image/fetch/$s_!x7M0!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e643112-406f-4c33-a491-fafbb8e1bbf0_600x450.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!x7M0!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e643112-406f-4c33-a491-fafbb8e1bbf0_600x450.jpeg 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h2>Enter the CUDA</h2><p>Although the technique of using shaders for general-purpose computations was quite powerful, it was still somewhat limited. The programming model was not very friendly, as you had to encode your data as textures or other graphics primitives. The render-to-texture approach involves rendering a rectangular area of the entire screen, ensuring that all rendered pixels align precisely with the texels of the output texture. It was easy to misconfigure the graphics pipeline, such as forgetting to turn off texture filtering, which would lead to incorrect results.</p><p>All of these details were quite distracting and made it hard to focus on the actual computation, especially for non-graphics programmers. 
Thus, NVIDIA introduced CUDA in 2007, which provided a C-like programming model for writing general-purpose computations on NVIDIA GPUs.</p><p>The programming model is similar to the shader programming model, as you still write a kernel function that is executed in parallel by many threads. Each thread is identified by its 1D, 2D, or 3D index, which you can use to compute the memory address of the data you want to process. In the shader programming model, you would do that using texture coordinates or other varying variables, while you would use thread indices. However, all the scaffolding of setting up the graphics pipeline, managing textures, framebuffers, and so on, is eliminated. You can allocate memory on the GPU, copy data to it, launch a kernel, and copy the results back.</p><p>Here is how the FFT kernel from above would look in CUDA. Again, feel free to skip if you&#8217;re here for the story.</p><pre><code>// Helper function to perform a complex multiply and add operation.
__device__ float2 butterfly_op(float2 a, float2 b, float2 twiddle) {
    // Perform complex multiplication and addition
    float2 temp_result;
    temp_result.x = b.x * twiddle.x - b.y * twiddle.y;
    temp_result.y = b.y * twiddle.x + b.x * twiddle.y;
    // CUDA defines no operator+ for float2, so add the components explicitly
    return make_float2(a.x + temp_result.x, a.y + temp_result.y);
}

__global__ void fft_stage_kernel(
    // Input data arrays (now using float2 for complex numbers)
    float2 *d_input1,
    float2 *d_input2,

    // Combined butterfly lookup tables (now float2 for complex twiddle factors)
    float *d_butterflyLookupI,
    float2 *d_butterflyTwiddles,

    // Output data arrays (now using float2)
    float2 *d_out1,
    float2 *d_out2,

    int width,
    int height
) {
    int tx = blockIdx.x * blockDim.x + threadIdx.x;
    int ty = blockIdx.y * blockDim.y + threadIdx.y;

    if (tx &gt;= width || ty &gt;= height) {
        return;
    }

    int index = ty * width + tx;

    // Read butterfly lookup index and complex twiddle factor
    int lookup_i = (int)d_butterflyLookupI[index];
    float2 twiddle_factor = d_butterflyTwiddles[index];

    // Read input data using combined float2 arrays
    int r1_idx = ty * width + tx;
    int r2_idx = ty * width + lookup_i;

    float2 input1 = d_input1[r1_idx];
    float2 input2 = d_input1[r2_idx];

    // Perform the butterfly operation for the first pair of inputs
    d_out1[index] = butterfly_op(input1, input2, twiddle_factor);

    // Process the second pair of data arrays
    float2 input1_prime = d_input2[r1_idx];
    float2 input2_prime = d_input2[r2_idx];

    // Perform the second butterfly operation
    d_out2[index] = butterfly_op(input1_prime, input2_prime, twiddle_factor);
}</code></pre><blockquote><p>I was waiting to get my hands on a GPU that supported CUDA, again. I was earning money, so there was no need to trick my parent anymore, but high-end PC upgrades were still a considerable expense, and you needed to do them often. My first CUDA-capable GPU was the 8800GT, a GPU from the most legendary series of all time. It leveraged entirely new architecture and has introduced CUDA. In addition, asingle 8800 GTX card was able to outperform two previous-generation 7900 GTX cards in SLI and had comparable power consumption and price ($599 &#8212; hold your tears in your eyes). When will we see such leaps in performance and value again, Mr. Leather-jacket CEO?</p></blockquote><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!N3D3!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F84aaaff1-c7d1-4a0a-a3f3-8f0e6e5937d2_640x508.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!N3D3!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F84aaaff1-c7d1-4a0a-a3f3-8f0e6e5937d2_640x508.jpeg 424w, https://substackcdn.com/image/fetch/$s_!N3D3!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F84aaaff1-c7d1-4a0a-a3f3-8f0e6e5937d2_640x508.jpeg 848w, https://substackcdn.com/image/fetch/$s_!N3D3!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F84aaaff1-c7d1-4a0a-a3f3-8f0e6e5937d2_640x508.jpeg 1272w, 
https://substackcdn.com/image/fetch/$s_!N3D3!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F84aaaff1-c7d1-4a0a-a3f3-8f0e6e5937d2_640x508.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!N3D3!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F84aaaff1-c7d1-4a0a-a3f3-8f0e6e5937d2_640x508.jpeg" width="640" height="508" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/84aaaff1-c7d1-4a0a-a3f3-8f0e6e5937d2_640x508.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:508,&quot;width&quot;:640,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!N3D3!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F84aaaff1-c7d1-4a0a-a3f3-8f0e6e5937d2_640x508.jpeg 424w, https://substackcdn.com/image/fetch/$s_!N3D3!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F84aaaff1-c7d1-4a0a-a3f3-8f0e6e5937d2_640x508.jpeg 848w, https://substackcdn.com/image/fetch/$s_!N3D3!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F84aaaff1-c7d1-4a0a-a3f3-8f0e6e5937d2_640x508.jpeg 1272w, 
https://substackcdn.com/image/fetch/$s_!N3D3!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F84aaaff1-c7d1-4a0a-a3f3-8f0e6e5937d2_640x508.jpeg 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">An entry-level GPU in 2030 with an MSRP of $8799. Image from <a href="https://www.reddit.com/r/pcmasterrace/comments/iktxws/in_a_stunning_move_jensen_huang_also_announces/">Reddit</a>.</figcaption></figure></div><h2>CUDA Moat?</h2><p>As a <strong>true open-source warrior</strong>, I did not use CUDA and relied on OpenCL instead for my work. 
It was not as well-supported as CUDA: debuggers and other tools were not so advanced, there were more glitches, and you could get slightly better performance out of CUDA on NVIDIA hardware. However, its drawbacks were outweighed by the fact that it was an open standard and worked on both AMD and Intel GPUs, so CUDA was far from being a monopoly at that time.</p><p>At my job, I was using OpenCL to implement an algorithm for real-time 3D scanning. The <a href="https://www.artec3d.com/portable-3d-scanners/artec-eva">Artec Eva</a> is a professional 3D scanner used for medical or industrial applications. Real-time 3D scanning involves a significant amount of GPU computation to process the input video stream, identify your position with respect to the environment (similar algorithms are employed as in self-driving cars), fuse all the input data into a single 3D model, and display it on the screen. All of this had to happen in real-time, so the user could see the result immediately and adjust their position if needed.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ue8p!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F623205d2-8084-442c-9e3f-b76cc98063a3_1200x675.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ue8p!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F623205d2-8084-442c-9e3f-b76cc98063a3_1200x675.jpeg 424w, https://substackcdn.com/image/fetch/$s_!ue8p!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F623205d2-8084-442c-9e3f-b76cc98063a3_1200x675.jpeg 848w, 
https://substackcdn.com/image/fetch/$s_!ue8p!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F623205d2-8084-442c-9e3f-b76cc98063a3_1200x675.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!ue8p!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F623205d2-8084-442c-9e3f-b76cc98063a3_1200x675.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!ue8p!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F623205d2-8084-442c-9e3f-b76cc98063a3_1200x675.jpeg" width="1200" height="675" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/623205d2-8084-442c-9e3f-b76cc98063a3_1200x675.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:675,&quot;width&quot;:1200,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!ue8p!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F623205d2-8084-442c-9e3f-b76cc98063a3_1200x675.jpeg 424w, https://substackcdn.com/image/fetch/$s_!ue8p!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F623205d2-8084-442c-9e3f-b76cc98063a3_1200x675.jpeg 848w, 
https://substackcdn.com/image/fetch/$s_!ue8p!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F623205d2-8084-442c-9e3f-b76cc98063a3_1200x675.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!ue8p!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F623205d2-8084-442c-9e3f-b76cc98063a3_1200x675.jpeg 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Real-time scanning of a 3D object with an Artec scanner. 
Scanner localization, data fusion, and visualization are performed in real-time on a GPU using OpenCL.</figcaption></figure></div><p>Real-time scanning of a 3D object with an Artec scanner. Scanner localization, data fusion, and visualization are performed in real-time on a GPU using OpenCL.</p><p>I opted for OpenCL, which was a brave choice back then and possibly a bad product decision at the time, as when you buy a $12000 3D scanner, you can afford a decent GPU and not worry about vendor lock-in. However, over time, as GPUs became more powerful and it became possible to run the pipeline on a laptop GPU, specifically <a href="https://en.wikipedia.org/wiki/Microsoft_Surface">Microsoft Surface</a> tablet, the choice of OpenCL has become more relevant. Now, an operator had a lightweight display in their hands and could walk around the object being scanned. At least, this is what I tell myself to feel better about my choice &#128517;</p><div id="youtube2-deyP0j45U5c" class="youtube-wrap" data-attrs="{&quot;videoId&quot;:&quot;deyP0j45U5c&quot;,&quot;startTime&quot;:null,&quot;endTime&quot;:null}" data-component-name="Youtube2ToDOM"><div class="youtube-inner"><iframe src="https://www.youtube-nocookie.com/embed/deyP0j45U5c?rel=0&amp;autoplay=0&amp;showinfo=0&amp;enablejsapi=0" frameborder="0" loading="lazy" gesture="media" allow="autoplay; fullscreen" allowautoplay="true" allowfullscreen="true" width="728" height="409"></iframe></div></div><p>In addition to OpenCL, there were many other hardware-agnostic GPGPU frameworks to choose from, including <a href="https://halide-lang.org/">Halide</a>, <a href="https://arrayfire.com/">ArrayFire</a>, and <a href="https://numba.pydata.org/">Numba</a>. 
So, all things considered, the open-source and open-standard ecosystem was a fair contender to CUDA back then, and CUDA didn&#8217;t have its moat yet.</p><h2>Deep Learning Revolution</h2><p>The new GPU programming capabilities unlocked by CUDA/OpenCL have enabled numerous new applications in computer graphics, science, and medical fields, among others. However, the popularization of deep learning (this is what we called AI before ChatGPT came along) is arguably the most noticeable outcome.</p><p>Many think that thanks to AI, GPUs have become the central compute platform. In fact, it is the other way around. Thanks to GPUs, we have AI in the first place. Deep convolutional neural networks have been known since the 90s. In 2012, a graduate student, Alex Krizhevsky, motivated by Ilya Sutskever, trained a deep convolutional neural network under the guidance of Geoffrey Hinton using a couple of GeForce GPUs to enter the <a href="https://en.wikipedia.org/wiki/ImageNet#ImageNet_Challenge">ImageNet challenge</a>. 
The model was called <a href="https://en.wikipedia.org/wiki/AlexNet">AlexNet</a>, and the dataset consisted of 1.2 million images belonging to 1,000 categories.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!YnI4!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0b450b99-0a54-4287-91d5-e464de6d48e7_770x978.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!YnI4!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0b450b99-0a54-4287-91d5-e464de6d48e7_770x978.png 424w, https://substackcdn.com/image/fetch/$s_!YnI4!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0b450b99-0a54-4287-91d5-e464de6d48e7_770x978.png 848w, https://substackcdn.com/image/fetch/$s_!YnI4!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0b450b99-0a54-4287-91d5-e464de6d48e7_770x978.png 1272w, https://substackcdn.com/image/fetch/$s_!YnI4!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0b450b99-0a54-4287-91d5-e464de6d48e7_770x978.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!YnI4!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0b450b99-0a54-4287-91d5-e464de6d48e7_770x978.png" width="770" height="978" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0b450b99-0a54-4287-91d5-e464de6d48e7_770x978.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:978,&quot;width&quot;:770,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!YnI4!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0b450b99-0a54-4287-91d5-e464de6d48e7_770x978.png 424w, https://substackcdn.com/image/fetch/$s_!YnI4!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0b450b99-0a54-4287-91d5-e464de6d48e7_770x978.png 848w, https://substackcdn.com/image/fetch/$s_!YnI4!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0b450b99-0a54-4287-91d5-e464de6d48e7_770x978.png 1272w, https://substackcdn.com/image/fetch/$s_!YnI4!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0b450b99-0a54-4287-91d5-e464de6d48e7_770x978.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">The obligatory xkcd meme. The <a href="https://xkcd.com/2347/">original</a>.</figcaption></figure></div><p>The results? They have obliterated the state-of-the-art computer vision models at the time, demonstrating a whopping 9.4% increase in accuracy over the previous state-of-the-art. This was a game-changer. 
It has triggered a deep learning revolution, where all breakthroughs in computer vision, natural language processing, and other fields were achieved using deep learning models trained on GPUs.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!oZGF!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F560027cf-01cc-4c6c-b668-11deeeb198c2_665x419.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!oZGF!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F560027cf-01cc-4c6c-b668-11deeeb198c2_665x419.jpeg 424w, https://substackcdn.com/image/fetch/$s_!oZGF!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F560027cf-01cc-4c6c-b668-11deeeb198c2_665x419.jpeg 848w, https://substackcdn.com/image/fetch/$s_!oZGF!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F560027cf-01cc-4c6c-b668-11deeeb198c2_665x419.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!oZGF!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F560027cf-01cc-4c6c-b668-11deeeb198c2_665x419.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!oZGF!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F560027cf-01cc-4c6c-b668-11deeeb198c2_665x419.jpeg" width="665" height="419" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/560027cf-01cc-4c6c-b668-11deeeb198c2_665x419.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:419,&quot;width&quot;:665,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!oZGF!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F560027cf-01cc-4c6c-b668-11deeeb198c2_665x419.jpeg 424w, https://substackcdn.com/image/fetch/$s_!oZGF!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F560027cf-01cc-4c6c-b668-11deeeb198c2_665x419.jpeg 848w, https://substackcdn.com/image/fetch/$s_!oZGF!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F560027cf-01cc-4c6c-b668-11deeeb198c2_665x419.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!oZGF!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F560027cf-01cc-4c6c-b668-11deeeb198c2_665x419.jpeg 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">The best ImageNet challenge results in 2010 and 2011, compared against all results in 2012, including AlexNet. Image from Pinecone&#8217;s article: <a href="https://www.pinecone.io/learn/series/image-search/imagenet/">AlexNet and ImageNet: The Birth of Deep Learning</a></figcaption></figure></div><h2>The Array Programming Model</h2><p>GPU computing has caused great upheaval in the machine learning field, while the latter has retaliated by drastically changing the way we program GPUs. The programming model has shifted from writing kernels that operate on individual elements of an array to writing code that operates on entire arrays (tensors) at once.</p><p>The reason for this is that deep-learning frameworks like Tensorflow or PyTorch were inspired not by graphics programming, but scientific computing frameworks like <a href="https://numpy.org/">NumPy</a> and <a href="https://www.mathworks.com/products/matlab.html">MATLAB</a>. 
The programming model differs significantly from those of CUDA and OpenCL. Instead of writing kernels that operate on individual elements of an array, you write code that operates on entire arrays (tensors) at once. The framework breaks down the operations into smaller pieces that can be executed in parallel on the GPU. This programming model, known as <a href="https://en.wikipedia.org/wiki/Array_programming">array programming</a>, dates back to the 60s with the development of languages like <a href="https://en.wikipedia.org/wiki/APL_(programming_language)">APL</a> and <a href="https://en.wikipedia.org/wiki/Fortran">Fortran</a>.</p><blockquote><p>I am skipping the first and popular at the time declarative deep learning framework <a href="https://en.wikipedia.org/wiki/Caffe_(software)">Caffe</a>. It was suitable for defining a large number of models, but it was not appropriate for expressing arbitrary computations on tensors.</p></blockquote><p>This programming model has one tremendous advantage. It is much easier to reason about the code, as you don&#8217;t have to think about how to parallelize the computation. You write code that operates on entire arrays, and the framework takes care of the rest. It made GPU programming accessible to a much wider audience, as you didn&#8217;t have to be a GPU programming expert to write code that runs on the GPU. It is so convenient that many GPU programming experts, myself included, have switched to using these frameworks for their work. It allows you to express your ideas much more concisely and focus on the problem at hand, rather than the intricacies of GPU programming. Additionally, frameworks like PyTorch and Tensorflow come with an automatic differentiation engine, which allows you to compute gradients of your functions automatically. This is especially useful for training neural networks, but it can also be applied to other applications.</p><p>Here is a simple numpy program. 
Even without knowing numpy, you can figure out what it does. It creates a couple of arrays, performs some basic operations on them, and prints the results.</p><pre><code>import numpy as np

# Create a 1-dimensional array from a Python list
array1d = np.array([1, 2, 3, 4, 5])

# Create a 2-dimensional array (matrix)
array2d = np.array([[10, 20, 30], [40, 50, 60]])

# Element-wise addition
sum_array = array1d + 5

# Element-wise multiplication
product_array = array1d * 2

# Sum of all elements in an array
total_sum = np.sum(array1d)

# Mean of elements in an array
mean_value = np.mean(array1d)

# Accessing elements
print(&quot;First element of array1d:&quot;, array1d[0])
print(&quot;Element at row 0, column 1 of array2d:&quot;, array2d[0, 1])</code></pre><h2>Why Is Programming GPUs Hard?</h2><p>With the convenience of the array programming model comes a significant drawback. It is hard to optimize the code for performance. To understand the reason, we first need to consider why it is hard to optimize code for GPUs in the first place.</p><p>There are several reasons why GPU programming is a complex task. Still, the primary limitation is that memory bandwidth heavily restricts GPUs, so GPU architects have introduced numerous complex mechanisms to hide the latency of memory accesses and maximize the utilization of available bandwidth. Developers need to understand these mechanisms and write code that leverages them. This is not an easy task, as it requires a deep understanding of the GPU architecture and the specific details of the memory hierarchy.</p><p>Think about the following example. The most powerful CPU at the time of writing is <a href="https://www.amd.com/en/products/processors/server/epyc/9005-series/amd-epyc-9965.html">AMD EPYC 9965</a>. It offers a whopping 192 cores and 384 threads. The per-socket memory bandwidth is about 614 GB/s. However, its number of cores pales in comparison with the most powerful GPU, which is <a href="https://www.nvidia.com/en-us/data-center/dgx-b200/">NVIDIA B200</a> at the time of writing. It offers 16,896 CUDA cores and up to 8TB/s of memory bandwidth per GPU.</p><p>Now, you might see the problem: each CPU core has about 3.2 GB/s of memory bandwidth, while each GPU core has only about 0.47 GB/s of memory bandwidth. This means that each GPU core must perform significantly more work to hide the latency of memory accesses and make the best use of the available bandwidth. The situation with consumer GPUs is even worse, e.g. 
the <a href="https://www.nvidia.com/en-us/geforce/graphics-cards/50-series/rtx-5090/">RTX 5090</a> has 21,760 CUDA cores and 1,792 GB/s of memory bandwidth, which gives only about 0.082 GB/s per core. This means that GPUs must perform significantly more computations per memory access to achieve optimal performance.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!c6xP!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff77b0d49-27c0-4435-842f-89bf28545a48_1160x1200.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!c6xP!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff77b0d49-27c0-4435-842f-89bf28545a48_1160x1200.jpeg 424w, https://substackcdn.com/image/fetch/$s_!c6xP!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff77b0d49-27c0-4435-842f-89bf28545a48_1160x1200.jpeg 848w, https://substackcdn.com/image/fetch/$s_!c6xP!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff77b0d49-27c0-4435-842f-89bf28545a48_1160x1200.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!c6xP!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff77b0d49-27c0-4435-842f-89bf28545a48_1160x1200.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!c6xP!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff77b0d49-27c0-4435-842f-89bf28545a48_1160x1200.jpeg" width="1160" height="1200" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f77b0d49-27c0-4435-842f-89bf28545a48_1160x1200.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1200,&quot;width&quot;:1160,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!c6xP!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff77b0d49-27c0-4435-842f-89bf28545a48_1160x1200.jpeg 424w, https://substackcdn.com/image/fetch/$s_!c6xP!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff77b0d49-27c0-4435-842f-89bf28545a48_1160x1200.jpeg 848w, https://substackcdn.com/image/fetch/$s_!c6xP!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff77b0d49-27c0-4435-842f-89bf28545a48_1160x1200.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!c6xP!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff77b0d49-27c0-4435-842f-89bf28545a48_1160x1200.jpeg 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><blockquote><p>The relationship between compute power and memory bandwidth in the GPU computing world is referred to as the ALU-to-memory ratio, which represents the number of operations a GPU core can perform per memory access. For GPUs, this ratio is much higher than for CPUs. It can be dozens or even hundreds of operations per memory access.</p><p>The same problem exists for all other parallel computing platforms, such as TPUs, neural processors, and FPGAs. The memory bandwidth per processing unit is always much lower than that of a CPU core. Between 2017 and 2022, I was optimizing neural network inference at Apple for their custom neural processors. We have shipped models such as Animoji, FaceID, Portrait mode, and numerous models that run on Apple Vision Pro. 
For each of these models, we&#8217;ve had to ensure there is no swapping of data between the on-chip memory and DRAM, as the memory bandwidth was the main bottleneck.</p></blockquote><p>To work around this limitation, GPUs employ several techniques, such as using <strong>shared memory</strong> &#8212;a small amount of memory shared among a group of threads. This allows threads to cooperate and share data without accessing global memory, which is significantly slower. Another technique is to use <strong>memory coalescing</strong>, which enables threads to access memory in a way that minimizes the number of memory transactions. This is achieved by ensuring that threads access contiguous memory locations, which allows the GPU to fetch multiple data elements in a single memory transaction. GPU cores also have access to <strong>more registers</strong> than CPU cores, which can also be used to store intermediate data. However, registers are shared among cores (threads in a workgroup), so if you&#8217;re using too many, some cores will be turned off.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!xzTp!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd85745e6-8699-45a0-b8e0-79c0a7c61db3_825x857.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!xzTp!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd85745e6-8699-45a0-b8e0-79c0a7c61db3_825x857.jpeg 424w, https://substackcdn.com/image/fetch/$s_!xzTp!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd85745e6-8699-45a0-b8e0-79c0a7c61db3_825x857.jpeg 848w, 
https://substackcdn.com/image/fetch/$s_!xzTp!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd85745e6-8699-45a0-b8e0-79c0a7c61db3_825x857.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!xzTp!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd85745e6-8699-45a0-b8e0-79c0a7c61db3_825x857.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!xzTp!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd85745e6-8699-45a0-b8e0-79c0a7c61db3_825x857.jpeg" width="825" height="857" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d85745e6-8699-45a0-b8e0-79c0a7c61db3_825x857.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:857,&quot;width&quot;:825,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!xzTp!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd85745e6-8699-45a0-b8e0-79c0a7c61db3_825x857.jpeg 424w, https://substackcdn.com/image/fetch/$s_!xzTp!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd85745e6-8699-45a0-b8e0-79c0a7c61db3_825x857.jpeg 848w, 
https://substackcdn.com/image/fetch/$s_!xzTp!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd85745e6-8699-45a0-b8e0-79c0a7c61db3_825x857.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!xzTp!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd85745e6-8699-45a0-b8e0-79c0a7c61db3_825x857.jpeg 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">GPU-machine memory hierarchy for NVIDIA Fermi (2010) architecture &#8212; transfer speeds on modern GPUs are about 
5&#8211;10 times faster, but relationships are similar. Illustration from the publication <a href="https://www.researchgate.net/publication/51927898_Accelerating_Radio_Astronomy_Cross-Correlation_with_Graphics_Processing_Units">Accelerating Radio Astronomy Cross-Correlation with Graphics Processing Units</a></figcaption></figure></div><p>Enough complex terms! If you&#8217;re to take away one thing from this post, it is this: <strong>the most effective way to optimize a GPU program is to perform more computations per memory access</strong>. In other words, <strong>ensure that data doesn&#8217;t leave the GPU core for as long as possible</strong>. Let&#8217;s pin this and come back to the array programming model and the performance issues that it introduces.</p><h2>I Love PyTorch! What Can Possibly be Wrong with It?</h2><p>Let&#8217;s take a look at how a simple CUDA kernel to perform an array operation like A*B + C would look. Here, <code>A</code>, <code>B,</code> and <code>C</code> are large arrays (tensors) and the operation is performed element-wise, e.g., <code>[1, 2, 3] * [2, 2, 2] + [1, 1, 1] = [3, 5, 7]</code></p><pre><code>__global__ void array_op(const float *A, const float *B, const float *C, float *D, int N) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx &lt; N) {
        D[idx] = A[idx] * B[idx] + C[idx];
    }
}</code></pre><p>This kernel is straightforward. Each thread computes a single element of the output array <code>D</code> by reading the corresponding elements from the input arrays <code>A</code>, <code>B,</code> and <code>C</code>.</p><p>Now, let&#8217;s take a look at how the same operation would look in PyTorch.</p><pre><code>import torch

A = torch.randn(1000000, device='cuda')
B = torch.randn(1000000, device='cuda')
C = torch.randn(1000000, device='cuda')
D = A * B + C</code></pre><p>If you naively translate PyTorch operations, such as element-wise multiplication and addition, to CUDA, which is how it is actually done in practice, you would get two kernels: one for multiplication and one for addition. The runtime would launch a kernel to perform element-wise multiplication, store the result in a temporary array <code>E</code>, and then launch another kernel to perform element-wise addition using <code>E</code> and <code>C</code> to produce the final tensor <code>D</code>.</p><pre><code>__global__ void array_mul(const float *A, const float *B, float *E, int N) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx &lt; N) {
        E[idx] = A[idx] * B[idx];
    }
}

__global__ void array_add(const float *E, const float *C, float *D, int N) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx &lt; N) {
        D[idx] = E[idx] + C[idx];
    }
}</code></pre><p>You can see the problem now:</p><ul><li><p>In the original example, we fetch data once from memory and perform two operations (multiplication and addition) on it.</p></li><li><p>In the PyTorch example, we fetch data twice from memory and perform only one operation (multiplication or addition) on it.</p></li></ul><p>Given that our program is completely memory-bound, the PyTorch version will be practically <strong>twice as slow as the CUDA version</strong>, because it performs half the number of operations per memory access. If we add one more unfused element-wise operation, it will be <strong>three times as slow</strong>, and so on.</p><p>You may wonder, can&#8217;t we generate a single fused kernel that performs both operations simultaneously? The answer is yes, we can. In fact, both PyTorch and TensorFlow have a mechanism to do that. However, this is not an easy problem to solve in a general way. PyTorch officially supports more than 1200 operations that can be performed on tensors. The number of possible combinations of these operations is astronomical. Many of these operations are not even element-wise, e.g., matrix multiplication, convolutions, reductions, and so on. It is a complex problem to solve in a general way. For PyTorch, it is challenging, as it is a dynamic framework, i.e., the computation graph is built on-the-fly as the code is executed. 
This makes it difficult to analyze the entire computation graph and determine which operations can be fused.</p><p>This problem remains unsolved in a general way to date, as you&#8217;ll see when we discuss <a href="https://github.com/Dao-AILab/flash-attention">Flash Attention</a> in the context of LLM inference.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Obyn!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93039b95-deb8-41a2-8f69-b5881751bf6b_720x709.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Obyn!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93039b95-deb8-41a2-8f69-b5881751bf6b_720x709.jpeg 424w, https://substackcdn.com/image/fetch/$s_!Obyn!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93039b95-deb8-41a2-8f69-b5881751bf6b_720x709.jpeg 848w, https://substackcdn.com/image/fetch/$s_!Obyn!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93039b95-deb8-41a2-8f69-b5881751bf6b_720x709.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!Obyn!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93039b95-deb8-41a2-8f69-b5881751bf6b_720x709.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Obyn!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93039b95-deb8-41a2-8f69-b5881751bf6b_720x709.jpeg" width="720" height="709" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/93039b95-deb8-41a2-8f69-b5881751bf6b_720x709.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:709,&quot;width&quot;:720,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!Obyn!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93039b95-deb8-41a2-8f69-b5881751bf6b_720x709.jpeg 424w, https://substackcdn.com/image/fetch/$s_!Obyn!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93039b95-deb8-41a2-8f69-b5881751bf6b_720x709.jpeg 848w, https://substackcdn.com/image/fetch/$s_!Obyn!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93039b95-deb8-41a2-8f69-b5881751bf6b_720x709.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!Obyn!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93039b95-deb8-41a2-8f69-b5881751bf6b_720x709.jpeg 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h2>NVidia Domination</h2><p>The Deep Learning revolution has dramatically changed the GPU programming landscape. The array programming model has made GPU programming accessible to a much wider audience, as you don&#8217;t have to be a GPU programming expert to write code that runs on the GPU. However, it has also introduced new challenges, such as optimizing memory access patterns and fusing operations to achieve optimal performance.</p><p>This has created a strong moat for NVIDIA. Although CUDA was just one of the many GPGPU frameworks available at the time, the CUDA ecosystem had a great deal more to offer the community. For example, it had CUDNN, a highly optimized library for deep learning primitives such as convolutions, pooling, and normalization. This library was used by all major deep learning frameworks like TensorFlow and PyTorch to achieve good performance on NVIDIA GPUs. 
Additionally, NVIDIA has invested heavily in optimizing its hardware for deep learning workloads, for example, by introducing Tensor Cores, which are specialized hardware units designed for performing matrix multiplications and convolutions.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Wa_x!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6b32cbc5-5fe1-424d-85c1-156eb27bc9f8_742x500.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Wa_x!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6b32cbc5-5fe1-424d-85c1-156eb27bc9f8_742x500.jpeg 424w, https://substackcdn.com/image/fetch/$s_!Wa_x!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6b32cbc5-5fe1-424d-85c1-156eb27bc9f8_742x500.jpeg 848w, https://substackcdn.com/image/fetch/$s_!Wa_x!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6b32cbc5-5fe1-424d-85c1-156eb27bc9f8_742x500.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!Wa_x!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6b32cbc5-5fe1-424d-85c1-156eb27bc9f8_742x500.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Wa_x!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6b32cbc5-5fe1-424d-85c1-156eb27bc9f8_742x500.jpeg" width="742" height="500" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/6b32cbc5-5fe1-424d-85c1-156eb27bc9f8_742x500.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:500,&quot;width&quot;:742,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!Wa_x!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6b32cbc5-5fe1-424d-85c1-156eb27bc9f8_742x500.jpeg 424w, https://substackcdn.com/image/fetch/$s_!Wa_x!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6b32cbc5-5fe1-424d-85c1-156eb27bc9f8_742x500.jpeg 848w, https://substackcdn.com/image/fetch/$s_!Wa_x!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6b32cbc5-5fe1-424d-85c1-156eb27bc9f8_742x500.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!Wa_x!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6b32cbc5-5fe1-424d-85c1-156eb27bc9f8_742x500.jpeg 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Link to the <a href="https://imgflip.com/i/60t85x">original</a></figcaption></figure></div><p>In the Deep Learning age, the NVIDIA GPUs have become the <strong>de facto standard for deep learning workloads</strong>. All major deep learning frameworks like PyTorch and Tensorflow were built on top of CUDNN, initially not even offering the option to use other backends like OpenCL or ROCm. All research has been done on NVIDIA hardware, as it was the only hardware that supported the tools they were using. This has created a strong network effect, as everyone was using NVIDIA hardware, so everyone was optimizing their code for NVIDIA hardware, which made NVIDIA hardware even more attractive.</p><blockquote><p>From 2010 to the present, I have exclusively owned NVIDIA GPUs. 
Even though some AMD models were offering more value, the need to be able to perform AI-related work has always steered me into the Team Green camp.</p></blockquote><p>Ironically, as innovative as CUDA was, the moat was created not by CUDA itself, but by <strong>the army of NVIDIA engineers who have optimized CUDNN and other libraries for deep learning workloads</strong>. There was simply no good algorithm to optimize computational graphs in a general way, so NVIDIA engineers have hand-optimized the most common patterns that appear in deep learning workloads.</p><blockquote><p>There are many attempts to come up with an automatic way to optimize computational graphs or at least to come up with a universal, hardware-agnostic AI stack that makes the optimization process easier, like <a href="https://www.tensorflow.org/xla">XLA</a> from Google, <a href="https://tvm.apache.org/">TVM</a> from the Apache Foundation, <a href="https://mlir.llvm.org/">MLIR</a> from LLVM or <a href="https://www.modular.com/max">MAX</a> from Modular AI. However, none of these attempts have been able to beat hand-optimized libraries like CUDNN on NVIDIA hardware on a large enough number of real-world use cases and establish a strong enough network effect.</p></blockquote><h2>The AI Era &#8212; Bigger is Better</h2><p>History doesn&#8217;t repeat itself, but it often rhymes. The computational power of GPUs has triggered the deep learning revolution. We&#8217;ve used the same algorithms that were known since the 90s, but now we can train much larger models on much larger datasets. The same thing happened with LLMs. The <a href="https://arxiv.org/abs/1706.03762">transformer architecture</a> was known since 2017, but it was only in 2020 that we saw the first large-scale transformer models like GPT-3 and BERT. The reason for that is that training these models requires a lot of computational power and memory bandwidth. OpenAI has trained GPT-3 on a cluster of 10,000 GPUs. 
The largest model, GPT-3, has 175 billion parameters and was trained on a dataset of 570GB of text data. The training process took several weeks and cost several million dollars (and probably raised global temperature by a degree or so).</p><p>How did AI affect the GPU programming landscape? Not much, actually. The same array programming model is used for training and inference of LLMs. The same challenges of optimizing memory access patterns and fusing operations to achieve good performance still exist. However, the scale of the models has increased dramatically, which has introduced new challenges, like distributing the model across multiple GPUs and optimizing communication between GPUs.</p><p>The large scale of the models has also introduced new challenges for inference. The models are so large that they don&#8217;t fit into the memory of a single GPU. For example, GPT-3 requires about 700GB of memory to store the model parameters, which is much larger than the memory of even the most powerful GPUs available today. 
This has led to the development of techniques such as model parallelism, where the model is split across multiple GPUs, and pipeline parallelism, where different parts of the model are executed on separate GPUs in a pipelined manner.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!gy9M!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa553f95b-afbe-4a7b-9ee3-af1f98584ab4_640x753.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!gy9M!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa553f95b-afbe-4a7b-9ee3-af1f98584ab4_640x753.jpeg 424w, https://substackcdn.com/image/fetch/$s_!gy9M!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa553f95b-afbe-4a7b-9ee3-af1f98584ab4_640x753.jpeg 848w, https://substackcdn.com/image/fetch/$s_!gy9M!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa553f95b-afbe-4a7b-9ee3-af1f98584ab4_640x753.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!gy9M!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa553f95b-afbe-4a7b-9ee3-af1f98584ab4_640x753.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!gy9M!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa553f95b-afbe-4a7b-9ee3-af1f98584ab4_640x753.jpeg" width="640" height="753" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a553f95b-afbe-4a7b-9ee3-af1f98584ab4_640x753.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:753,&quot;width&quot;:640,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!gy9M!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa553f95b-afbe-4a7b-9ee3-af1f98584ab4_640x753.jpeg 424w, https://substackcdn.com/image/fetch/$s_!gy9M!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa553f95b-afbe-4a7b-9ee3-af1f98584ab4_640x753.jpeg 848w, https://substackcdn.com/image/fetch/$s_!gy9M!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa553f95b-afbe-4a7b-9ee3-af1f98584ab4_640x753.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!gy9M!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa553f95b-afbe-4a7b-9ee3-af1f98584ab4_640x753.jpeg 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Link to the <a href="https://www.reddit.com/r/memes/comments/1g52kiy/fixed_that_for_you/">original</a></figcaption></figure></div><h2>The Case of Flash Attention</h2><p>One of the most essential operations in transformer models is the attention mechanism. The attention mechanism allows the model to focus on different parts of the input sequence when making predictions. 
The attention mechanism is implemented using a series of matrix multiplications and softmax operations (see the rightmost diagram in the image below).</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!YldV!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feace2d66-7de6-4e11-b5cb-bb0a959f8e85_1280x716.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!YldV!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feace2d66-7de6-4e11-b5cb-bb0a959f8e85_1280x716.png 424w, https://substackcdn.com/image/fetch/$s_!YldV!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feace2d66-7de6-4e11-b5cb-bb0a959f8e85_1280x716.png 848w, https://substackcdn.com/image/fetch/$s_!YldV!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feace2d66-7de6-4e11-b5cb-bb0a959f8e85_1280x716.png 1272w, https://substackcdn.com/image/fetch/$s_!YldV!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feace2d66-7de6-4e11-b5cb-bb0a959f8e85_1280x716.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!YldV!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feace2d66-7de6-4e11-b5cb-bb0a959f8e85_1280x716.png" width="1280" height="716" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/eace2d66-7de6-4e11-b5cb-bb0a959f8e85_1280x716.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:716,&quot;width&quot;:1280,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!YldV!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feace2d66-7de6-4e11-b5cb-bb0a959f8e85_1280x716.png 424w, https://substackcdn.com/image/fetch/$s_!YldV!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feace2d66-7de6-4e11-b5cb-bb0a959f8e85_1280x716.png 848w, https://substackcdn.com/image/fetch/$s_!YldV!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feace2d66-7de6-4e11-b5cb-bb0a959f8e85_1280x716.png 1272w, https://substackcdn.com/image/fetch/$s_!YldV!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feace2d66-7de6-4e11-b5cb-bb0a959f8e85_1280x716.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Attention mechanism in transformers. 
Illustration from the original <a href="https://arxiv.org/abs/1706.03762">paper</a>.</figcaption></figure></div><p>The softmax operation involves computing the exponential of each element in the input matrix, summing them up, and then dividing each component by the sum.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!0N2O!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2780e84a-ecbc-4818-b21f-8557f74c13a9_967x212.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!0N2O!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2780e84a-ecbc-4818-b21f-8557f74c13a9_967x212.png 424w, https://substackcdn.com/image/fetch/$s_!0N2O!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2780e84a-ecbc-4818-b21f-8557f74c13a9_967x212.png 848w, https://substackcdn.com/image/fetch/$s_!0N2O!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2780e84a-ecbc-4818-b21f-8557f74c13a9_967x212.png 1272w, https://substackcdn.com/image/fetch/$s_!0N2O!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2780e84a-ecbc-4818-b21f-8557f74c13a9_967x212.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!0N2O!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2780e84a-ecbc-4818-b21f-8557f74c13a9_967x212.png" width="967" height="212" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2780e84a-ecbc-4818-b21f-8557f74c13a9_967x212.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:212,&quot;width&quot;:967,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!0N2O!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2780e84a-ecbc-4818-b21f-8557f74c13a9_967x212.png 424w, https://substackcdn.com/image/fetch/$s_!0N2O!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2780e84a-ecbc-4818-b21f-8557f74c13a9_967x212.png 848w, https://substackcdn.com/image/fetch/$s_!0N2O!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2780e84a-ecbc-4818-b21f-8557f74c13a9_967x212.png 1272w, https://substackcdn.com/image/fetch/$s_!0N2O!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2780e84a-ecbc-4818-b21f-8557f74c13a9_967x212.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>Looks challenging to optimize, right? How can we reduce the number of memory accesses here? 
The naive implementation would involve reading the input matrices from memory, multiplying them together, storing the result in a temporary matrix, reading the temporary matrix from memory, computing the exponential of each element, summing them up, and then dividing each component by the sum. This would involve a lot of memory accesses and would be very slow. And it is slow!</p><p>However, previously I have mentioned that GPUs come with a bit of fast on-chip memory called shared memory (SRAM in hardware terms &#8212; static random access memory). It is a small amount of memory that is shared between a block of GPU cores. This memory is much faster than the global memory (GDDR or HBM) and can be used to store intermediate results. The original Flash Attention implementation was developed and benchmarked on A100, which has 80GB of HBM memory and 192KB of shared memory per SM. The SRAM speed was about 19TB/s, and the HBM speed was about 1.5&#8211;2.0TB/s.</p><p>The authors of Flash Attention have devised a method to partition computations in a way that allows intermediate results to fit into shared memory, enabling them to perform the entire attention computation with fewer trips to global memory. This is achieved by partitioning the input matrices into smaller tiles, performing calculations on these tiles (including matrix multiplications and softmax operations), and streaming the results back into global memory. 
The result is a significant speedup over the naive implementation.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!okt-!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdf111acf-db22-4139-ab3c-845b7e30894b_1400x573.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!okt-!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdf111acf-db22-4139-ab3c-845b7e30894b_1400x573.png 424w, https://substackcdn.com/image/fetch/$s_!okt-!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdf111acf-db22-4139-ab3c-845b7e30894b_1400x573.png 848w, https://substackcdn.com/image/fetch/$s_!okt-!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdf111acf-db22-4139-ab3c-845b7e30894b_1400x573.png 1272w, https://substackcdn.com/image/fetch/$s_!okt-!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdf111acf-db22-4139-ab3c-845b7e30894b_1400x573.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!okt-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdf111acf-db22-4139-ab3c-845b7e30894b_1400x573.png" width="1400" height="573" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/df111acf-db22-4139-ab3c-845b7e30894b_1400x573.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:573,&quot;width&quot;:1400,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!okt-!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdf111acf-db22-4139-ab3c-845b7e30894b_1400x573.png 424w, https://substackcdn.com/image/fetch/$s_!okt-!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdf111acf-db22-4139-ab3c-845b7e30894b_1400x573.png 848w, https://substackcdn.com/image/fetch/$s_!okt-!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdf111acf-db22-4139-ab3c-845b7e30894b_1400x573.png 1272w, https://substackcdn.com/image/fetch/$s_!okt-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdf111acf-db22-4139-ab3c-845b7e30894b_1400x573.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Left: FlashAttention uses tiling to prevent materialization of the large &#119873; &#215; &#119873; attention matrix (dotted box) on (relatively) slow GPU HBM. In the outer loop (red arrows), FlashAttention loops through blocks of the K and V matrices and loads them to fast on-chip SRAM. In each block, FlashAttention loops over blocks of Q matrix (blue arrows), loading them to SRAM, and writing the output of the attention computation back to HBM. Right: Speedup over the PyTorch implementation of attention on GPT-2. FlashAttention does not read and write the large &#119873; &#215; &#119873; attention matrix to HBM, resulting in a 7.6&#215; speedup on the attention computation. Illustration from the original <a href="https://arxiv.org/abs/2205.14135">paper</a>.</p><h2>Conclusion</h2><p>The GPU programming landscape has changed dramatically over the past two decades. 
The introduction of CUDA and OpenCL has made GPU programming accessible to a much wider audience, triggering the deep learning revolution that, in turn, has changed the way we program GPUs. The array programming model has made it easier to write code that runs on the GPU, but it has also introduced new challenges, such as optimizing memory access patterns and fusing operations to achieve optimal performance.</p><p>Now that you&#8217;re a certified GPU programming expert, enjoy the last meme and get your GPU cranking!</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!oGie!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F43635cd8-2a9a-4b32-bdb4-0c338074bfad_720x562.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!oGie!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F43635cd8-2a9a-4b32-bdb4-0c338074bfad_720x562.jpeg 424w, https://substackcdn.com/image/fetch/$s_!oGie!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F43635cd8-2a9a-4b32-bdb4-0c338074bfad_720x562.jpeg 848w, https://substackcdn.com/image/fetch/$s_!oGie!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F43635cd8-2a9a-4b32-bdb4-0c338074bfad_720x562.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!oGie!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F43635cd8-2a9a-4b32-bdb4-0c338074bfad_720x562.jpeg 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!oGie!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F43635cd8-2a9a-4b32-bdb4-0c338074bfad_720x562.jpeg" width="720" height="562" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/43635cd8-2a9a-4b32-bdb4-0c338074bfad_720x562.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:562,&quot;width&quot;:720,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!oGie!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F43635cd8-2a9a-4b32-bdb4-0c338074bfad_720x562.jpeg 424w, https://substackcdn.com/image/fetch/$s_!oGie!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F43635cd8-2a9a-4b32-bdb4-0c338074bfad_720x562.jpeg 848w, https://substackcdn.com/image/fetch/$s_!oGie!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F43635cd8-2a9a-4b32-bdb4-0c338074bfad_720x562.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!oGie!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F43635cd8-2a9a-4b32-bdb4-0c338074bfad_720x562.jpeg 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset 
pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">If you frequently run into this issue &#8212; check out my GPU rental service <a href="https://www.cloudrift.ai/">CloudRift</a>.</figcaption></figure></div>]]></content:encoded></item></channel></rss>