CUDA Memory Coalescing: The First Thing to Get Right

If there's one thing that separates fast GPU kernels from slow ones, it's memory access patterns. You can have the most brilliant algorithm, but if your threads are reading memory in a scattered pattern, you can leave well over 90% of your bandwidth on the table.

The basics

Global memory on NVIDIA GPUs is accessed in 32-, 64-, or 128-byte transactions. When the 32 threads of a warp access consecutive, aligned addresses, the hardware coalesces those requests into as few transactions as possible — ideally a single 128-byte transaction for a full warp of 4-byte loads. Scattered or strided accesses force many separate transactions, most of whose bytes are fetched and then thrown away.

// Good: coalesced access — adjacent threads read adjacent elements
int i = threadIdx.x + blockIdx.x * blockDim.x;
float val = input[i];

// Bad: strided access — adjacent threads read elements `stride` apart
float val = input[i * stride];
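To make the difference concrete, here's a minimal sketch of a complete program that launches both access patterns on the same buffer. The kernel and variable names are mine, not from any library; timing the two launches (with cudaEvent timers or a profiler) exposes the bandwidth gap.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical demo kernels: each thread copies one float so the load
// cannot be optimized away.
__global__ void coalesced_read(const float* in, float* out, int n) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < n) out[i] = in[i];             // adjacent threads -> adjacent addresses
}

__global__ void strided_read(const float* in, float* out, int n, int stride) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < n) {
        long j = ((long)i * stride) % n;   // adjacent threads -> addresses `stride` apart
        out[i] = in[j];
    }
}

int main() {
    const int n = 1 << 24;                 // 16M floats (~64 MB)
    float *in, *out;
    cudaMalloc(&in,  n * sizeof(float));
    cudaMalloc(&out, n * sizeof(float));

    dim3 block(256), grid((n + block.x - 1) / block.x);
    coalesced_read<<<grid, block>>>(in, out, n);
    strided_read<<<grid, block>>>(in, out, n, 32);
    cudaDeviceSynchronize();

    printf("status: %s\n", cudaGetErrorString(cudaGetLastError()));
    cudaFree(in);
    cudaFree(out);
    return 0;
}
```

Both kernels move the same number of bytes into registers; only the addresses differ, which is what isolates the effect of coalescing.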

Measuring the difference

On an H100 SXM, whose peak HBM3 bandwidth is ~3.35 TB/s, a well-coalesced streaming read can approach that peak. With a stride of 32 elements, each 4-byte load lands in a different 128-byte region, and effective bandwidth collapses to roughly 100 GB/s. That's a ~33x performance difference from access patterns alone.

Takeaway

Before optimizing compute, profile your memory access patterns. Tools like ncu (Nsight Compute) will show you exactly how efficient your loads and stores are. Start there.
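As a starting point (section and metric names are from recent Nsight Compute versions and worth double-checking against your install), the Memory Workload Analysis section reports achieved bandwidth, and a per-sector efficiency metric directly reveals poor coalescing:

```shell
# Profile the first kernel launch with the memory-focused section
ncu --section MemoryWorkloadAnalysis --launch-count 1 ./my_app

# Query global-load efficiency directly: 100% means fully coalesced,
# ~12.5% is typical of a stride-32 float access pattern
ncu --metrics smsp__sass_average_data_bytes_per_sector_mem_global_op_ld.pct ./my_app
```

If the efficiency number is low, fix the access pattern before touching anything else in the kernel.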