Principles of Speed: More GPU Memory Utilisation ≠ Better Performance

AI GPUs

I’ve started training LLMs. I’ve got some background in using GPUs for deep learning and HPC, but most of that was with GNNs, which are relatively lightweight. Most of my optimisations focused on pipelines, data loading and PyTorch operations, and my objective was to benchmark different models. Right now I’m working with more parameters, more compute and more data, but I still want speed and efficiency. This is a lil blog on what I’ve learnt about GPU optimisation for LLMs.

Firstly, read this brrr_intro by Horace He. Honestly, it’s probably (definitely) a more concise and better-written version of what I’m trying to say here. But I wanted to write it out for myself… something something about an exercise in understanding. Some other good resources for practical optimisations are this neptune blog and this pytorch blog.

🎯 The GPU Memory Sweet Spot Theory

The max performance of a GPU is determined by two factors:

  • Compute (FLOPS): How fast they can do math
  • Memory Bandwidth: How fast they can move data

I’ve been using an NVIDIA A100 40GB. There are better and worse GPUs out there, but the principles of speed are the same. The A100 has:

  • 312 TFLOPS (bfloat16)
  • 1.5 TB/s memory bandwidth

The key ratio is peak FLOPS / memory bandwidth ≈ 200 operations per byte: the arithmetic intensity a workload needs before math, rather than memory, becomes the limit. In other words, the A100 can do ~200 operations in the time it takes to move one byte from memory. That’s great for workloads that do a lot of math per byte moved, but it also means that if your workload has low arithmetic intensity, it easily becomes memory-bound: the GPU can do the math quicker than it can move the data in and out of memory. The opposite is compute-bound, where the GPU is doing math as fast as it can and memory bandwidth is not the bottleneck.
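
To make that concrete, here’s the back-of-the-envelope version in Python. The peak numbers are the A100 specs above; the per-kernel FLOP/byte figures are rough illustrative guesses, not measurements.

# Roofline-style back-of-the-envelope for an A100 40GB.
PEAK_FLOPS = 312e12   # bfloat16 tensor core peak, FLOP/s
PEAK_BW = 1.5e12      # HBM bandwidth, bytes/s

ridge_point = PEAK_FLOPS / PEAK_BW   # ~208 FLOP per byte moved

def bound_by(flops_per_byte: float) -> str:
    # A kernel below the ridge point is limited by memory traffic, above it by math.
    return "compute-bound" if flops_per_byte >= ridge_point else "memory-bound"

print(f"ridge point: {ridge_point:.0f} FLOP/byte")
print("elementwise add (~0.1 FLOP/byte):", bound_by(0.1))
print("small-batch transformer layer (~2-4 FLOP/byte):", bound_by(3))
print("huge dense matmul (300+ FLOP/byte):", bound_by(300))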

Why this matters for transformer models:

  • Transformers need ~2-4 ops/byte (loading weights, doing math, storing results)
  • The GPU can do 200 ops/byte

This massive gap means the GPU can easily become memory-bound. We want to find a balance between compute and memory bandwidth that maximises operation throughput. GPU RAM usage is a rough proxy for how much data you’re moving. If RAM usage is very high, you’re likely memory-bound; if it’s very low, you’re likely underutilising the GPU and leaving compute idle. You want the point where the GPU is constantly moving data and doing math, with neither side being the bottleneck or sitting idle.
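
Here’s a minimal sketch of how I keep an eye on that during training, using PyTorch’s built-in CUDA memory counters. The helper name and the "print it every N steps" habit are mine; it’s only a gauge to read next to a tokens/sec counter, not a metric in its own right.

import torch

def gpu_snapshot(device: int = 0) -> None:
    # How much of the card's RAM this process is actually holding.
    props = torch.cuda.get_device_properties(device)
    total = props.total_memory
    allocated = torch.cuda.memory_allocated(device)   # tensors currently alive
    reserved = torch.cuda.memory_reserved(device)     # what the caching allocator has grabbed
    print(f"{props.name}: allocated {allocated / total:5.1%}, "
          f"reserved {reserved / total:5.1%} of {total / 1e9:.0f} GB")

# Call every N steps inside the training loop, alongside a tokens/sec measurement;
# memory % on its own says nothing about whether you're compute- or memory-bound.
if torch.cuda.is_available():
    gpu_snapshot()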

The Memory Wall Problem

The “Memory Wall” refers to the growing disparity between processor speed and memory speed:

Year    Compute Growth    Memory BW Growth    Gap
1990    1x               1x                  1x
2000    100x             10x                 10x
2010    10,000x          100x                100x
2020    1,000,000x       1,000x              1,000x

This exponential gap means modern GPUs can compute far faster than they can fetch data. The solution? Keep the compute units fed with data already in fast caches, not waiting on slow memory transfers.

Understanding GPU Architecture Components

The GPU Hierarchy - From Big to Small

┌─────────────────────── GPU (A100) ───────────────────────┐
│                                                          │
│  ┌─────────────────── GPC (7 total) ────────────────────┐│
│  │                                                      ││
│  │  ┌──── TPC ─────┐   ┌──── TPC ────┐                  ││
│  │  │              │   │             │                  ││
│  │  │ ┌─SM─┐ ┌─SM─┐│   │┌─SM─┐ ┌─SM─┐│  (7-8 TPCs/GPC,  ││
│  │  │ │    │ │    ││   ││    │ │    ││   2 SMs per TPC) ││
│  │  │ └────┘ └────┘│   │└────┘ └────┘│  = 108 SMs total ││
│  │  └──────────────┘   └─────────────┘                  ││
│  └──────────────────────────────────────────────────────┘│
│                                                          │
│  ┌────────────── Memory Subsystem ────────────────────┐  │
│  │  L2 Cache: 40 MB (shared by all SMs)               │  │
│  │  HBM2e: 40 GB (1.5 TB/s bandwidth)                 │  │
│  └────────────────────────────────────────────────────┘  │
└──────────────────────────────────────────────────────────┘

What Each Component Does

  • GPC (Graphics Processing Cluster): Top-level organization unit that manages work distribution
  • TPC (Texture Processing Cluster): Mid-level compute unit that handles texture operations and compute workloads
  • SM (Streaming Multiprocessor): The fundamental compute unit where actual work happens
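
You can sanity-check most of these counts for whatever card you’re actually on straight from PyTorch; a quick sketch:

import torch

if torch.cuda.is_available():
    p = torch.cuda.get_device_properties(0)
    print(p.name)
    print("SMs (multiprocessors):", p.multi_processor_count)  # 108 on an A100
    print("Total HBM:", round(p.total_memory / 1e9), "GB")
    print("Compute capability:", f"{p.major}.{p.minor}")      # 8.0 = Ampere GA100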

Inside a Single SM - Where the Magic Happens

┌──────────────── Streaming Multiprocessor (SM) ────────────────┐
│                                                               │
│  Warp Schedulers (4x)                                         │
│  ┌──────┐ ┌──────┐ ┌──────┐ ┌──────┐                          │
│  │Sched1│ │Sched2│ │Sched3│ │Sched4│  → Dispatch instructions │
│  └──────┘ └──────┘ └──────┘ └──────┘                          │
│                                                               │
│  Execution Units:                                             │
│  ┌────────────────────────────────────┐                       │
│  │ 64 FP32 CUDA Cores (INT32 capable) │ ← Basic math ops      │
│  │ 32 FP64 CUDA Cores                 │ ← Double precision    │
│  │ 4 Tensor Cores (3rd gen)           │ ← Matrix multiply     │
│  │ 16 Load/Store Units                │ ← Memory access       │
│  │ 4 Special Function Units (SFU)     │ ← Transcendentals     │
│  └────────────────────────────────────┘                       │
│                                                               │
│  Local Memory:                                                │
│  ┌────────────────────────────────────┐                       │
│  │ Register File: 256 KB              │ ← Fastest (1 cycle)   │
│  │ L1/Shared Memory: 192 KB           │ ← Fast (30 cycles)    │
│  │ Constant Cache: 64 KB              │ ← Read-only cache     │
│  └────────────────────────────────────┘                       │
└───────────────────────────────────────────────────────────────┘

Component Responsibilities

Warp Schedulers

  • What they do: Pick which warp executes next
  • Capability: Each can dispatch 1 instruction per cycle
  • Strategy: Hide memory latency by switching between warps
  • Key insight: An SM holds at most 64 resident warps, i.e. 16 per scheduler; the closer you get to that, the more candidates each scheduler has for hiding latency (worked numbers below)
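
A tiny worked example of that arithmetic, assuming a 256-thread block and that register/shared-memory pressure lets 4 blocks sit on the SM at once (both numbers are illustrative choices, not recommendations):

# Rough occupancy arithmetic for one A100 SM.
WARP_SIZE = 32
MAX_WARPS_PER_SM = 64
SCHEDULERS_PER_SM = 4

threads_per_block = 256                            # illustrative choice
warps_per_block = threads_per_block // WARP_SIZE   # 8 warps
blocks_resident = 4                                # assume resources allow 4 blocks per SM

resident_warps = warps_per_block * blocks_resident         # 32 of the 64 possible
warps_per_scheduler = resident_warps / SCHEDULERS_PER_SM   # 8 candidates to switch between
occupancy = resident_warps / MAX_WARPS_PER_SM              # 50%

print(resident_warps, warps_per_scheduler, f"{occupancy:.0%}")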

CUDA Cores

  • What they do: Execute basic arithmetic (add, multiply, etc.)
  • FP32 cores: Handle single-precision floating point and integers
  • FP64 cores: Handle double-precision (half the throughput)
  • Key insight: These are simple ALUs, not full CPU cores

Tensor Cores

  • What they do: Accelerate matrix multiply-accumulate (MMA) operations
  • Performance: 312 TFLOPS vs 19.5 TFLOPS without them
  • Operations: D = A×B + C in a single operation
  • Key insight: Essential for transformer models - why A100 beats older GPUs
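
In PyTorch you mostly get the tensor cores by keeping matmuls in bf16/fp16 (e.g. via autocast). A minimal timing sketch; the 4096×4096 shape and iteration count are arbitrary:

import time
import torch

def time_matmul(dtype, n: int = 4096, iters: int = 50) -> float:
    a = torch.randn(n, n, device="cuda", dtype=dtype)
    b = torch.randn(n, n, device="cuda", dtype=dtype)
    a @ b                                # warm-up
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        a @ b
    torch.cuda.synchronize()
    return (time.perf_counter() - start) / iters

if torch.cuda.is_available():
    fp32 = time_matmul(torch.float32)    # CUDA cores (unless TF32 is allowed)
    bf16 = time_matmul(torch.bfloat16)   # 3rd-gen tensor cores
    print(f"fp32: {fp32 * 1e3:.2f} ms   bf16: {bf16 * 1e3:.2f} ms")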

Load/Store Units

  • What they do: Move data between registers and memory
  • Bottleneck: Often the limiting factor in memory-bound kernels
  • Capability: 16 units × 32 bytes = 512 bytes per cycle per SM
  • Key insight: When these stall, your whole SM stalls

Special Function Units (SFU)

  • What they do: Compute transcendental functions (sin, cos, exp, log)
  • Performance: Slower than basic ops but hardware-accelerated
  • Usage: Critical for activation functions like GELU, softmax
  • Key insight: Limited count means these can become bottlenecks
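
As a concrete example of where those transcendentals come from, the tanh approximation of GELU spends one tanh (plus a cube and a few multiplies) per element; a sketch:

import math
import torch

def gelu_tanh(x: torch.Tensor) -> torch.Tensor:
    # Tanh approximation of GELU: the tanh is the transcendental part,
    # exactly the kind of op the SFUs exist to accelerate.
    return 0.5 * x * (1.0 + torch.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x.pow(3))))

x = torch.randn(8)
print(torch.allclose(gelu_tanh(x), torch.nn.functional.gelu(x, approximate="tanh"), atol=1e-6))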

The Memory Hierarchy in Context

Distance from SM:
┌─────────────────────────────────────────────┐
│ Registers          │ 0 hops  │ 1 cycle      │ ← Thread-private
│ L1/Shared Memory   │ 0 hops  │ ~30 cycles   │ ← Block-shared
│ L2 Cache           │ 1 hop   │ ~200 cycles  │ ← Global-shared
│ HBM2e              │ 2 hops  │ ~450 cycles  │ ← Device memory
│ System RAM         │ 3 hops  │ ~10K cycles  │ ← Via PCIe
└─────────────────────────────────────────────┘
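
A crude way to see the HBM number for yourself: time a large device-to-device copy and divide bytes moved by elapsed time. The 2 GB size is arbitrary and a real measurement would average several runs; this is just a sketch.

import time
import torch

def copy_bandwidth_tbps(n_bytes: int = 2 * 1024**3) -> float:
    # A device-to-device copy reads and writes every byte, so count the traffic twice.
    src = torch.empty(n_bytes, dtype=torch.uint8, device="cuda")
    dst = torch.empty_like(src)
    dst.copy_(src)                       # warm-up
    torch.cuda.synchronize()
    start = time.perf_counter()
    dst.copy_(src)
    torch.cuda.synchronize()
    elapsed = time.perf_counter() - start
    return 2 * n_bytes / elapsed / 1e12

if torch.cuda.is_available():
    print(f"~{copy_bandwidth_tbps():.2f} TB/s effective")   # lands somewhere below the 1.5 peak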

How Work Gets Executed

  1. Thread: Single execution unit (like one vector element)
  2. Warp: 32 threads that execute in lockstep (SIMD style)
  3. Block: Collection of warps (up to 1024 threads)
  4. Grid: Collection of blocks (your entire kernel launch)
Grid of Blocks:
┌───┬───┬───┬───┐
│B0 │B1 │B2 │B3 │  Each block assigned to an SM
├───┼───┼───┼───┤  Multiple blocks can share an SM
│B4 │B5 │B6 │B7 │  SMs execute blocks independently
└───┴───┴───┴───┘

Inside Block B0:
┌─────────────────────────┐
│ Warp 0: T0-T31          │  All threads in a warp
│ Warp 1: T32-T63         │  execute the same instruction
│ Warp 2: T64-T95         │  on different data (SIMD)
│ Warp 3: T96-T127        │
└─────────────────────────┘

How Work Flows Through the GPU

When you launch a kernel:

  1. Grid of thread blocks is created (your entire workload)
  2. Blocks get distributed to available SMs by the GPU’s global scheduler
  3. Warps (32 threads) within each block execute on the SM
  4. Threads in a warp execute the same instruction on different data (SIMD)

The GPU’s scheduler automatically distributes blocks across all 108 SMs, balancing the workload.
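
To put numbers on that, here’s the launch arithmetic for a simple elementwise kernel over 10M elements, assuming 256 threads per block (a common default, nothing magic about it):

import math

N = 10_000_000              # elements to process
THREADS_PER_BLOCK = 256     # a typical choice
WARP_SIZE = 32
NUM_SMS = 108               # A100

blocks = math.ceil(N / THREADS_PER_BLOCK)          # 39,063 blocks in the grid
warps_per_block = THREADS_PER_BLOCK // WARP_SIZE   # 8
waves = blocks / NUM_SMS                           # ~362 blocks fed to each SM over time

print(f"{blocks} blocks, {warps_per_block} warps/block, ~{waves:.0f} blocks per SM")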

Why Bigger Batches Hit Diminishing Returns

Small batch (memory: 20%) - Underutilized hardware

  • SM occupancy too low - many compute units sit idle
  • Not enough warps to hide even normal memory latency
  • Each SM gets only 1-2 blocks instead of optimal 4-8 blocks
Time →    0    1    2    3    4    5    6    7    8    9   10   11
Warp 1:  [C1 ][---Wait-Wait---][C1 ][---Wait-Wait---][C1 ]
Warp 2:       [C2 ][---Wait-Wait---][C2 ][---Wait-Wait---]
(No more warps available - small batch exhausted)
SM Unit: [C1 ][C2 ][idle][idle][C1 ][C2 ][idle][idle][C1 ][C2 ]
Result:  ██████████░░░░░░░░░░░░██████████░░░░░░░░░░░░██████████ (45% util)

Why the wait? Even with 20% memory, warps still need to:
- Load weights from L2/HBM (~200 cycles)
- Load activations and intermediate values
- Store results back to memory
These are normal, unavoidable memory operations. The problem is we only have 2 warps because of the small batch size, so when both are waiting, the SM goes idle.

Optimal batch (memory: 33%) - Perfect balance

  • High SM occupancy - compute units fully utilized
  • Sufficient warps for complete latency hiding
  • Memory bandwidth not saturated (~80% of 1.5 TB/s)
  • Register pressure low, allowing more concurrent warps
  • Tensor cores stay fed with continuous data
Time →    0    1    2    3    4    5    6    7    8    9   10   11
Warp 1:  [C1 ][--Wait-Wait--][C1 ][--Wait-Wait--][C1 ][--Wait-Wait
Warp 2:       [C2 ][--Wait-Wait--][C2 ][--Wait-Wait--][C2 ][--Wait
Warp 3:            [C3 ][--Wait-Wait--][C3 ][--Wait-Wait--][C3 ]
Warp 4:                 [C4 ][--Wait-Wait--][C4 ][--Wait-Wait--]
SM Unit: [C1 ][C2 ][C3 ][C4 ][C1 ][C2 ][C3 ][C4 ][C1 ][C2 ][C3 ]
Result: ██████████████████████████████████████████████████████████
        Continuous compute - while one waits, others execute

Large batch (memory: 50%+) - Memory bottleneck

  • Memory bandwidth saturated (hitting 1.5 TB/s limit)
  • Cache thrashing - L2 evicts data before reuse
  • High register/shared memory use limits concurrent warps
  • Load/Store units bottlenecked, tensor cores starve
Time →    0    1    2    3    4    5    6    7    8    9   10   11
Warp 1:  [C1 ][----------Long Memory Wait----------][C1 ]
Warp 2:       [C2 ][----------Long Memory Wait----------][C2 ]
SM Unit: [C1 ][C2 ][....stall....stall....stall....][C1 ][C2 ]
Result:  ██████████░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░█████████
        Gaps in compute - not enough warps to hide latency

Why the long wait? Double whammy at 50%+ memory:
1. Fewer warps fit (2 instead of 4) due to high register/shared memory usage
2. Each warp waits 3x longer (600+ cycles instead of 200) due to memory bandwidth saturation

Key insight: At 50%+ memory, you get hit from both sides - fewer warps AND longer waits per warp, creating massive stalls in the SM unit.

The irony: using more memory actually gives you worse performance because you can’t fit enough warps to hide the increased memory latency.
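
The practical upshot: sweep batch size and measure tokens/sec rather than trusting the memory gauge. A rough sketch of that loop; run_step is a stand-in for whatever does your forward/backward/optimizer step, and the batch sizes are just examples.

import time
import torch

def throughput_for(batch_size: int, seq_len: int, n_steps: int = 20):
    # Returns (tokens/sec, peak memory fraction) over a few training steps.
    torch.cuda.reset_peak_memory_stats()
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(n_steps):
        run_step(batch_size, seq_len)    # hypothetical: one full training step
    torch.cuda.synchronize()
    elapsed = time.perf_counter() - start
    tokens_per_sec = n_steps * batch_size * seq_len / elapsed
    peak_frac = torch.cuda.max_memory_allocated() / torch.cuda.get_device_properties(0).total_memory
    return tokens_per_sec, peak_frac

for bs in (8, 16, 32, 64, 128):
    tps, mem = throughput_for(bs, seq_len=2048)
    print(f"batch {bs:4d}: {tps:,.0f} tok/s at {mem:.0%} memory")
# Keep the batch size where tok/s stops improving, not the one that fills the card.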

Summary

The Highway Analogy

Think of GPU memory like a highway:

  • 0% full: Empty road, cars (compute) idle
  • 15% full: Few cars, some idle lanes
  • 33% full: Traffic flows at maximum speed
  • 50% full: Starting to slow down
  • 80% full: Stop-and-go traffic
  • 100% full: Gridlock

The maximum throughput (cars × speed) happens around 30-40% (ish) capacity, not at 100%!

Key Takeaways

  1. Memory ≠ Performance: More memory usage doesn’t mean better performance
  2. Balance is Key: Optimal performance occurs when compute is saturated but memory bandwidth isn’t
  3. Architecture Matters: Different GPUs have different sweet spots based on their compute:bandwidth ratio
  4. Workload Specific: The exact optimal point depends on your model architecture and sequence length

The lesson: more memory usage ≠ better performance. Find the point where compute is saturated but memory bandwidth isn’t, so that neither one becomes the bottleneck.