# The Graphics Pipeline

## GPU vs CPU

If you are a backend or systems developer, the GPU is a foreign piece of hardware. The CPU is a master chef: it cooks one dish at a time, can follow any recipe, handles exceptions mid-stream, and adapts to every condition. The GPU is a brigade of a thousand short-order cooks: each one does the same simple task on a different ingredient, cannot improvise, and cannot branch on its own. Together they process 1000× more ingredients — but only if the recipe is identical for every single one.

This is the throughput vs latency distinction. The CPU minimizes latency (finish one task fast). The GPU maximizes throughput (finish many identical tasks fast). This distinction drives every constraint and design choice in graphics programming.

## Why Not GPU For Everything

Branching kills parallelism. Inside a GPU warp (a group of 32 or 64 threads, depending on the vendor, that execute the same shader instruction in lockstep), if even one thread takes a different branch, the warp serializes: it runs the first branch with the non-participating threads masked off, then the second. Every thread in the warp pays for both paths. This is why `if/else` on varying data is expensive, and why `match`-style multi-way branching is avoided in shader design.
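The cost of divergence can be sketched with a toy model. This is an illustration under stated assumptions, not how any particular GPU schedules work: `warp_branch_cost` and its cycle counts are hypothetical names and numbers chosen for the example.

```rust
/// Toy cost model for one warp executing `if cond { A } else { B }`.
/// `takes_then[i]` is true when lane `i` takes the `then` branch.
/// `then_cost` and `else_cost` are hypothetical cycle counts for each body.
fn warp_branch_cost(takes_then: &[bool], then_cost: u32, else_cost: u32) -> u32 {
    let any_then = takes_then.iter().any(|&t| t);
    let any_else = takes_then.iter().any(|&t| !t);

    let mut cycles = 0;
    // The warp executes a branch body if at least one lane needs it.
    // Lanes on the other path are masked off and idle in the meantime.
    if any_then {
        cycles += then_cost;
    }
    if any_else {
        cycles += else_cost;
    }
    cycles
}
```

With 32 lanes that all agree, the warp pays for one branch body. Flip a single lane and the model charges the whole warp for both.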

There is also PCIe transfer cost. Pushing megabytes of data to the GPU is relatively cheap — the bus was built for bulk transfers. Pulling results back, or transferring data back and forth per-frame, is a bottleneck you fight constantly.

The GPU also has no heap, no recursion, no `stdio`, and no arbitrary memory allocation. Every vertex shader invocation gets the same static stack. Every fragment shader invocation is stateless. You design around this, not against it.

## The Rendering Pipeline

Rendering maps 3D geometry to a 2D framebuffer through five stages:

```
Vertex Shader ──→ Primitive Assembly ──→ Rasterizer ──→ Fragment Shader ──→ Output Merge
once/vertex       groups→triangles       pixels/frag    once/fragment       depth/blend
```

Each stage is a pipeline filter. Data flows through; nothing flows backward. This is the logical model every GPU exposes, from integrated Intel chips to RTX 5090s, even though modern hardware schedules the programmable stages on shared shader cores.

### Stage 1: Vertex Shader

[vertex shader](GLOSSARY.md#vertex-shader) — a GPU program running once per input [vertex](GLOSSARY.md#vertex).

Input: vertex attributes read from the [vertex buffer](GLOSSARY.md#vertex-buffer). In our case: position and color.

Output: mandatory clip-space position (`vec4<f32>`) plus any per-vertex data the [fragment shader](GLOSSARY.md#fragment-shader) needs downstream: color, UV coordinates, normals, etc.

The vertex shader is the only place you transform geometry. In complex scenes this means multiplying by model-view-projection matrices. For our triangle, the vertices are already specified in clip space, so the vertex shader passes the position through unchanged.
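To make the matrix step concrete, here is a minimal CPU-side sketch of the multiplication a vertex shader performs. The names `Mat4`, `mul_mat4_vec4`, and `translation` are hypothetical helpers written for this example; in a real shader the multiplication is a built-in operator.

```rust
type Vec4 = [f32; 4];
/// Column-major 4x4 matrix: `m[c]` is column `c`, as in WGSL's `mat4x4<f32>`.
type Mat4 = [Vec4; 4];

/// Computes `m * v`, treating `v` as a column vector.
fn mul_mat4_vec4(m: &Mat4, v: Vec4) -> Vec4 {
    let mut out = [0.0; 4];
    for row in 0..4 {
        for col in 0..4 {
            out[row] += m[col][row] * v[col];
        }
    }
    out
}

/// Builds a translation matrix. With all offsets at zero this is the identity,
/// which corresponds to passing the position through unchanged.
fn translation(tx: f32, ty: f32, tz: f32) -> Mat4 {
    [
        [1.0, 0.0, 0.0, 0.0],
        [0.0, 1.0, 0.0, 0.0],
        [0.0, 0.0, 1.0, 0.0],
        [tx, ty, tz, 1.0],
    ]
}
```

A full model-view-projection transform is three such matrices multiplied together and then applied to each vertex position.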

### Stage 2: Primitive Assembly

Hardware only. No user code runs here.

The GPU takes vertices in the order you submitted them and groups them into [primitive](GLOSSARY.md#primitive) shapes. With [topology](GLOSSARY.md#topology) set to `TriangleList`, every group of 3 consecutive vertices becomes one triangle. Vertex 0, 1, 2 → triangle A. Vertex 3, 4, 5 → triangle B.
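The grouping rule is simple enough to sketch on the CPU. This is an illustration of the `TriangleList` behaviour described above, not code you write for the GPU; `assemble_triangle_list` is a hypothetical helper.

```rust
/// Groups a vertex stream into triangles the way a `TriangleList` topology does:
/// every 3 consecutive vertices form one triangle. Leftover vertices that do
/// not complete a triangle are dropped.
fn assemble_triangle_list<V: Copy>(vertices: &[V]) -> Vec<[V; 3]> {
    vertices
        .chunks_exact(3)
        .map(|c| [c[0], c[1], c[2]])
        .collect()
}
```

Six vertices yield two triangles. Seven vertices also yield two, because the seventh has no partners.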

### Stage 3: Rasterizer

> **Note:** The 5-stage model above is a simplification for conceptual clarity. The actual WebGPU and Vulkan pipelines define 11+ stages, including fixed-function vertex post-processing stages between the programmable stages. This section covers the essential stages relevant to writing shaders and configuring pipelines.

Before rasterization proper, the GPU performs several fixed-function vertex post-processing steps on the clip-space positions output by the vertex shader:

- **Clipping:** Primitives that fall entirely outside the clip volume are discarded. Primitives that partially intersect it are clipped and retriangulated. This happens in clip space, before the divide.
- **Perspective Division:** The clip-space `vec4` position is divided by its `w` component, converting it to normalized device coordinates (NDC). In WebGPU, x and y range over [-1, 1] and z over [0, 1].
- **Viewport Transform:** NDC coordinates are mapped to window pixel coordinates based on the configured viewport dimensions.

These stages are automatic and happen in hardware. You do not write code for them.
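Although the hardware does this for you, the arithmetic is short. The following sketch applies the three steps to a single point, assuming WebGPU conventions (depth in [0, 1], framebuffer y pointing down). The `Viewport` struct and function names are hypothetical, and real clipping operates on whole primitives rather than single points.

```rust
/// True when a clip-space point lies inside the clip volume:
/// -w <= x <= w, -w <= y <= w, 0 <= z <= w.
fn inside_clip_volume(p: [f32; 4]) -> bool {
    let [x, y, z, w] = p;
    -w <= x && x <= w && -w <= y && y <= w && 0.0 <= z && z <= w
}

/// Perspective division: clip space to normalized device coordinates.
/// Assumes `w > 0`, which holds for visible points after clipping.
fn perspective_divide(p: [f32; 4]) -> [f32; 3] {
    let [x, y, z, w] = p;
    [x / w, y / w, z / w]
}

/// Hypothetical viewport rectangle in pixels.
struct Viewport {
    x: f32,
    y: f32,
    width: f32,
    height: f32,
}

/// Maps NDC to framebuffer coordinates. NDC y points up, framebuffer y
/// points down, so y is flipped. Depth is passed through unchanged.
fn viewport_transform(ndc: [f32; 3], vp: &Viewport) -> [f32; 3] {
    [
        vp.x + 0.5 * (ndc[0] + 1.0) * vp.width,
        vp.y + 0.5 * (1.0 - ndc[1]) * vp.height,
        ndc[2],
    ]
}
```

For example, the clip-space point `[1.0, 1.0, 1.0, 2.0]` becomes NDC `[0.5, 0.5, 0.5]` and lands at pixel coordinates (600, 150) in an 800×600 viewport.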

Then the [rasterizer](GLOSSARY.md#rasterizer) takes over — the hardware stage that converts triangles into fragments.

For each submitted triangle, the rasterizer determines which screen pixels the triangle covers. For each covered pixel, it generates one [fragment](GLOSSARY.md#fragment) — a "potential pixel" carrying interpolated data.
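One common way to describe the coverage test is with edge functions: a pixel is covered when its center lies on the same side of all three edges. The sketch below is a naive CPU version that scans every pixel; real rasterizers are far more efficient and apply a tie-breaking fill rule for pixels exactly on an edge, which this sketch skips. The function names are hypothetical.

```rust
/// Signed edge function: positive on one side of the directed edge a -> b,
/// negative on the other, zero exactly on the edge.
fn edge(a: [f32; 2], b: [f32; 2], p: [f32; 2]) -> f32 {
    (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0])
}

/// Returns the (x, y) coordinates of every pixel whose center falls inside
/// (or exactly on the boundary of) the triangle. One entry = one fragment.
fn covered_pixels(tri: [[f32; 2]; 3], width: u32, height: u32) -> Vec<(u32, u32)> {
    let mut fragments = Vec::new();
    let area = edge(tri[0], tri[1], tri[2]);
    if area == 0.0 {
        return fragments; // degenerate triangle covers nothing
    }
    for y in 0..height {
        for x in 0..width {
            let p = [x as f32 + 0.5, y as f32 + 0.5]; // pixel center
            let e0 = edge(tri[1], tri[2], p);
            let e1 = edge(tri[2], tri[0], p);
            let e2 = edge(tri[0], tri[1], p);
            // Accept either winding order by matching the sign of the area.
            let inside = if area > 0.0 {
                e0 >= 0.0 && e1 >= 0.0 && e2 >= 0.0
            } else {
                e0 <= 0.0 && e1 <= 0.0 && e2 <= 0.0
            };
            if inside {
                fragments.push((x, y));
            }
        }
    }
    fragments
}
```

On a 4×4 pixel grid, the triangle with corners (0, 0), (4, 0), (0, 4) produces 10 fragments under this inclusive rule.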

The critical function here is [interpolation](GLOSSARY.md#interpolation). The rasterizer computes [barycentric coordinates](GLOSSARY.md#barycentric-coordinates) — three weights (w0, w1, w2) that sum to 1 — describing where inside the triangle the pixel falls. Then for every value the vertex shader output, the rasterizer computes: `value = w0 * value0 + w1 * value1 + w2 * value2`.

This is the step that makes colors blend across the triangle. It is free, automatic, hardware-accelerated [interpolation](GLOSSARY.md#interpolation). You do not write the code; the rasterizer applies it to every vertex shader output by default.
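The weighted-sum formula above can be worked through on the CPU. This sketch derives the weights from signed areas and blends three RGB colors. It assumes a non-degenerate triangle and does plain screen-space interpolation, ignoring the perspective correction that hardware applies. The helper names are hypothetical.

```rust
type Vec2 = [f32; 2];
type Color = [f32; 3];

/// Twice the signed area of the triangle (a, b, p).
fn edge(a: Vec2, b: Vec2, p: Vec2) -> f32 {
    (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0])
}

/// Barycentric weights (w0, w1, w2) of `p` relative to `tri`. They sum to 1.
/// Each weight is the sub-triangle area opposite a vertex over the total area.
fn barycentric(tri: [Vec2; 3], p: Vec2) -> [f32; 3] {
    let area = edge(tri[0], tri[1], tri[2]);
    [
        edge(tri[1], tri[2], p) / area,
        edge(tri[2], tri[0], p) / area,
        edge(tri[0], tri[1], p) / area,
    ]
}

/// value = w0 * value0 + w1 * value1 + w2 * value2, applied per channel.
fn interpolate(w: [f32; 3], values: [Color; 3]) -> Color {
    let mut out = [0.0; 3];
    for c in 0..3 {
        out[c] = w[0] * values[0][c] + w[1] * values[1][c] + w[2] * values[2][c];
    }
    out
}
```

With red, green, and blue at the corners (0, 0), (4, 0), (0, 4), the point (1, 1) has weights (0.5, 0.25, 0.25) and therefore the color (0.5, 0.25, 0.25): mostly red, because it sits closest to the red vertex.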

### Stage 4: Fragment Shader

[fragment shader](GLOSSARY.md#fragment-shader) — a GPU program running once per [fragment](GLOSSARY.md#fragment).

Input: the already-interpolated values from the vertex shader, delivered by the rasterizer. The fragment shader receives one invocation per covered screen pixel. If a triangle covers 2000 pixels, the fragment shader runs 2000 times.

Output: the final RGBA color for that pixel. The fragment shader computes lighting, textures, and pixel-level effects. For our triangle, it receives the interpolated vertex color and returns it unchanged.

### Stage 5: Output Merge

The final hardware stage before the color hits the [framebuffer](GLOSSARY.md#framebuffer).

Per-fragment operations:

- **Depth test:** Compare the fragment's Z value against the depth buffer. Discard fragments behind already-drawn geometry. We disable this for our triangle — we only draw one primitive.
- **Stencil test:** Mask drawing to specific screen regions via a stencil buffer. We disable this.
- **Blend:** Combine the new fragment color with the existing framebuffer color. We use REPLACE — the fragment color overwrites whatever was there.

After the output merge, the final color is written to the framebuffer. When your [load op](GLOSSARY.md#loadop) is `Clear`, the framebuffer is filled with your background color before the render pass begins. The [store op](GLOSSARY.md#storeop) determines whether you keep or discard the results after the render pass.
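The depth test and replace blend can be sketched per pixel as follows. This is a simplified CPU model, assuming a "less than" depth comparison where smaller values are closer. `Pixel`, `Fragment`, and `merge` are hypothetical names, and the stencil test is omitted.

```rust
/// One framebuffer location: stored color plus stored depth.
#[derive(Clone, Copy, Debug, PartialEq)]
struct Pixel {
    color: [f32; 4],
    depth: f32,
}

/// An incoming fragment produced by the fragment shader.
struct Fragment {
    color: [f32; 4],
    depth: f32,
}

/// Applies an optional depth test, then a replace blend.
/// Returns true if the fragment was written to the pixel.
fn merge(dst: &mut Pixel, frag: &Fragment, depth_test: bool) -> bool {
    if depth_test && frag.depth >= dst.depth {
        return false; // fragment is behind what is already drawn
    }
    dst.color = frag.color; // replace blend: overwrite the old color
    if depth_test {
        dst.depth = frag.depth;
    }
    true
}
```

With the depth test disabled, as in the triangle example, every fragment simply overwrites the cleared background color.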

## Why This Matters For The Rainbow Triangle

The entire rainbow triangle effect flows from the pipeline architecture:

1. **Vertex shader** runs 3 times — once for each vertex. Each invocation outputs a position and a solid color: red, green, or blue.

2. **Primitive assembly** groups those 3 vertices into one triangle.

3. **Rasterizer** determines that the triangle covers roughly 1000 screen pixels and generates one fragment for each. For each fragment, it interpolates the three vertex colors using barycentric weights. A pixel near the red vertex gets mostly red. A pixel in the center gets roughly equal parts of all three. This produces the gradient automatically.

4. **Fragment shader** runs roughly 1000 times, once per fragment. Each invocation receives the already-interpolated color and writes it to the output.

The rainbow gradient is not programmed. There is no loop, no formula, no color blending logic. The gradient is a direct consequence of the pipeline architecture: the rasterizer interpolates vertex shader outputs across the triangle surface, and the fragment shader passes the interpolated value through. You supply three colors at three corners, and the GPU fills in the continuum between them.

## The Pipeline Object In wgpu

In wgpu, you compile all of this into a [pipeline](GLOSSARY.md#pipeline): a single opaque render pipeline object encoding your shaders, topology, blend state, vertex layout, and output format. It is created once during initialization and reused every frame, which avoids per-frame shader compilation and state configuration. The [device](GLOSSARY.md#device) creates the pipeline, and you use the [queue](GLOSSARY.md#queue) to submit command buffers whose draw calls reference it.

The [adapter](GLOSSARY.md#adapter) is the physical GPU or software renderer you select. There may be multiple on a single system — a dedicated NVIDIA card plus integrated Intel graphics. You pick one adapter, create a device from it, and all resources flow from that device.

## A Note On The Pipeline Model

The five-stage model presented here is a simplified educational abstraction. The actual WebGPU and Vulkan graphics pipelines define 11 or more stages. Between the programmable stages lie fixed-function hardware stages including clipping, perspective division, and viewport transform that operate automatically. The five stages above capture the essential flow relevant to writing shaders and configuring pipelines for common use cases.

For the complete specification of the WebGPU rendering pipeline, consult the [WebGPU Specification](https://www.w3.org/TR/webgpu/).