
The hardware wall

Raw compute (FLOPs/$) improves quickly. Memory bandwidth and VRAM capacity, the resources that actually gate inference speed and model size, improve much more slowly, and the consumer tier falls further behind the datacenter tier every year.

- Bandwidth gap: consumer ~1.8 TB/s vs Rubin 13 TB/s in 2026
- VRAM gap: consumer 32 GB vs Rubin 288 GB
- Consumer GPU freeze: ~2 years; no new Nvidia consumer card until ~2028 (RTX 60 series)
- Bandwidth per dollar: flat; $/GB/s has barely improved at the flagship tier since 2020

The bandwidth race

Consumer and datacenter GPU memory bandwidth in GB/s. Both improve over time, but the datacenter curve is accelerating while the consumer curve stalls. A gap of roughly 3× in 2022 grows to more than 7× by 2026.

2022 inflection point. The H100 SXM launched with 3,350 GB/s — 3.3× the RTX 4090's 1,008 GB/s. Before this, the gap was more modest. After this, it widens fast.
2026: the freeze bites. Nvidia will ship no new consumer GPU in 2026. DRAM supply is constrained by AI chip demand. Meanwhile the Rubin datacenter GPU arrives at 13,000 GB/s. The consumer-to-datacenter ratio hits its worst level ever.
Why this matters for local inference. Decoding tokens is bandwidth-bound, not compute-bound. Each token requires reading all the model's active weights from memory. Speed scales almost linearly with bandwidth, not with FLOP count.
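
A back-of-the-envelope version of that claim, as a sketch: assuming every active weight is read from VRAM once per generated token and ~4.5 effective bits per weight at Q4 quantization (both simplifications; KV-cache traffic and kernel overhead are ignored), the bandwidth figure alone sets the decode ceiling.

```python
def decode_ceiling_tok_s(active_params_b: float, bandwidth_gb_s: float,
                         bits_per_weight: float = 4.5) -> float:
    """Upper bound on single-stream decode speed, in tokens/s.

    Assumes each generated token requires streaming every active weight
    from VRAM exactly once; ignores KV cache, overhead, and batching.
    """
    gb_read_per_token = active_params_b * bits_per_weight / 8
    return bandwidth_gb_s / gb_read_per_token

# Same 7B model at Q4 on a consumer card vs a 2026 datacenter part:
for name, bw in [("RTX 4090 (1,008 GB/s)", 1008), ("Rubin (13,000 GB/s)", 13000)]:
    print(f"{name}: ceiling ~{decode_ceiling_tok_s(7, bw):,.0f} tok/s")
```

Real decoders land well below the ceiling, but the scaling holds: double the bandwidth and the ceiling doubles, while extra FLOPs move it not at all.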

Bandwidth per dollar: stagnant at the top

Lower $/GB/s means better value. The RTX 3080 in 2020 was actually better value than anything Nvidia shipped in the four years after it. The RTX 5090 only just returns to 2020 levels — and only at MSRP, which is rarely the street price.

[Chart: consumer flagship $/GB/s at launch. Lower = better value; dotted = street price, typically higher than MSRP.]
[Chart: consumer flagship VRAM capacity (GB), flat at 24 GB for five years.]
The RTX 3080 anomaly. In 2020, Nvidia cut the price while boosting bandwidth. The $699 RTX 3080 at $0.92/GB/s was better value than anything that followed. The lesson: don't assume bandwidth per dollar always improves.
Consumer VRAM is a brick wall. The flagship consumer GPU was stuck at 24 GB from the RTX 3090 (2020) through the RTX 4090 era (2024). The RTX 5090 raised this to 32 GB in 2025, but at a street price of $3,200–3,500 the VRAM-per-dollar regression is severe.
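
The chart's value metric is easy to reproduce from launch specs. A minimal sketch: the MSRPs and bandwidth figures below are the public launch numbers, and the street-price line is the assumption flagged in the text, not a measured figure.

```python
flagships = [
    # (card, launch year, MSRP in $, memory bandwidth in GB/s)
    ("RTX 3080", 2020,  699,  760),
    ("RTX 3090", 2020, 1499,  936),
    ("RTX 4090", 2022, 1599, 1008),
    ("RTX 5090", 2025, 1999, 1792),
]

for card, year, msrp, bw in flagships:
    print(f"{card} ({year}): ${msrp / bw:.2f} per GB/s at MSRP")

# At an assumed $3,300 street price the RTX 5090 lands at ~$1.84/GB/s,
# twice the RTX 3080's $0.92 from five years earlier.
print(f"RTX 5090 (street): ${3300 / 1792:.2f} per GB/s")
```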

What speed can you expect today?

The RTX 4090 (24 GB, 1,008 GB/s) is the realistic anchor of a $3.5–5k build today. Speed is primarily determined by active parameter count, which is why MoE models are a qualitative shift; the sketch after the table sanity-checks the speed column against the pure bandwidth bound.

| Model | Type | VRAM (Q4) | Active params | Speed on 4090 | Fits in a $5k box? |
|---|---|---|---|---|---|
| Qwen2.5-7B | Dense | ~5 GB | 7B | ~110 tok/s | Yes |
| Qwen2.5-14B | Dense | ~9 GB | 14B | ~80 tok/s | Yes |
| Qwen3-30B-A3B | MoE (3B active) | ~18 GB | 3B | ~100–130 tok/s | Yes |
| Gemma 3 27B | Dense | ~16 GB | 27B | ~35–45 tok/s | Yes |
| Qwen2.5-32B | Dense | ~20 GB | 32B | ~30–40 tok/s | Yes (tight) |
| Llama 3.3 70B | Dense | ~38 GB | 70B | ~42–52 tok/s | No (needs 2× 4090) |
| Qwen3-235B-A22B | MoE (22B active) | ~130 GB | 22B | ~25–40 tok/s | No (needs ~6× 4090) |
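
As a rough sanity check on the speed column, compare each quoted figure against the pure bandwidth ceiling on the 4090, again assuming ~4.5 effective bits per weight at Q4 (an approximation; exact quant sizes vary). The quoted/ceiling ratio shows how much real decoders lose to per-token overheads.

```python
BW_4090 = 1008  # GB/s

rows = [
    # (model, active params in billions, table's quoted tok/s, midpoints for ranges)
    ("Qwen2.5-7B",     7, 110),
    ("Qwen2.5-14B",   14,  80),
    ("Qwen3-30B-A3B",  3, 115),
    ("Qwen2.5-32B",   32,  35),
]

for model, active_b, quoted in rows:
    ceiling = BW_4090 / (active_b * 4.5 / 8)  # tok/s if bandwidth were the only cost
    print(f"{model}: ceiling ~{ceiling:.0f} tok/s, quoted ~{quoted} "
          f"({quoted / ceiling:.0%} of ceiling)")
```

The dense models land at roughly 40–60% of the bound; the MoE reaches under 20% of its much higher bound, because at 3B active parameters the fixed per-token costs stop being amortised. It is still the fastest 30B-class option on the card.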

No new consumer GPU in 2026. Nvidia confirmed it will not launch an RTX 50 Super or new consumer series in 2026 due to DRAM supply constraints prioritising AI chips. Micron's CEO stated memory markets will "remain tight past 2026." The RTX 60 series is not expected until ~2028, leaving this table essentially unchanged for two years.

FLOPs per dollar vs. bandwidth per dollar

This is the key divergence. FLOPs per dollar (the figure quoted in trend analyses) improves quickly. Bandwidth per dollar, which actually gates inference speed, does not. Treating the "2.1-year doubling" as if it applied to inference is the most common mistake in forecasting local model deployment.

The FLOP number is a trap for inference forecasting. FP16 FLOP/$ doubles roughly every 2.1 years. But a single forward pass on a dense 30B model at FP16 needs roughly 60 GB of weights moved from memory (30B parameters × 2 bytes), and bandwidth isn't keeping pace. Buying more FLOPs doesn't help if you can't feed the GPU fast enough.
MoE is the partial escape hatch. By activating only 3B of its 30B parameters per token, Qwen3-30B-A3B transfers ~6 GB of weights per forward pass instead of ~60 GB, bringing the bandwidth requirement back to something consumer hardware can handle. It doesn't help with long contexts, though, where KV-cache growth adds its own bandwidth pressure.
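
A minimal roofline sketch of that arithmetic, assuming batch size 1, FP16 weights (2 bytes each), ~2 FLOPs per active parameter per token, and RTX 4090 spec-sheet peaks; attention FLOPs and KV-cache reads are ignored.

```python
PEAK_FLOPS = 165e12  # RTX 4090 FP16 tensor throughput, ~165 TFLOP/s (dense)
BANDWIDTH = 1008e9   # RTX 4090 memory bandwidth, bytes/s

def per_token_times(active_params: float, bytes_per_weight: float = 2.0):
    """Per-token compute time vs weight-streaming time at batch size 1."""
    compute_s = 2 * active_params / PEAK_FLOPS               # matmul FLOPs
    memory_s = active_params * bytes_per_weight / BANDWIDTH  # weights read once
    return compute_s, memory_s

for label, params in [("Dense 30B", 30e9), ("MoE, 3B active", 3e9)]:
    c, m = per_token_times(params)
    print(f"{label}: compute {c * 1e3:.2f} ms, memory {m * 1e3:.1f} ms "
          f"per token (cap ~{1 / m:.0f} tok/s)")
```

Compute finishes more than 100× sooner than the weights arrive, so the token rate is set entirely by the memory term; cutting streamed bytes tenfold, as the MoE does, raises the cap almost tenfold, while adding FLOPs changes nothing.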