
The hardware wall

Raw compute (FLOPs/$) improves quickly. Memory bandwidth and VRAM capacity, the resources that actually gate inference speed and model size, improve much more slowly, and the consumer tier falls further behind the datacenter tier every year.

- Bandwidth gap: consumer ~1.8 TB/s vs Rubin 13 TB/s in 2026
- VRAM gap: consumer 32 GB vs Rubin 288 GB
- Consumer GPU freeze: ~2 years; no new Nvidia consumer card until ~2028 (RTX 60 series)
- Bandwidth per dollar: flat; $/GB/s has barely improved at the flagship tier since 2020

The bandwidth race

Consumer and datacenter GPU memory bandwidth in GB/s. Both improve over time, but the datacenter curve is accelerating while the consumer curve stalls. A gap of roughly 3× in 2022 grows to more than 7× by 2026.

2022 inflection point. The H100 SXM launched with 3,350 GB/s — 3.3× the RTX 4090's 1,008 GB/s. Before this, the gap was more modest. After this, it widens fast.
2026: the freeze bites. Nvidia will ship no new consumer GPU in 2026. DRAM supply is constrained by AI chip demand. Meanwhile the Rubin datacenter GPU arrives at 13,000 GB/s. The consumer-to-datacenter ratio hits its worst level ever.
Why this matters for local inference. Decoding tokens is bandwidth-bound, not compute-bound. Each token requires reading all the model's active weights from memory. Speed scales almost linearly with bandwidth, not with FLOP count.
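
A back-of-the-envelope version of that claim, as a sketch: assuming every active weight is read from VRAM once per generated token and ~4.5 effective bits per weight at Q4 quantization (both simplifications; KV-cache traffic and kernel overhead are ignored), the bandwidth figure alone sets the decode ceiling.

```python
def decode_ceiling_tok_s(active_params_b: float, bandwidth_gb_s: float,
                         bits_per_weight: float = 4.5) -> float:
    """Upper bound on single-stream decode speed, in tokens/s.

    Assumes each generated token requires streaming every active weight
    from VRAM exactly once; ignores KV cache, overhead, and batching.
    """
    gb_read_per_token = active_params_b * bits_per_weight / 8
    return bandwidth_gb_s / gb_read_per_token

# Same 7B model at Q4 on a consumer card vs a 2026 datacenter part:
for name, bw in [("RTX 4090 (1,008 GB/s)", 1008), ("Rubin (13,000 GB/s)", 13000)]:
    print(f"{name}: ceiling ~{decode_ceiling_tok_s(7, bw):,.0f} tok/s")
```

Real decoders land well below the ceiling, but the scaling holds: double the bandwidth and the ceiling doubles, while extra FLOPs move it not at all.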

Bandwidth per dollar: stagnant at the top

Lower $/GB/s means better value. The RTX 3080 in 2020 was actually better value than anything Nvidia shipped in the four years after it. The RTX 5090 only just returns to 2020 levels — and only at MSRP, which is rarely the street price.

[Chart: consumer flagship $/GB/s at launch. Lower = better value; dotted = street price, typically higher than MSRP.]
[Chart: consumer flagship VRAM capacity (GB), flat at 24 GB for five years.]
The RTX 3080 anomaly. In 2020, Nvidia cut the price while boosting bandwidth. The $699 RTX 3080 at $0.92/GB/s was better value than anything that followed. The lesson: don't assume bandwidth per dollar always improves.
Consumer VRAM is a brick wall. The flagship consumer GPU was stuck at 24 GB from the RTX 3090 (2020) through the RTX 4090 era (2024). The RTX 5090 raised this to 32 GB in 2025, but at a street price of $3,200–3,500 the VRAM-per-dollar regression is severe.
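
The chart's value metric is easy to reproduce from launch specs. A minimal sketch: the MSRPs and bandwidth figures below are the public launch numbers, and the street-price line is the assumption flagged in the text, not a measured figure.

```python
flagships = [
    # (card, launch year, MSRP in $, memory bandwidth in GB/s)
    ("RTX 3080", 2020,  699,  760),
    ("RTX 3090", 2020, 1499,  936),
    ("RTX 4090", 2022, 1599, 1008),
    ("RTX 5090", 2025, 1999, 1792),
]

for card, year, msrp, bw in flagships:
    print(f"{card} ({year}): ${msrp / bw:.2f} per GB/s at MSRP")

# At an assumed $3,300 street price the RTX 5090 lands at ~$1.84/GB/s,
# twice the RTX 3080's $0.92 from five years earlier.
print(f"RTX 5090 (street): ${3300 / 1792:.2f} per GB/s")
```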

What speed can you expect today?

The RTX 4090 (24 GB, 1,008 GB/s) is the realistic anchor of a $3.5–5k build today. Speed is primarily determined by active parameter count, which is why MoE models are a qualitative shift; the sketch after the table sanity-checks the speed column against the pure bandwidth bound.

| Model | Type | VRAM (Q4) | Active params | Speed on 4090 | Fits in a $5k box? |
|---|---|---|---|---|---|
| Qwen2.5-7B | Dense | ~5 GB | 7B | ~110 tok/s | Yes |
| Qwen2.5-14B | Dense | ~9 GB | 14B | ~80 tok/s | Yes |
| Qwen3-30B-A3B | MoE (3B active) | ~18 GB | 3B | ~100–130 tok/s | Yes |
| Gemma 3 27B | Dense | ~16 GB | 27B | ~35–45 tok/s | Yes |
| Qwen2.5-32B | Dense | ~20 GB | 32B | ~30–40 tok/s | Yes (tight) |
| Llama 3.3 70B | Dense | ~38 GB | 70B | ~42–52 tok/s | No (needs 2× 4090) |
| Qwen3-235B-A22B | MoE (22B active) | ~130 GB | 22B | ~25–40 tok/s | No (needs ~6× 4090) |
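
As a rough sanity check on the speed column, compare each quoted figure against the pure bandwidth ceiling on the 4090, again assuming ~4.5 effective bits per weight at Q4 (an approximation; exact quant sizes vary). The quoted/ceiling ratio shows how much real decoders lose to per-token overheads.

```python
BW_4090 = 1008  # GB/s

rows = [
    # (model, active params in billions, table's quoted tok/s, midpoints for ranges)
    ("Qwen2.5-7B",     7, 110),
    ("Qwen2.5-14B",   14,  80),
    ("Qwen3-30B-A3B",  3, 115),
    ("Qwen2.5-32B",   32,  35),
]

for model, active_b, quoted in rows:
    ceiling = BW_4090 / (active_b * 4.5 / 8)  # tok/s if bandwidth were the only cost
    print(f"{model}: ceiling ~{ceiling:.0f} tok/s, quoted ~{quoted} "
          f"({quoted / ceiling:.0%} of ceiling)")
```

The dense models land at roughly 40–60% of the bound; the MoE reaches under 20% of its much higher bound, because at 3B active parameters the fixed per-token costs stop being amortised. It is still the fastest 30B-class option on the card.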

No new consumer GPU in 2026. Nvidia confirmed it will not launch an RTX 50 Super or new consumer series in 2026 due to DRAM supply constraints prioritising AI chips. Micron's CEO stated memory markets will "remain tight past 2026." The RTX 60 series is not expected until ~2028, leaving this table essentially unchanged for two years.

FLOPs per dollar vs. bandwidth per dollar

This is the key divergence. FLOPs per dollar (the figure quoted in trend analyses) improves quickly. Bandwidth per dollar, which actually gates inference speed, does not. Treating the "2.1-year doubling" as if it applied to inference is the most common mistake in forecasting local model deployment.

The FLOP number is a trap for inference forecasting. FP16 FLOP/$ doubles roughly every 2.1 years. But a single forward pass on a dense 30B model at FP16 needs roughly 60 GB of weights moved from memory (30B parameters × 2 bytes), and bandwidth isn't keeping pace. Buying more FLOPs doesn't help if you can't feed the GPU fast enough.
MoE is the partial escape hatch. By activating only 3B of its 30B parameters per token, Qwen3-30B-A3B transfers ~6 GB of weights per forward pass instead of ~60 GB, bringing the bandwidth requirement back to something consumer hardware can handle. It doesn't help with long contexts, though, where KV-cache growth adds its own bandwidth pressure.
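
A minimal roofline sketch of that arithmetic, assuming batch size 1, FP16 weights (2 bytes each), ~2 FLOPs per active parameter per token, and RTX 4090 spec-sheet peaks; attention FLOPs and KV-cache reads are ignored.

```python
PEAK_FLOPS = 165e12  # RTX 4090 FP16 tensor throughput, ~165 TFLOP/s (dense)
BANDWIDTH = 1008e9   # RTX 4090 memory bandwidth, bytes/s

def per_token_times(active_params: float, bytes_per_weight: float = 2.0):
    """Per-token compute time vs weight-streaming time at batch size 1."""
    compute_s = 2 * active_params / PEAK_FLOPS               # matmul FLOPs
    memory_s = active_params * bytes_per_weight / BANDWIDTH  # weights read once
    return compute_s, memory_s

for label, params in [("Dense 30B", 30e9), ("MoE, 3B active", 3e9)]:
    c, m = per_token_times(params)
    print(f"{label}: compute {c * 1e3:.2f} ms, memory {m * 1e3:.1f} ms "
          f"per token (cap ~{1 / m:.0f} tok/s)")
```

Compute finishes more than 100× sooner than the weights arrive, so the token rate is set entirely by the memory term; cutting streamed bytes tenfold, as the MoE does, raises the cap almost tenfold, while adding FLOPs changes nothing.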