The gap between closed frontier models and the best self-hostable alternatives collapsed from 17.5 MMLU points in late 2023 to essentially zero by late 2025. And MoE architectures mean the best open models now run at API-competitive speeds on consumer hardware.
MMLU gap closed
17.5 → 0.3
Percentage point gap from frontier to best open model, 2023 to 2025
Current local ceiling
GPT-4o tier
Qwen3-30B-A3B at ArenaHard 91 vs GPT-4o 85.3, fits in 24 GB
Lag vs current frontier
~1.5–2 yrs
Local ceiling ≈ mid-2024 frontier (GPT-4o, May 2024)
Frontier vs open-source quality gap, 2022–2025
MMLU isn't a perfect quality metric, but it tracks the broad trend well. The gap widened initially as GPT-4 launched well ahead of open alternatives, then closed rapidly through 2024–2025.
The gap first widened. GPT-4's March 2023 launch jumped the frontier to ~86% MMLU, and open-source took over a year to follow. The gap peaked at ~17.5 points in late 2023.
Qwen and DeepSeek changed the pace. Alibaba's Qwen2.5 and DeepSeek's R1 (Jan 2025, with a base model reportedly trained for under $6M) proved that efficient training methods could rapidly close the compute gap. By late 2025 the MMLU gap was effectively zero.
MMLU parity doesn't mean product parity. Harder benchmarks (SWE-bench for agentic coding, AIME for competition math) and long-context reliability still show gaps. But the quality-per-dollar trajectory is now genuinely impressive.
Model explorer: what runs on a $5k box?
Filter by what fits, and see the quality-speed-VRAM tradeoff. The MoE models are the standout shift: GPT-4o-equivalent quality at dense-7B speeds.
Why MoE changes the inference equation
In a dense model, every token activates all parameters. In a sparse Mixture-of-Experts model, each token routes to only a few selected experts — activating a small fraction of total weights. This dramatically reduces bandwidth needed per token.
Dense 30B model
Every token reads all 30B parameters from VRAM.
Parameters activated per token: 30B / 30B
Memory transferred per token: ~60 GB at FP16 (~18 GB with a 4-bit quant)
Speed on RTX 4090 (~1 TB/s): ~35–45 tok/s with a 4-bit quant
Qwen3-30B-A3B (MoE)
Each token activates only 3B parameters (10% of the total). Expert routing selects which experts handle each token.
Parameters activated per token: 3B / 30B
Memory transferred per token: ~6 GB at FP16 (~2 GB with a 4-bit quant)
Speed on RTX 4090 (~1 TB/s): ~100–130 tok/s with a 4-bit quant
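A back-of-the-envelope way to see where these speed figures come from (a sketch; the bytes-per-parameter and efficiency numbers are assumptions, and real inference stacks add their own overhead):

```python
GB = 1e9

def decode_tok_per_s(active_params, bandwidth_gb_s, bytes_per_param=0.6, efficiency=0.7):
    """Rough tokens/sec when decoding is limited by reading weights from VRAM.

    Assumptions (illustrative, not measured): ~0.6 bytes/param for a 4-bit
    quant including overhead, and ~70% of peak bandwidth actually achieved.
    """
    bytes_per_token = active_params * bytes_per_param   # weights read for each token
    return bandwidth_gb_s * GB * efficiency / bytes_per_token

RTX_4090_BW = 1008  # GB/s peak memory bandwidth

print(decode_tok_per_s(30e9, RTX_4090_BW))  # dense 30B: ~39 tok/s
print(decode_tok_per_s(3e9, RTX_4090_BW))   # MoE, 3B active: ~390 tok/s upper bound;
                                            # kernel and routing overhead bring real
                                            # throughput down to the ~100-130 range
```

The takeaway: the dense model stays bandwidth-bound, while the MoE model's ceiling moves from bandwidth to compute and overhead.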
[Animated diagram: token routing to experts]
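The routing mechanism itself is simple. Here is a minimal PyTorch-style sketch of a top-k MoE layer; the shapes, expert count, and names are hypothetical, not Qwen3's actual configuration:

```python
import torch

def moe_layer(x, router, experts, k=2):
    """Sparse MoE forward pass: each token only touches its top-k experts' weights.

    x:       (tokens, d_model) hidden states
    router:  nn.Linear producing one logit per expert
    experts: list of small feed-forward networks
    """
    weights, idx = torch.topk(router(x).softmax(dim=-1), k)   # (tokens, k) gates and expert ids
    out = torch.zeros_like(x)
    for e, expert in enumerate(experts):
        hit = (idx == e).any(dim=-1)                      # tokens routed to expert e
        if hit.any():
            w = weights[hit][idx[hit] == e].unsqueeze(-1)  # gate weight for expert e
            out[hit] += w * expert(x[hit])                 # only these tokens read e's weights
    return out

# Toy usage: 8 experts, 2 active per token (numbers are illustrative)
d_model, n_experts = 64, 8
router = torch.nn.Linear(d_model, n_experts)
experts = [torch.nn.Sequential(torch.nn.Linear(d_model, 4 * d_model), torch.nn.GELU(),
                               torch.nn.Linear(4 * d_model, d_model))
           for _ in range(n_experts)]
y = moe_layer(torch.randn(16, d_model), router, experts)
```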
The catch. All 30B of expert weights still need to live somewhere accessible — VRAM or fast RAM. The total model footprint doesn't shrink. What shrinks is the bandwidth consumed per token. Loading from CPU RAM to GPU on demand (expert offloading) is possible but kills the speed advantage.
KV cache grows regardless. For long contexts, the key-value cache grows linearly with sequence length and must be read every token. A 200K-token context with a large model can mean hundreds of GB of KV cache — which brings bandwidth pressure back. MoE helps most for short-context chat and coding.
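To see where "hundreds of GB" comes from, a quick estimate; the layer and head counts below are a hypothetical 70B-class configuration without grouped-query attention, not any specific model:

```python
def kv_cache_gb(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """KV cache size = 2 (K and V) x layers x kv_heads x head_dim x seq_len x bytes."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem / 1e9

# Hypothetical 70B-class model, full multi-head attention, FP16 cache:
print(kv_cache_gb(seq_len=200_000, n_layers=80, n_kv_heads=64, head_dim=128))  # ~524 GB
# With grouped-query attention (say 8 KV heads), the same context needs ~65 GB,
# which still has to be re-read on every generated token.
```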
Quality vs. VRAM required
Each point is a model. The vertical dashed line shows the 24 GB RTX 4090 limit. Models to the left fit; models to the right require more hardware. Note how the MoE models (circles) break the expected quality/VRAM correlation.