Stanford Just Made LLM Scaling Laws 99% Cheaper

Training a frontier LLM costs anywhere from hundreds of millions to over a billion dollars per run. Before committing to that bet, developers use scaling laws to predict whether the model will actually get smarter as it gets bigger. There's just one problem: estimating those scaling laws is itself a massive computational exercise, requiring tens of thousands of model evaluations across tens of thousands of benchmark questions. Stanford researchers just found a way to cut that cost by 99 percent.

The Scaling Law Bottleneck

Scaling laws are the compass of modern AI development. They tell you how performance improves with more compute and more data. Without them, you are flying blind. That is why every major lab from OpenAI to Google to Meta has teams dedicated to estimating them.

The trouble is that traditional estimation is brutally expensive. It requires training hundreds of smaller models at various scales, then running each of them against thousands of benchmark questions. The Stanford researchers cite cases where this involves up to 10 trillion individual queries across model-benchmark pairs. The complexity grows as O(M × N), where M is the number of models and N is the number of questions: every model is evaluated on every question, so adding either one multiplies the total work.

A 2023 study by Biderman et al. found that even mid-scale scaling law experiments can consume GPU-months of compute, putting rigorous scaling analysis out of reach for all but the best-funded labs.

Item Response Scaling Laws (IRSL)

The Stanford team, Sang Truong, Yuheng Tu, Rylan Schaeffer, and Sanmi Koyejo, borrowed a trick from standardized testing. Item Response Theory (IRT) is what powers the SAT and other adaptive exams: instead of asking every student every question, it models each student's latent ability and each question's difficulty separately, then uses statistical inference to predict performance on unseen items.
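To make the ability/difficulty split concrete, here is a minimal sketch of the standard two-parameter logistic (2PL) IRT model, with each LLM playing the role of a student. The numbers are hypothetical and chosen only for illustration, and this is textbook IRT rather than the paper's exact model.

```python
import numpy as np

def irt_2pl(theta, a, b):
    """Probability of a correct answer under the standard 2PL IRT model.

    theta: latent ability of the test-taker (here, a model)
    a:     discrimination of the question
    b:     difficulty of the question
    """
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

# Hypothetical values: three models of increasing ability, two questions.
abilities = np.array([-1.0, 0.0, 1.5])    # one latent ability per model
difficulty = np.array([0.0, 1.0])         # one difficulty per question
discrimination = np.array([1.0, 2.0])     # one discrimination per question

# Broadcasting fills in the whole model-by-question table of predicted
# accuracies from per-model and per-question parameters alone.
predicted = irt_2pl(abilities[:, None], discrimination[None, :], difficulty[None, :])
print(predicted.round(3))
```

Every cell of the table is predicted from three abilities plus two (difficulty, discrimination) pairs, which is what allows performance on unseen model-question pairs to be inferred rather than measured.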

IRSL applies the same logic to LLM evaluation. It factorizes the problem from O(M × N) down to O(M + N) by disentangling model ability from question characteristics. The key innovation is Beta-IRT, an extension that works with empirical probability responses (token probabilities during pre-training, pass rates during test-time sampling) rather than simple binary pass/fail signals.
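The sketch below illustrates both ideas in that paragraph: the drop in parameter count from factorizing, and a Beta likelihood for responses that are probabilities rather than 0/1 outcomes. The parameterization is an assumption made for illustration (Beta mean given by the 2PL curve, plus a precision parameter `phi`); the paper's exact Beta-IRT formulation may differ.

```python
import math

def beta_irt_logpdf(y, theta, a, b, phi):
    """Log-density of an observed probability y in (0, 1) under a Beta response model.

    Assumed parameterization (not taken from the paper): the Beta mean is the
    2PL curve sigmoid(a * (theta - b)), and phi > 0 is a precision parameter.
    """
    mu = 1.0 / (1.0 + math.exp(-a * (theta - b)))
    alpha, beta = mu * phi, (1.0 - mu) * phi
    log_norm = math.lgamma(alpha + beta) - math.lgamma(alpha) - math.lgamma(beta)
    return log_norm + (alpha - 1.0) * math.log(y) + (beta - 1.0) * math.log(1.0 - y)

def parameter_counts(num_models, num_questions):
    """Free parameters for a per-pair table versus the factorized form."""
    dense = num_models * num_questions             # O(M x N): one value per pair
    factorized = num_models + 2 * num_questions    # O(M + N): ability + (a, b) per question
    return dense, factorized

# Sizes taken from the pre-training validation described in this article.
print(parameter_counts(6_612, 37_682))
```

A binary IRT model would have to threshold a token probability or pass rate into pass/fail first; a Beta response model scores the observed probability directly.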

The results are striking. While traditional methods might require 10,000+ questions to estimate scaling behavior, IRSL delivers equivalent accuracy using as few as 50 carefully selected questions. That is a reduction of more than 99 percent in computational demand.
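The 99 percent figure follows directly from those two numbers. This quick check assumes evaluation cost scales linearly with the number of questions each model answers, and it ignores the cost of fitting the IRT model and selecting the questions.

```python
def query_reduction(full_questions, subset_questions):
    """Fraction of per-model queries saved by evaluating only a subset."""
    return 1.0 - subset_questions / full_questions

saving = query_reduction(10_000, 50)
print(f"{saving:.1%}")  # 99.5%
```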

Validation at Scale

The team validated IRSL across two major scaling paradigms:

Pre-training downstream scaling: 6,612 LLM checkpoints from 6 model families, tested across 37,682 questions from 10 benchmarks. IRSL matched or exceeded the predictive accuracy of traditional methods while using a fraction of the queries.

Test-time scaling: 12 LLMs, 120 questions from 4 benchmarks, with up to 2,500 samples per question. Again, IRSL produced reliable scaling estimates with dramatically less computation.
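For a sense of scale, the two validation grids can be sized with simple multiplication. This assumes every checkpoint is evaluated on every question, which the figures above do not state explicitly, and it treats 2,500 samples per question as an upper bound.

```python
# Pre-training setting: checkpoints x questions
pretraining_pairs = 6_612 * 37_682

# Test-time setting: models x questions x maximum samples per question
test_time_samples = 12 * 120 * 2_500

print(f"{pretraining_pairs:,}")   # 249,153,384
print(f"{test_time_samples:,}")   # 3,600,000
```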

The paper was accepted at ICML 2026, one of the top machine learning conferences, and is also under review at ICLR 2026.

What This Means

The most immediate implication is access. Right now, running this kind of analysis is something only the largest labs can afford. Academic groups and smaller companies have to guess, or rely on published scaling curves that may not generalize to their setup. IRSL changes that. If you can run a few hundred targeted evaluations instead of millions, scaling analysis comes within reach of far smaller teams.

Sanmi Koyejo, the senior author, put it in context: "Before scaling laws were proven, the best-known developers gambled and bet the farm on them. They made big strategic decisions about how to tweak and design their models and used scaling laws to extrapolate performance, and they were right." Now those same predictions cost 99 percent less to make.

The counterintuitive finding is that doing less computational work can produce predictions that are as good or better. That has broader implications. It suggests that the field has been spending brute-force computation on its scaling estimates when a statistically smarter approach could have been delivering the same insights all along.

Limitations

IRSL is not a magic wand. It reduces the cost of estimating scaling laws, not the cost of training models. And its accuracy depends on the quality of the benchmark questions used. Garbage in, garbage out applies here as much as anywhere in ML. The paper also notes that Beta-IRT's performance degrades when the empirical probability signal is noisy, which can happen with very small sample sizes.

The framework also makes assumptions about question independence that might not hold across all evaluation settings. The researchers acknowledge this and suggest ensemble-based extensions as future work.

Sources

Stanford HAI: New Approach to Scaling Laws Could Change How AI Models Are Trained (May 21, 2026): https://hai.stanford.edu/news/new-approach-to-scaling-laws-could-change-how-ai-models-are-trained
ICML 2026 Poster: Item Response Scaling Laws: https://icml.cc/virtual/2026/poster/64176
OpenReview: Item Response Scaling Laws: https://openreview.net/forum?id=pIfopX18D1
Digital Watch Observatory: New Stanford scaling method could make AI training cheaper: https://dig.watch/updates/new-stanford-scaling-method-could-make-ai-training-cheaper
Biderman et al. (2023) on pre-training downstream scaling: https://arxiv.org/abs/2305.08842
Brown et al. (2024) on test-time scaling: referenced in IRSL paper

What Surprised Me

The 99 percent reduction is the headline, but the thing I keep coming back to is the parameter complexity shift. O(M × N) to O(M + N) is the kind of improvement that feels obvious in hindsight. Why were we treating every model-benchmark pair as an independent experiment when psychometrics figured out the shared structure decades ago?

There is something vaguely embarrassing about the fact that AI evaluation is just now discovering statistical techniques from the SAT. The field's hunger for brute-force compute sometimes blinds it to smarter approaches. IRSL is not going to save anyone a billion dollars on training runs. But it might make the planning that leads to those runs a lot less wasteful.
