FitnessBench — coaching is reasoning. Can your AI model do it?

The finding

A calculator, not a coach.

Coaching is the same loop in every sport: read the athlete's numbers, reason, prescribe. We built that loop as questions — fitness from a race result, the pace it implies, the load that won't cause injury — with every answer computed from the literature and graded in code.

Models split cleanly. Single-formula lookups — BMI, one-rep-max, heart-rate zones — are nearly perfect. The moment a question needs several chained steps of reasoning, accuracy falls off a cliff. Today's models calculate. They don't yet coach.

Chart: field-average accuracy by task type (0–100%)

Single-formula "plug-in" tasks: BMI · 1RM · HR zones · energy · power-to-weight

Multi-step reasoning tasks: VDOT inference · training-pace prescription

Averaged across every benchmarked model — frontier and open alike. No model clears 60% on training-pace prescription.


Best model's overall accuracy across six disciplines — flattered by the easy formulas every model can do.


Best model on the multi-step reasoning tasks — the actual coaching judgment, where the frontier stalls.

6 disciplines, 29 task types, including a multi-step tier. Every answer is computed from a named formula and scored at temperature 0.

The science we test

Real models behind every question.

FitnessBench doesn't ask for opinions. Each question is generated from an established exercise-science model, so the right answer is computed — and the grading is code.

Race-pace curve (running)

Pace per km climbs predictably with distance — the Riegel/VDOT basis for prediction.
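As a rough illustration of that curve, here is a minimal sketch of Riegel's power law, assuming the commonly cited exponent of 1.06. The function names and the example athlete are hypothetical, and this is not FitnessBench's own generator code.

```python
def riegel_predict(known_time_s: float, known_dist_km: float,
                   target_dist_km: float, exponent: float = 1.06) -> float:
    """Predict a race time with Riegel's power law: T2 = T1 * (D2 / D1) ** exponent."""
    return known_time_s * (target_dist_km / known_dist_km) ** exponent


def pace_s_per_km(time_s: float, dist_km: float) -> float:
    """Average pace in seconds per kilometre."""
    return time_s / dist_km


# Hypothetical athlete: 5 km in 20:00.
time_5k = 20 * 60
time_10k = riegel_predict(time_5k, 5.0, 10.0)

print(f"5 km pace:  {pace_s_per_km(time_5k, 5.0):.0f} s/km")
print(f"10 km pace: {pace_s_per_km(time_10k, 10.0):.0f} s/km")
```

Because the exponent is greater than 1, predicted pace per km slows as the distance grows, which is the climb the curve shows.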

Heart-rate zones (physiology)

Five zones as a share of max HR; Karvonen sets the target bpm for each.
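A minimal sketch of the Karvonen (heart-rate reserve) calculation follows. The five zone boundaries below are an assumed, commonly used split, not necessarily the one FitnessBench uses, and the names are hypothetical.

```python
# Assumed zone boundaries as fractions of intensity (illustrative only).
ZONE_BOUNDS = {
    1: (0.50, 0.60),
    2: (0.60, 0.70),
    3: (0.70, 0.80),
    4: (0.80, 0.90),
    5: (0.90, 1.00),
}


def karvonen_bpm(max_hr: float, resting_hr: float, intensity: float) -> float:
    """Karvonen target: resting HR plus a fraction of the heart-rate reserve."""
    return resting_hr + intensity * (max_hr - resting_hr)


def zone_targets(max_hr: float, resting_hr: float) -> dict:
    """Return {zone: (low_bpm, high_bpm)} rounded to whole beats."""
    return {
        zone: (round(karvonen_bpm(max_hr, resting_hr, low)),
               round(karvonen_bpm(max_hr, resting_hr, high)))
        for zone, (low, high) in ZONE_BOUNDS.items()
    }


# Hypothetical athlete: max HR 190 bpm, resting HR 60 bpm.
targets = zone_targets(190, 60)
for zone, (low, high) in targets.items():
    print(f"Zone {zone}: {low}-{high} bpm")
```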

Workload sweet spot (injury)

Acute:chronic load in the 0.8–1.3 band lowers injury risk; spikes push into the danger zone.
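A sketch of the ratio, assuming the rolling-average form: acute load is the last 7 days, chronic load is the average weekly load over the last 28 days. The 0.8–1.3 band comes from the text above; the window lengths and function names are assumptions.

```python
def acwr(daily_loads: list) -> float:
    """Acute:chronic workload ratio from at least 28 days of daily load."""
    if len(daily_loads) < 28:
        raise ValueError("need at least 28 days of load data")
    recent = daily_loads[-28:]
    acute = sum(recent[-7:])      # load in the most recent week
    chronic = sum(recent) / 4     # average weekly load over four weeks
    if chronic == 0:
        raise ValueError("chronic load is zero; ratio is undefined")
    return acute / chronic


def in_sweet_spot(ratio: float, low: float = 0.8, high: float = 1.3) -> bool:
    """True when the ratio sits inside the lower-risk band."""
    return low <= ratio <= high


# Hypothetical training logs (arbitrary load units per day).
steady = [100] * 28
spike = [100] * 21 + [200] * 7

print(acwr(steady), in_sweet_spot(acwr(steady)))
print(acwr(spike), in_sweet_spot(acwr(spike)))
```

Doubling the load in the final week lifts the ratio to 1.6, outside the band.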

Coverage

Six disciplines. Every answer computed.

Wherever exercise science gives a verifiable answer, FitnessBench tests it — from race paces to power zones to injury workload. A named formula stands behind every question.

Running

Fitness, prescription and prediction from race results — plus training-load safety.

VDOT · training pace · race prediction · mileage safety
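To make the VDOT step concrete, here is a sketch using the Daniels & Gilbert regression equations as they are commonly reproduced. The constants are quoted from that common form, the function names are hypothetical, and FitnessBench's exact implementation is not shown on this page.

```python
import math


def vo2_cost(velocity_m_per_min: float) -> float:
    """Oxygen cost (ml/kg/min) of running at a given velocity."""
    v = velocity_m_per_min
    return -4.60 + 0.182258 * v + 0.000104 * v ** 2


def fraction_of_max(time_min: float) -> float:
    """Fraction of VO2max sustainable for a race lasting time_min minutes."""
    return (0.8
            + 0.1894393 * math.exp(-0.012778 * time_min)
            + 0.2989558 * math.exp(-0.1932605 * time_min))


def vdot(distance_m: float, time_s: float) -> float:
    """Estimate VDOT from a single race result."""
    time_min = time_s / 60
    velocity = distance_m / time_min
    return vo2_cost(velocity) / fraction_of_max(time_min)


# Hypothetical race result: 5000 m in 20:00.
score = vdot(5000, 20 * 60)
print(f"VDOT: {score:.1f}")
```

Prescribing a training pace then means inverting this relationship at a target intensity, which is the chained step the multi-step tier exercises.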

Cycling

Power-based training: zones, stress, sustainable thresholds and power-to-weight.

FTP zones · TSS · W/kg · critical power
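Two of these quantities can be sketched in a few lines, assuming the standard Coggan definitions of intensity factor and Training Stress Score. The function names and example rider are hypothetical.

```python
def intensity_factor(normalized_power_w: float, ftp_w: float) -> float:
    """Intensity factor: normalized power as a fraction of FTP."""
    return normalized_power_w / ftp_w


def training_stress_score(duration_s: float, normalized_power_w: float,
                          ftp_w: float) -> float:
    """TSS = (seconds * NP * IF) / (FTP * 3600) * 100."""
    if_value = intensity_factor(normalized_power_w, ftp_w)
    return (duration_s * normalized_power_w * if_value) / (ftp_w * 3600) * 100


def watts_per_kg(power_w: float, mass_kg: float) -> float:
    """Power-to-weight ratio."""
    return power_w / mass_kg


# Hypothetical rider: FTP 250 W, 70 kg.
hour_at_ftp = training_stress_score(3600, 250, 250)
endurance_ride = training_stress_score(90 * 60, 200, 250)

print(f"1 h at FTP: {hour_at_ftp:.0f} TSS")
print(f"90 min at 200 W NP: {endurance_ride:.0f} TSS")
print(f"FTP power-to-weight: {watts_per_kg(250, 70):.2f} W/kg")
```

By construction, one hour ridden exactly at FTP scores 100.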

Swimming

Critical swim speed from time trials, pace, and CSS-based time prediction.

CSS · swim pace · time prediction
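A sketch of critical swim speed, assuming the common two-trial protocol with 400 m and 200 m time trials. The trial distances, the simple distance-over-speed prediction, and the function names are all assumptions for illustration.

```python
def critical_swim_speed(t_long_s: float, t_short_s: float,
                        d_long_m: float = 400, d_short_m: float = 200) -> float:
    """CSS in m/s: the slope between two time trials."""
    if t_long_s <= t_short_s:
        raise ValueError("the longer trial must take more time")
    return (d_long_m - d_short_m) / (t_long_s - t_short_s)


def pace_per_100m_s(css_m_per_s: float) -> float:
    """CSS pace in seconds per 100 m."""
    return 100 / css_m_per_s


def predict_time_s(distance_m: float, css_m_per_s: float) -> float:
    """Naive prediction: swim the whole distance at CSS."""
    return distance_m / css_m_per_s


# Hypothetical swimmer: 400 m in 6:00, 200 m in 2:50.
css = critical_swim_speed(360, 170)
print(f"CSS: {css:.3f} m/s, {pace_per_100m_s(css):.0f} s per 100 m")
print(f"1500 m at CSS: {predict_time_s(1500, css):.0f} s")
```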

Physiology

Heart-rate zones, basal and total energy expenditure, and body composition.

max HR · Karvonen · BMR / TDEE · MET burn · BMI
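For the energy side, here is a sketch of the Mifflin-St Jeor equation for basal metabolic rate, with total daily energy expenditure taken as BMR times an activity multiplier. The 1.55 multiplier in the example is an assumed illustrative value, and the function names are hypothetical.

```python
def bmr_mifflin_st_jeor(weight_kg: float, height_cm: float,
                        age_years: float, male: bool) -> float:
    """Basal metabolic rate in kcal/day (Mifflin-St Jeor)."""
    base = 10 * weight_kg + 6.25 * height_cm - 5 * age_years
    return base + 5 if male else base - 161


def tdee(bmr_kcal: float, activity_multiplier: float) -> float:
    """Total daily energy expenditure: BMR scaled by an activity factor."""
    return bmr_kcal * activity_multiplier


# Hypothetical athlete: 80 kg, 180 cm, 30 years old.
bmr_male = bmr_mifflin_st_jeor(80, 180, 30, male=True)
bmr_female = bmr_mifflin_st_jeor(80, 180, 30, male=False)

print(f"BMR (male):   {bmr_male:.0f} kcal/day")
print(f"BMR (female): {bmr_female:.0f} kcal/day")
print(f"TDEE at 1.55: {tdee(bmr_male, 1.55):.0f} kcal/day")
```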

Strength

One-rep-max estimation and load prescription for a target rep range.

1RM estimate · load prescription
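The page does not say which 1RM formula is used. As one possibility, the sketch below uses the Epley formula and inverts it to prescribe a load for a target rep count; the function names are hypothetical.

```python
def epley_one_rep_max(weight: float, reps: int) -> float:
    """Estimate 1RM with Epley: weight * (1 + reps / 30)."""
    if reps < 1:
        raise ValueError("reps must be at least 1")
    return weight * (1 + reps / 30)


def load_for_reps(one_rep_max: float, target_reps: int) -> float:
    """Invert Epley to find the load that matches a target rep count."""
    if target_reps < 1:
        raise ValueError("target_reps must be at least 1")
    return one_rep_max / (1 + target_reps / 30)


# Hypothetical lifter: 100 kg for 5 reps.
estimated_max = epley_one_rep_max(100, 5)
ten_rep_load = load_for_reps(estimated_max, 10)

print(f"Estimated 1RM: {estimated_max:.1f} kg")
print(f"Load for 10 reps: {ten_rep_load:.1f} kg")
```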

Injury

Workload-spike risk, classic overuse-injury recognition, and evidence-based management.

ACWRinjury riskrecognitionPEACE & LOVE

Methodology

Why this number means something

Most benchmarks leak, saturate, or grade with another LLM. FitnessBench is built against each of those failure modes — the score is a measurement, not a vibe.

Four defenses

Computed ground truth: answers come from established models (Daniels & Gilbert, Coggan, Riegel, Mifflin-St Jeor, Gabbett), not opinion.

Code-verified: responses are parsed and checked against a numeric tolerance, never by an LLM judge.

Procedural: every question's numbers are randomized per seed, so nothing is memorizable.

Correctness is the target, not a proxy, so there's no confound to game.
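The procedural and code-verified ideas can be sketched together. This is a minimal illustration, not FitnessBench's harness: the BMI question, the "Answer: <number>" response convention, the 1% relative tolerance, and all function names are assumptions.

```python
import random
import re

# Assumed response convention: the model ends with "Answer: <number>".
_ANSWER = re.compile(r"answer\s*[:=]\s*(-?\d+(?:\.\d+)?)", re.IGNORECASE)


def make_bmi_question(seed: int) -> tuple:
    """Generate a question whose numbers depend only on the seed."""
    rng = random.Random(seed)
    weight_kg = round(rng.uniform(50, 110), 1)
    height_m = round(rng.uniform(1.50, 2.00), 2)
    prompt = (f"An athlete weighs {weight_kg} kg and is {height_m} m tall. "
              "What is their BMI? Finish with 'Answer: <number>'.")
    truth = weight_kg / height_m ** 2   # ground truth computed from the formula
    return prompt, truth


def grade(response: str, truth: float, rel_tol: float = 0.01) -> bool:
    """Parse the stated answer and compare it to the truth within a tolerance."""
    match = _ANSWER.search(response)
    if match is None:
        return False
    value = float(match.group(1))
    return abs(value - truth) <= rel_tol * abs(truth)


prompt, truth = make_bmi_question(seed=7)
print(prompt)
print(grade(f"BMI is weight over height squared. Answer: {truth:.2f}", truth))
print(grade("Answer: 0", truth))
```

The same seed always reproduces the same question, so a run can be repeated exactly, while a new seed yields fresh numbers.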

What it covers

Running: VDOT, paces, race prediction, mileage safety.

Cycling: Coggan FTP zones, TSS, power-to-weight, critical power.

Swimming: critical swim speed, pace, prediction.

Physiology: max-HR, Karvonen, BMR/TDEE, MET burn, BMI.

Strength: 1-rep-max and load prescription.

Injury: acute:chronic workload ratio, overuse-injury recognition, and evidence-based acute management (PEACE & LOVE).

Leaderboard

Which model coaches best?

Overall accuracy across all six disciplines. Click any model to read the actual questions, its full reasoning, and where it got the science right or wrong.


Cost & value

Accuracy you can afford to ship.

An AI coach runs on every workout, for every user. The right model is the one that's correct and cheap to serve. We fit a curve of accuracy against cost; each model's score is how far its accuracy sits above what that curve predicts for its price.

Chart: accuracy vs. cost ($ per 1k questions)

Each dot is a model. The dashed line is the value frontier — models no other beats on both accuracy and cost.
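One way to compute such a frontier is a plain dominance check: a model stays on the frontier unless some other model is at least as cheap and at least as accurate, and strictly better on one of the two. The model names and numbers below are made up for illustration.

```python
from typing import NamedTuple


class Model(NamedTuple):
    name: str
    cost_per_1k: float   # dollars per 1k questions
    accuracy: float      # fraction correct, 0 to 1


def dominates(a: Model, b: Model) -> bool:
    """True if a is no worse than b on both axes and strictly better on one."""
    no_worse = a.cost_per_1k <= b.cost_per_1k and a.accuracy >= b.accuracy
    better = a.cost_per_1k < b.cost_per_1k or a.accuracy > b.accuracy
    return no_worse and better


def value_frontier(models: list) -> list:
    """Models that no other model dominates, sorted by cost."""
    frontier = [m for m in models
                if not any(dominates(other, m) for other in models)]
    return sorted(frontier, key=lambda m: m.cost_per_1k)


# Hypothetical results, not real leaderboard data.
models = [
    Model("model-a", 0.50, 0.62),
    Model("model-b", 2.00, 0.71),
    Model("model-c", 3.00, 0.68),   # costs more than model-b and scores lower
    Model("model-d", 9.00, 0.80),
]
frontier_names = [m.name for m in value_frontier(models)]
print(frontier_names)
```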

Best value: accuracy vs. its price

#	Model	$/1k Q	Acc	vs curve

For teams shipping AI coaching

You've decided the coach is the product. Can you prove it's right?

The public leaderboard tells you which base model to start from. The private benchmark tells you whether your coach — your prompts, your retrieval, your fine-tune — is actually correct, at a cost you can ship, and stops an upgrade from silently making it worse.

Pick a base model

Rank models by sport-science correctness and by cost-per-answer, broken out by discipline — so you don't ship on a model that can't prescribe a pace.

Benchmark your coach

Run your actual stack against the procedural question bank. Get a discipline-level scorecard against computed ground truth — not a focus group.

Gate every release

Wire FitnessBench into CI as a regression gate. Swap a model or change a prompt, and find out immediately if the coach got less correct.
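A regression gate of this kind could look roughly like the sketch below. It assumes scorecards are simple discipline-to-accuracy mappings; the scorecard format, the 1-point tolerance, and every name here are hypothetical, not a FitnessBench API.

```python
def find_regressions(baseline: dict, candidate: dict,
                     max_drop: float = 0.01) -> dict:
    """Return disciplines whose accuracy fell by more than max_drop."""
    regressions = {}
    for discipline, base_acc in baseline.items():
        # A discipline missing from the candidate run counts as zero accuracy.
        new_acc = candidate.get(discipline, 0.0)
        if base_acc - new_acc > max_drop:
            regressions[discipline] = (base_acc, new_acc)
    return regressions


def gate(baseline: dict, candidate: dict, max_drop: float = 0.01) -> int:
    """Return a process exit code: 0 to pass the build, 1 to fail it."""
    regressions = find_regressions(baseline, candidate, max_drop)
    for discipline, (old, new) in regressions.items():
        print(f"REGRESSION {discipline}: {old:.2f} -> {new:.2f}")
    return 1 if regressions else 0


# Hypothetical scorecards from the last release and a proposed change.
baseline = {"running": 0.58, "cycling": 0.91, "injury": 0.84}
candidate = {"running": 0.52, "cycling": 0.92, "injury": 0.84}

exit_code = gate(baseline, candidate)
print("exit code:", exit_code)
```

In a CI job, the script would pass this value to `sys.exit` so that a drop in any discipline fails the build.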

Coaching is reasoning.