Anyone can plug numbers into a formula. A coach reasons from them — reads the athlete, prescribes the work, prevents the injury. FitnessBench grades how well AI models do that, across six sports, against exercise science computed and checked in code. Models ace the formulas and stall on the reasoning.
Coaching is the same loop in every sport: read the athlete's numbers, reason, prescribe. We built that loop as questions — fitness from a race result, the pace it implies, the load that won't cause injury — with every answer computed from the literature and graded in code.
Models split cleanly. On single-formula lookups (BMI, one-rep-max, heart-rate zones), accuracy is nearly perfect. The moment a question needs several chained steps of reasoning, accuracy falls off a cliff. Today's models calculate. They don't yet coach.
Best model's overall accuracy across six disciplines — flattered by the easy formulas every model can do.
Best model on the multi-step reasoning tasks — the actual coaching judgment, where the frontier stalls.
Disciplines & task types — including a multi-step tier — every answer computed from a named formula, scored at temperature 0.
FitnessBench doesn't ask for opinions. Each question is generated from an established exercise-science model, so the right answer is computed — and the grading is code.
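To make "generated from a model, graded in code" concrete, here is a minimal sketch of one procedurally generated question. The VDOT regression is the published Daniels & Gilbert formula; the `Question` type, the `make_vdot_question` generator, the 16:00 to 30:00 time range and the 0.5-point tolerance are all hypothetical choices for illustration, not FitnessBench's actual code.

```python
import math
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Question:
    prompt: str
    answer: float     # ground truth computed from the formula
    tolerance: float  # absolute tolerance (assumed value, not from the source)


def vdot(distance_m: float, time_min: float) -> float:
    """Daniels & Gilbert VDOT from a race distance and finishing time."""
    v = distance_m / time_min  # velocity in metres per minute
    vo2 = -4.60 + 0.182258 * v + 0.000104 * v ** 2
    pct_max = (0.8
               + 0.1894393 * math.exp(-0.012778 * time_min)
               + 0.2989558 * math.exp(-0.1932605 * time_min))
    return vo2 / pct_max


def make_vdot_question(seed: int) -> Question:
    """Hypothetical generator: same seed gives the same question, a new seed gives new numbers."""
    rng = random.Random(seed)
    total_s = rng.randint(16 * 60, 30 * 60)  # 5 km time between 16:00 and 30:00
    minutes, seconds = divmod(total_s, 60)
    prompt = (f"A runner finishes 5 km in {minutes}:{seconds:02d}. "
              "Estimate their VDOT.")
    answer = round(vdot(5000, total_s / 60), 1)
    return Question(prompt, answer, tolerance=0.5)
```

Because the answer is derived from the same numbers that appear in the prompt, there is no answer key to memorize: changing the seed changes both.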
Pace per km climbs predictably with distance — the Riegel/VDOT basis for prediction.
Five zones as a share of max HR; Karvonen sets the target bpm for each.
An acute:chronic workload ratio in the 0.8–1.3 band is associated with lower injury risk; spikes above it push into the danger zone.
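The three charts above each rest on a short formula. The sketch below shows one common form of each: Riegel's race-time extrapolation with the usual 1.06 exponent, the Karvonen heart-rate-reserve target, and a simple rolling-average acute:chronic ratio. The function names, and the choice to treat the last week as "acute" and the last four weeks (including it) as "chronic", are assumptions for illustration.

```python
def riegel_time(t1_s: float, d1_km: float, d2_km: float, exponent: float = 1.06) -> float:
    """Riegel: predict the time for d2 from a known time t1 over d1."""
    return t1_s * (d2_km / d1_km) ** exponent


def pace_per_km(time_s: float, distance_km: float) -> float:
    """Average pace in seconds per kilometre."""
    return time_s / distance_km


def karvonen_target(max_hr: float, rest_hr: float, intensity: float) -> float:
    """Karvonen: target bpm at a fraction (0 to 1) of heart-rate reserve."""
    return rest_hr + intensity * (max_hr - rest_hr)


def acwr(weekly_loads: list) -> float:
    """Acute:chronic workload ratio from weekly loads, oldest first.

    Assumed model: acute = most recent week, chronic = mean of the last 4 weeks.
    """
    if len(weekly_loads) < 4:
        raise ValueError("need at least 4 weeks of load")
    acute = weekly_loads[-1]
    chronic = sum(weekly_loads[-4:]) / 4
    return acute / chronic


def in_safe_band(ratio: float, low: float = 0.8, high: float = 1.3) -> bool:
    """True when the ratio sits inside the 0.8 to 1.3 band."""
    return low <= ratio <= high
```

For example, a 20:00 5 km run extrapolates to a slower pace per km at 10 km, and a week of 70 units after three weeks of 40 lands outside the band.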
Wherever exercise science gives a verifiable answer, FitnessBench tests it — from race paces to power zones to injury workload. A named formula stands behind every question.
Fitness, prescription and prediction from race results — plus training-load safety.
Power-based training: zones, stress, sustainable thresholds and power-to-weight.
Critical swim speed from time trials, pace, and CSS-based time prediction.
Heart-rate zones, basal and total energy expenditure, and body composition.
One-rep-max estimation and load prescription for a target rep range.
Workload-spike risk, classic overuse-injury recognition, and evidence-based management.
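A few of the named formulas behind these disciplines fit in a handful of lines each. This is a sketch under stated assumptions: the Coggan TSS and Mifflin-St Jeor equations are the standard published forms, critical swim speed uses the common 400 m / 200 m time-trial pair, and Epley is assumed for one-rep-max because the source does not say which 1RM formula is used. All function names are hypothetical.

```python
def training_stress_score(duration_s: float, normalized_power: float, ftp: float) -> float:
    """Coggan TSS: one hour at FTP scores 100."""
    intensity_factor = normalized_power / ftp
    return (duration_s * normalized_power * intensity_factor) / (ftp * 3600) * 100


def power_to_weight(power_w: float, mass_kg: float) -> float:
    """Power-to-weight in watts per kilogram."""
    return power_w / mass_kg


def critical_swim_speed(t400_s: float, t200_s: float) -> float:
    """CSS in metres per second from 400 m and 200 m time trials."""
    return (400 - 200) / (t400_s - t200_s)


def mifflin_st_jeor(mass_kg: float, height_cm: float, age_y: float, male: bool) -> float:
    """Basal metabolic rate in kcal/day (Mifflin-St Jeor)."""
    return 10 * mass_kg + 6.25 * height_cm - 5 * age_y + (5 if male else -161)


def epley_1rm(weight: float, reps: int) -> float:
    """Estimated one-rep max (Epley, assumed here)."""
    return weight * (1 + reps / 30)


def load_for_reps(one_rm: float, target_reps: int) -> float:
    """Invert Epley: the load expected to allow target_reps repetitions."""
    return one_rm / (1 + target_reps / 30)
```

Total daily energy expenditure would then be the BMR multiplied by an activity factor, and a CSS pace per 100 m is simply `100 / critical_swim_speed(...)`.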
Most benchmarks leak, saturate, or grade with another LLM. FitnessBench is built against each of those failure modes — the score is a measurement, not a vibe.
Computed ground truth: answers come from established models (Daniels & Gilbert, Coggan, Riegel, Mifflin-St Jeor, Gabbett), not opinion.
Code-verified: responses are parsed and checked against a numeric tolerance, never by an LLM judge.
Procedural: every question's numbers are randomized per seed, so nothing is memorizable.
Correctness is the target, not a proxy, so there's no confound to game.
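Grading by code rather than by an LLM judge can be as small as the sketch below: pull the final number out of the model's response and compare it with the computed answer. The "last number wins" parsing rule and the 2% relative tolerance are assumptions for illustration; the source only states that responses are parsed and checked against a numeric tolerance.

```python
import re
from typing import Optional

_NUMBER = re.compile(r"\d+(?:\.\d+)?")
_THOUSANDS_COMMA = re.compile(r"(?<=\d),(?=\d{3}\b)")


def extract_final_number(response: str) -> Optional[float]:
    """Return the last number in the response, or None if there is none.

    Assumption: the model states its final answer last. Thousands
    separators such as "1,780" are removed before matching.
    """
    cleaned = _THOUSANDS_COMMA.sub("", response)
    matches = _NUMBER.findall(cleaned)
    return float(matches[-1]) if matches else None


def grade(response: str, expected: float, rel_tol: float = 0.02) -> bool:
    """True when the parsed answer is within rel_tol of the computed truth."""
    value = extract_final_number(response)
    if value is None:
        return False
    return abs(value - expected) <= rel_tol * abs(expected)
```

A response with no parseable number is simply marked wrong, so no judgment call ever enters the score.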
Running: VDOT, paces, race prediction, mileage safety.
Cycling: Coggan FTP zones, TSS, power-to-weight, critical power.
Swimming: critical swim speed, pace, prediction.
Physiology: max HR, Karvonen, BMR/TDEE, MET burn, BMI.
Strength: one-rep-max and load prescription.
Injury: acute:chronic workload ratio, overuse-injury recognition, and evidence-based acute management (PEACE & LOVE).
Overall accuracy across all six disciplines. Click any model to read the actual questions, its full reasoning, and where it got the science right or wrong.
An AI coach runs on every workout, for every user. The right model is the one that's correct and cheap to serve. We fit accuracy against cost; the "vs curve" score is how far each model's accuracy sits above or below that fitted curve at the price it charges.
| # | Model | Cost ($ per 1k questions) | Accuracy | vs curve |
|---|---|---|---|---|
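One way to read the "vs curve" column is as a residual: fit a trend of accuracy against cost, then report how far each model sits above or below it. The sketch below assumes a straight-line least-squares fit of accuracy against log10 of cost; the source does not say which curve is fitted, so the functional form and the function names are assumptions.

```python
import math


def fit_line(xs: list, ys: list) -> tuple:
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def versus_curve(models: dict) -> dict:
    """Map each model name to accuracy minus the fitted accuracy at its cost.

    models: name -> (cost in $ per 1k questions, accuracy as a fraction).
    Assumed curve: accuracy is linear in log10(cost).
    """
    xs = [math.log10(cost) for cost, _ in models.values()]
    ys = [acc for _, acc in models.values()]
    slope, intercept = fit_line(xs, ys)
    return {
        name: acc - (slope * math.log10(cost) + intercept)
        for name, (cost, acc) in models.items()
    }
```

A positive value means the model is more accurate than its price would predict; a negative value means it is overpriced for what it gets right.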
The public leaderboard tells you which base model to start from. The private benchmark tells you whether your coach — your prompts, your retrieval, your fine-tune — is actually correct, at a cost you can ship, and stops an upgrade from silently making it worse.
Rank models by sport-science correctness and by cost-per-answer, broken out by discipline — so you don't ship on a model that can't prescribe a pace.
Run your actual stack against the procedural question bank. Get a discipline-level scorecard against computed ground truth — not a focus group.
Wire FitnessBench into CI as a regression gate. Swap a model or change a prompt, and find out immediately if the coach got less correct.
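A regression gate of this kind can be a very small script. The sketch below compares a current per-discipline scorecard with a stored baseline and returns a non-zero exit code if any discipline dropped by more than a threshold. The JSON layout (discipline name mapped to accuracy), the file arguments and the one-point threshold are hypothetical; they are not a documented FitnessBench interface.

```python
import json


def find_regressions(baseline: dict, current: dict, max_drop: float = 0.01) -> list:
    """Disciplines whose accuracy fell by more than max_drop versus baseline.

    A discipline missing from the current scorecard counts as accuracy 0.
    """
    failed = []
    for discipline, base_acc in baseline.items():
        new_acc = current.get(discipline, 0.0)
        if base_acc - new_acc > max_drop:
            failed.append(discipline)
    return sorted(failed)


def main(argv: list) -> int:
    """Usage (hypothetical): gate.py baseline.json current.json"""
    with open(argv[1]) as f:
        baseline = json.load(f)
    with open(argv[2]) as f:
        current = json.load(f)
    failed = find_regressions(baseline, current)
    for discipline in failed:
        print(f"REGRESSION in {discipline}: "
              f"{baseline[discipline]:.3f} -> {current.get(discipline, 0.0):.3f}")
    return 1 if failed else 0
```

In a CI job, calling `sys.exit(main(sys.argv))` would fail the build whenever a model swap or prompt change makes the coach less correct in any discipline.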
Benchmark your model — or your whole coaching stack — on computed exercise science.