How Arvo's Senior Reviewer AI catches bad workouts before they ship

An 8-guardrail deterministic layer plus a gpt-5 sub-agent reviews every AI-generated workout before the user sees it. Here's the architecture, the checks, and the numbers.

Alex Moretti
9 min read
April 2026
Engineering · AI Safety · Architecture

How does Arvo prevent bad AI workouts?

Arvo runs an 8-guardrail deterministic review plus a gpt-5 sub-agent on every generated workout. It catches volume explosions, hallucinated exercise names, equipment mismatches, and injury conflicts before the user sees them. It passes 30/30 fixtures and 5/5 real production cases.

TL;DR

  • Every AI-generated workout passes through 8 deterministic guardrails in TypeScript before reaching the user.
  • A gpt-5 sub-agent runs on the recovery path of ExerciseSelector when the primary generation is borderline.
  • Guardrails cover volume caps, hallucinated exercise names, equipment mismatches, injury conflicts, duplicates, missing primary muscles, invalid rep ranges, and incompatible supersets.
  • Passes 30/30 test fixtures plus 5/5 real production edge cases that previously caused user-visible bugs.
  • A weekly Inngest cron aggregates reviewer invocations into an admin digest, giving us a fast feedback loop on drift.

The problem: LLMs hallucinate, even in workout generation

We run 19+ specialized AI agents to build adaptive training programs. The core one, ExerciseSelector, picks the actual exercises, sets, and rep ranges for every session. It works well: around 92% of generations pass validation on the first try.

That leaves 8% of generations that go wrong in some way. Most of those are recoverable with a retry and targeted feedback. But a few slip through with subtle, high-severity bugs:

  • A “reverse cable chest pull” that doesn't exist in our exercise catalog.
  • A beginner push session with 34 total chest sets (physically impossible in 90 minutes).
  • An incline barbell bench prescribed to a user who only owns dumbbells.
  • Overhead pressing scheduled for a user with an active shoulder injury insight.

Retrying the AI usually fixes these, but only if we detect them. The validation loop catches the obvious structural issues (bad JSON, missing fields), but semantic problems like “this exercise name doesn't exist” or “this user can't safely do this” slip through more often than we'd like.

The fix: a dedicated reviewer that looks at the finished workout and says “yes” or “no” before it's persisted.

Why deterministic first, LLM second

Our initial instinct was to just add another gpt-5 call: “here's the generated workout, is it safe?” That works, but it's expensive, slow, and, ironically, susceptible to the same hallucinations we're trying to catch. LLMs reviewing LLMs can miss the exact errors a deterministic check would flag instantly.

So we stacked two layers:

┌──────────────────────────────────────────────────────────┐
│  LAYER 1: Deterministic guardrails (TypeScript)          │
│  - 8 checks, all pure functions, < 10ms                  │
│  - Hard failures (hallucinated exercise) → reject        │
│  - Soft failures (volume 10% over cap) → flag            │
├──────────────────────────────────────────────────────────┤
│  LAYER 2: Senior Reviewer sub-agent (gpt-5)              │
│  - Runs on recovery path only                            │
│  - Evaluates holistic coherence, flow, pacing            │
│  - Returns: approved │ regenerate │ annotate             │
└──────────────────────────────────────────────────────────┘

Layer 1 is dumb but fast. It catches 80%+ of real failures at essentially zero cost. Layer 2 is smart but expensive; it only runs when Layer 1 surfaces something borderline, keeping the happy path free of extra LLM calls.
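The stacking can be sketched as a simple decision function. The `Verdict` and `Guardrail` shapes and the `review` name below are illustrative assumptions, not Arvo's actual API:

```typescript
// Illustrative sketch of the two-layer decision flow.
type Verdict = { pass: boolean; severity: "high" | "soft"; message: string };
type Guardrail<W> = (workout: W) => Verdict;

function review<W>(
  workout: W,
  guardrails: Guardrail<W>[]
): "approved" | "regenerate" | "escalate" {
  const failures = guardrails.map((check) => check(workout)).filter((v) => !v.pass);
  if (failures.some((f) => f.severity === "high")) return "regenerate"; // hard reject
  if (failures.length > 0) return "escalate"; // borderline → layer 2 sub-agent
  return "approved"; // happy path: no extra LLM call
}
```

Hard failures short-circuit straight to regeneration; only soft flags pay for a gpt-5 call.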

The 8 deterministic guardrails

Each guardrail is a pure function that takes the generated workout and returns { pass: boolean, severity, message }. Here's what they check and why.

1. Volume explosion cap

Hard cap on total working sets per session, scaled by experience level:

const VOLUME_CAPS = {
  beginner:     22,   // sets per session
  intermediate: 26,
  advanced:     32,
}

if (totalSets > VOLUME_CAPS[user.experience]) {
  return { pass: false, severity: 'high', ... }
}

These numbers come from literature on weekly MRV (maximum recoverable volume) spread across typical training frequencies. A beginner doing 34 sets in one session isn't training; they're injuring themselves. The cap also protects against a subtle bug where the generator picks “5 sets” for 7 exercises and ends up far above target.

2. Hallucinated exercise names

Every exercise in the output must have an exercise_id that exists in our MuscleWiki cache (1,737 exercises). A generated name like “reverse cable chest pull” with no matching ID fails immediately. This is a hard reject—we regenerate.
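A minimal version of this check, assuming the catalog is loaded as a `Set` of valid IDs (field names here are illustrative, not Arvo's schema):

```typescript
// Sketch: reject any exercise whose ID is missing from the catalog.
// `catalogIds` stands in for the MuscleWiki cache.
function checkExerciseIds(
  exercises: { exercise_id: string; name: string }[],
  catalogIds: Set<string>
): { pass: boolean; severity: "high"; message: string } {
  const unknown = exercises.filter((e) => !catalogIds.has(e.exercise_id));
  return {
    pass: unknown.length === 0,
    severity: "high", // hallucinated names are always a hard reject
    message: unknown.map((e) => `unknown exercise: ${e.name}`).join("; "),
  };
}
```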

3. Equipment mismatch

Cross-reference each exercise's required equipment against the user's profile. Barbell bench for a dumbbell-only home gym is a fail. We found this was the single most common user-visible issue before the reviewer shipped—especially for users who had updated their equipment recently.
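The cross-reference itself is a set-containment test over each exercise's equipment list (field names assumed for illustration):

```typescript
// Sketch: every piece of required equipment must exist in the user's profile.
function checkEquipment(
  exercises: { name: string; equipment: string[] }[],
  userEquipment: Set<string>
): { pass: boolean; mismatched: string[] } {
  const mismatched = exercises
    .filter((e) => !e.equipment.every((item) => userEquipment.has(item)))
    .map((e) => e.name);
  return { pass: mismatched.length === 0, mismatched };
}
```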

4. Injury conflict

If the user has an active injury insight (e.g., left shoulder, severity medium), overhead pressing, barbell bench, and dips get flagged as conflicts. The reviewer doesn't ban them outright—experienced users can override—but it forces the generator to regenerate with explicit avoidance.
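A simplified sketch of the conflict check, hard-coded to the shoulder example; the risk mapping below is a hypothetical stand-in for Arvo's real injury taxonomy:

```typescript
// Hypothetical mapping: exercises that load an injured shoulder.
const SHOULDER_RISK = new Set(["overhead press", "barbell bench press", "dips"]);

function checkInjuryConflicts(
  exercises: { name: string }[],
  activeInjuries: { region: string }[]
): { pass: boolean; conflicts: string[] } {
  const shoulderInjured = activeInjuries.some((i) => i.region.includes("shoulder"));
  const conflicts = shoulderInjured
    ? exercises.map((e) => e.name).filter((n) => SHOULDER_RISK.has(n.toLowerCase()))
    : [];
  return { pass: conflicts.length === 0, conflicts }; // flag, not ban
}
```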

5. Duplicate exercises

Two sets of 3x10 incline dumbbell press in the same session is almost always a generation error. The guardrail allows intentional duplicates for advanced techniques (rest-pause, drop sets) but flags naive duplicates.
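In sketch form, the check walks the session once and skips anything explicitly marked as an intentional technique (the `technique` field is an assumed convention):

```typescript
// Sketch: flag repeated exercise IDs unless marked as an intentional technique.
function checkDuplicates(
  exercises: { exercise_id: string; technique?: string }[]
): { pass: boolean; duplicates: string[] } {
  const seen = new Set<string>();
  const duplicates: string[] = [];
  for (const e of exercises) {
    if (e.technique) continue; // rest-pause / drop-set repeats are intentional
    if (seen.has(e.exercise_id)) duplicates.push(e.exercise_id);
    seen.add(e.exercise_id);
  }
  return { pass: duplicates.length === 0, duplicates };
}
```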

6. Missing primary muscle

If the session type is “Chest + Triceps” but the output has zero chest-primary exercises, something went wrong. This catches rare failure modes where the AI swaps muscle groups mid-generation.
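The check reduces to set membership: every muscle named in the session type must appear as some exercise's primary muscle (field names assumed):

```typescript
// Sketch: each required muscle must be the primary of at least one exercise.
function checkPrimaryMuscles(
  exercises: { primary_muscle: string }[],
  sessionMuscles: string[]
): { pass: boolean; missing: string[] } {
  const present = new Set(exercises.map((e) => e.primary_muscle));
  const missing = sessionMuscles.filter((m) => !present.has(m));
  return { pass: missing.length === 0, missing };
}
```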

7. Invalid rep range

Rep ranges must parse as numeric and fall within [1, 30]. We've seen the model output “30-60 seconds” for time-based exercises in a rep field, which breaks downstream progression math. We reject and regenerate with explicit instructions.
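One way to express the parse-and-bounds rule, assuming rep ranges arrive as strings like "8-12":

```typescript
// Sketch: accept "N" or "N-M" with both ends in [1, 30] and N <= M.
// Rejects outputs like "30-60 seconds" that belong in a duration field.
function checkRepRange(reps: string): boolean {
  const match = /^(\d+)(?:-(\d+))?$/.exec(reps.trim());
  if (!match) return false;
  const low = Number(match[1]);
  const high = match[2] ? Number(match[2]) : low;
  return low >= 1 && high <= 30 && low <= high;
}
```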

8. Incompatible superset

Supersets pair two exercises. If the pair includes two heavy barbell compound lifts (e.g., deadlift + squat) or two unilateral exercises with different setups, the superset is impractical. The guardrail catches pairings that don't physically work in a real gym.
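Simplified to the two-heavy-barbell-compounds rule, the pairing check might look like this (the lift list is illustrative, not exhaustive):

```typescript
// Illustrative list of heavy barbell compounds that shouldn't share a superset.
const HEAVY_BARBELL = new Set(["deadlift", "back squat", "barbell row", "barbell bench press"]);

function supersetCompatible(a: string, b: string): boolean {
  const bothHeavy =
    HEAVY_BARBELL.has(a.toLowerCase()) && HEAVY_BARBELL.has(b.toLowerCase());
  return !bothHeavy; // two heavy compounds back-to-back is impractical in a real gym
}
```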

Results: 30/30 fixtures, 5/5 production cases

Our fixture suite started as 15 synthetic cases—volume explosions, hallucinated names, equipment mismatches. All passed. We then expanded to 30 cases including subtle combinations (e.g., “volume at the cap limit + one duplicate exercise + a borderline rep range”) to stress-test interaction effects.

The more interesting validation was pulling 5 real production bugs from our support logs and running them through the reviewer:

  • Case 1 — “face pull” ambiguity, caught by exercise ID check.
  • Case 2 — 28 chest sets for a beginner, caught by volume cap.
  • Case 3 — barbell lifts for a kettlebell-only profile, caught by equipment mismatch.
  • Case 4 — overhead press with active shoulder injury, caught by injury conflict.
  • Case 5 — duplicated incline DB press (6 sets instead of 3), caught by duplicate check.

5/5. The reviewer would have intercepted all of them before they reached the user. These weren't handpicked to be easy catches—they were the actual top-5 workout generation issues we had logs for.

Observability: weekly digest, not silent rejection

A guardrail that silently regenerates is a guardrail you can't debug. Every reviewer invocation is logged via the admin client—reason, severity, input workout, output decision. An Inngest cron aggregates a week of these into an admin digest:

Weekly Senior Reviewer digest (2026-04-14 → 2026-04-21)

Total invocations: 3,241
Pass rate (layer 1):   94.2%
Soft flags (allowed):   3.1%
Hard rejects:           2.7%  (88 regenerations)

Top reject reasons:
  1. Equipment mismatch          41
  2. Volume cap exceeded         22
  3. Duplicate exercises         12
  4. Hallucinated exercise ID     8
  5. Invalid rep range            5

Sub-agent (layer 2) invocations: 154
  Approved:    102
  Regenerate:   38
  Annotated:    14

The digest is the feedback loop that prevents drift. If equipment mismatch spikes one week, that's a signal that we've changed something in the prompt or the equipment taxonomy that's breaking the generator's assumptions. We can catch it in days, not months.
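The aggregation behind the digest is a straightforward fold over the week's logs; the log shape below is an assumption, not Arvo's actual schema:

```typescript
// Sketch: fold reviewer logs into reject-reason counts for the weekly digest.
type ReviewLog = { decision: "pass" | "soft-flag" | "hard-reject"; reason?: string };

function topRejectReasons(logs: ReviewLog[]): [string, number][] {
  const counts = new Map<string, number>();
  for (const log of logs) {
    if (log.decision !== "hard-reject" || !log.reason) continue;
    counts.set(log.reason, (counts.get(log.reason) ?? 0) + 1);
  }
  return Array.from(counts.entries()).sort((x, y) => y[1] - x[1]); // most frequent first
}
```

A spike in any one reason week-over-week is exactly the drift signal the digest is meant to surface.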

Why it matters

The reviewer isn't a silver bullet. It can't catch qualitative issues like “this exercise selection is technically valid but suboptimal for your goal.” That's what the learning layer and insights generators are for.

What it does give us is a floor: no user ever receives a workout that fails any of the 8 deterministic checks. Combined with the validation retry loop in the primary generator, that floor is high enough that support tickets for “weird workout” have dropped meaningfully since ship.

This reviewer powers the workout generation inside Arvo. If you want to see it in practice, the app is free to try—Pro starts at €4/month (pricing). For the broader architecture, see our write-up on the multi-agent periodization engine and the companion features like AI Cardio Coach and Gym Crew.


Volume caps are derived from Schoenfeld et al. (2017) on weekly dose-response for hypertrophy, adapted for per-session ceilings. Injury conflict heuristics are grounded in generally accepted orthopedic guidance; Arvo's recommendations are not a substitute for medical advice.