How I Built a Real-Time Periodization Engine Using Multi-Agent AI

Most workout apps give you a static plan. Here's how I built a system where 30+ specialized AI agents coordinate to adapt your training between sets.

Arvo Team
12 min read
March 2026
Architecture · AI · Cost Optimization

TL;DR

  • Arvo uses 30+ specialized AI agents organized in 5 layers (Planning, Execution, Validation, Learning, Support) to generate and adapt workouts in real-time.
  • Cost routing sends simple sessions to gpt-5-mini (10x cheaper) while complex sessions use the best model — cutting average cost to ~$0.01 per workout.
  • Prompt architecture is structured static → semi-static → dynamic to maximize OpenAI prefix caching, saving ~60% on input token costs.
  • Constraint-based generation pre-calculates volume targets in TypeScript and validates AI output against them, achieving 92% first-attempt success rate.
  • Post-workout learning agents detect injury patterns, track implicit preferences, and calibrate RIR estimation — feeding insights back into future workouts.

The Problem: Static Plans in a Dynamic Sport

You walk into the gym. Your app says bench press, 80kg, 3 sets of 10. You do set 1 and it feels like RPE 6—you could have done 14 reps. A human coach would say “bump it to 85.” Your app says nothing. You do set 2 at 80kg. Same thing. Set 3, same thing. You just did 3 sets of submaximal work that barely challenged your body.

The next day, you're sore from yesterday's deadlifts and slept 5 hours. Your app says squat 120kg. A coach would see the fatigue in your warm-ups and drop you to 110. Your app doesn't know you're tired. It doesn't know you slept badly. It doesn't know your left knee has been bothering you for two weeks.

This is the fundamental gap in fitness apps. They can track what you did, but they can't decide what you should do next based on how you're actually performing right now. Real periodization—the science of planning training over time—requires adapting to daily readiness, accumulated fatigue across a training cycle, and individual biomechanical constraints. That's a lot of context for a single AI prompt.

I built Arvo to solve this. It took two years, 30+ specialized AI agents, and a lot of wrong turns. Here's the technical breakdown.

The Architecture: Why Multi-Agent, Not Multi-Prompt

My first version was a single massive prompt. I stuffed everything into one GPT call: user profile, exercise history, periodization rules, equipment constraints, injury context, training methodology. It was 8,000+ tokens of system prompt, and the results were inconsistent. The model would respect the periodization phase but forget the equipment constraints, or nail the exercise selection but ignore the injury context.

The problem is context competition. When you give an LLM 15 different constraints in a single prompt, it has to implicitly prioritize them. Sometimes it gets the priority right, sometimes it doesn't. And you can't debug which constraint won when things go wrong.

So I broke it into specialized agents. Each agent has one job, a focused prompt, and a clear contract for its input/output. Here's the layer diagram:


┌─────────────────────────────────────────────────────────┐
│  PLANNING LAYER                                         │
│  SplitPlanner → ExerciseSelector → WorkoutRationale     │
│  Output: Full training cycle (7-10 day split plan)      │
├─────────────────────────────────────────────────────────┤
│  EXECUTION LAYER (real-time, in-gym)                    │
│  ProgressionCalculator → AudioScriptGenerator           │
│  Output: Per-set weight/rep targets, voice coaching     │
├─────────────────────────────────────────────────────────┤
│  VALIDATION LAYER                                       │
│  ExerciseAdditionValidator │ ModificationValidator      │
│  SubstitutionAgent │ ReorderValidator │ EquipmentCheck  │
│  Output: approved / caution / not_recommended           │
├─────────────────────────────────────────────────────────┤
│  LEARNING LAYER (post-workout)                          │
│  InsightsGenerator → MemoryConsolidator →               │
│  TrainingInsights → TechniqueRecommender                │
│  Output: Injury flags, learned preferences, patterns    │
├─────────────────────────────────────────────────────────┤
│  SUPPORT                                                │
│  SupportChat │ SkipImpact │ ApproachRecommender │ ...   │
│  Output: Conversational Q&A, impact analysis            │
└─────────────────────────────────────────────────────────┘

Each layer has different latency requirements and cost profiles. The Planning layer runs once when generating a workout (can take 30-90 seconds, uses the best model). The Execution layer runs between sets (must be fast, uses cheaper models). The Learning layer runs post-workout in the background (no latency constraint, optimizes for quality). The Validation layer runs on-demand when users modify their workout mid-session.
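The post doesn't show the per-agent contract, but the idea — one job, a focused prompt built from typed input, and a typed, parsed output — can be sketched in TypeScript. All names here are illustrative, not Arvo's actual code:

```typescript
// Hypothetical sketch of a per-agent contract: each agent declares a
// model tier, builds a focused prompt from typed input, and parses a
// typed output. Names are illustrative, not Arvo's actual code.
interface AgentContract<In, Out> {
  name: string
  tier: 'best' | 'cheap'          // resolved to a concrete model by the router
  buildPrompt(input: In): string  // focused prompt for this one job
  parse(raw: string): Out         // validated, typed output
}

// Example instance: a validator that judges a user's mid-session change
type ModificationInput = { exercise: string; change: string }
type Verdict = { status: 'approved' | 'caution' | 'not_recommended' }

const modificationValidator: AgentContract<ModificationInput, Verdict> = {
  name: 'ModificationValidator',
  tier: 'cheap',
  buildPrompt: (m) =>
    `User wants to change ${m.exercise}: ${m.change}. Respond with a verdict JSON.`,
  parse: (raw) => JSON.parse(raw) as Verdict,
}
```

The point of the explicit contract is debuggability: when a constraint is violated, you know exactly which agent owned it.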

What each layer actually does

SplitPlanner designs the macro structure: a 7-10 day training cycle with specific session types (Push A, Pull B, Legs A, etc.), volume targets per muscle group, and periodization phase. It knows about 6 different training methodologies (5/3/1, FST-7, Y3T, Mountain Dog, etc.) and their specific constraints.

ExerciseSelector is the heaviest agent. Given a session type (e.g. “Push A, chest 12 sets, shoulders 6 sets, triceps 4 sets”), it selects the specific exercises, sets, rep ranges, tempo, rest times, and advanced techniques. It receives the user's equipment list, injury context, recent exercise history (to avoid repetition), and learned preferences from past workouts.

ProgressionCalculator handles the between-set adaptation. When you log a set at RPE 6 (too easy), it recalculates the target for your next set. When you're at RPE 9.5 (near failure), it might suggest dropping weight or reducing reps. This runs on a cheaper, faster model because the decision space is small: same weight, more weight, or less weight.
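Because the decision space is that small, the core of the progression logic can be sketched as a simple rule over the logged RPE. The thresholds and percentages below are illustrative assumptions, not Arvo's actual tuning:

```typescript
// Illustrative sketch (not Arvo's actual code): map the logged RPE of the
// last set to one of three adjustments for the next set.
type Adjustment = 'increase' | 'hold' | 'decrease'

function nextSetAdjustment(rpe: number): Adjustment {
  if (rpe <= 7) return 'increase'   // too easy: add weight
  if (rpe >= 9.5) return 'decrease' // at or near failure: back off
  return 'hold'                     // productive range: repeat
}

function nextWeightKg(currentKg: number, rpe: number): number {
  const adj = nextSetAdjustment(rpe)
  if (adj === 'increase') return Math.round(currentKg * 1.05) // ~5% bump
  if (adj === 'decrease') return Math.round(currentKg * 0.93) // ~7% drop
  return currentKg
}
```

For the bench-press example from the intro, `nextWeightKg(80, 6)` bumps the target to 84kg — roughly the "bump it to 85" a coach would call out.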

InsightsGenerator runs after each workout and looks for patterns: “user has reported shoulder discomfort 3 times in 2 weeks on overhead pressing movements” gets flagged as a potential injury insight with a severity level. This insight then feeds back into the ExerciseSelector for future workouts—it might avoid overhead pressing or suggest alternatives.

MemoryConsolidator tracks implicit preferences. If you substitute dumbbell bench for barbell bench 4 times in a row, the system learns “this user prefers dumbbell bench” with a confidence score. High-confidence memories become hard constraints for future exercise selection.
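A minimal sketch of that consolidation step, assuming streak-based confidence (the streak threshold and saturation point are my illustrative numbers, not Arvo's):

```typescript
// Sketch of implicit-preference tracking: count the current streak of
// identical substitutions and convert it to a confidence score.
type Memory = { preferred: string; over: string; confidence: number }

function consolidate(substitutions: { from: string; to: string }[]): Memory | null {
  if (substitutions.length === 0) return null
  const last = substitutions[substitutions.length - 1]
  // Count the streak of identical substitutions ending at the most recent one
  let streak = 0
  for (let i = substitutions.length - 1; i >= 0; i--) {
    const s = substitutions[i]
    if (s.from === last.from && s.to === last.to) streak++
    else break
  }
  if (streak < 3) return null // not enough signal yet
  return {
    preferred: last.to,
    over: last.from,
    confidence: Math.min(1, streak / 5), // saturates after 5 in a row
  }
}
```

Four barbell-to-dumbbell substitutions in a row would yield a 0.8-confidence memory; high-confidence memories then become hard constraints for selection.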

Three Technical Decisions That Actually Mattered

1. Cost routing: 10x savings on simple workouts

Not every workout needs the best model. A “Biceps & Triceps” session for an experienced lifter with no injuries is straightforward. A “Full Body” session for a beginner with knee pain and custom equipment is complex. I route them to different models:

// workout-generator.service.ts
const isSimpleSession =
  targetMuscleCount > 0 && targetMuscleCount <= 2
  && !hasHighSeverityInsights
  && isExperienced
  && workoutType !== 'full_body'

// undefined falls through to the service default (the best model)
const routedModel = isSimpleSession ? 'gpt-5-mini' : undefined
// gpt-5-mini: $0.30/1M input, $1.20/1M output
// gpt-5.4:    $2.50/1M input, $15.00/1M output

The criteria: 2 or fewer target muscle groups, no active injury insights, 1+ years of training experience, and not a full-body session. This catches about 40-50% of workouts. The cost difference per request is roughly $0.004 vs $0.040—a 10x multiplier that compounds fast when you have users generating daily workouts.

The key insight: the model quality difference is negligible for simple sessions because the prompt is extremely constrained. When you tell the model “select exactly 4 exercises for biceps and triceps, 3 sets each, from this equipment list”, there's not much room for the bigger model to be meaningfully better.

2. Prompt architecture for prefix caching

OpenAI automatically caches identical prompt prefixes longer than 1024 tokens at a 75% discount. I restructured all major prompts into three sections:

// exercise-selector.agent.ts prompt structure

// SECTION 1: STATIC (~4,200 tokens)
// Identical for all users on the same training approach
// → Cached after first request, 75% discount on subsequent
"Training approach: Kuba Method guidelines..."
"Output format: JSON schema..."
"Muscle taxonomy: exact keys required..."
"Advanced techniques: drop sets, rest-pause..."

// SECTION 2: SEMI-STATIC (~1,200 tokens)
// Changes per mesocycle phase (weekly)
// → Cached within same training week
"Periodization: Week 3, Accumulation phase..."
"Caloric phase: Bulk, +300 kcal surplus..."

// SECTION 3: DYNAMIC (~2,500 tokens)
// Changes every request
// → Never cached
"User: 28yo male, 3 years experience..."
"Today's session: Push A, chest 12 sets..."
"Recent exercises: [avoid these]..."
"Active insights: left shoulder pain..."

The ordering matters. OpenAI caches from the beginning of the prompt, so the static section must come first. For a typical user, ~5,400 of 7,900 input tokens hit the cache (68%), reducing the effective input cost from $2.50 to roughly $1.20 per million tokens.
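The effective rate falls out of a one-line formula: cached tokens are billed at the discounted rate, the remainder at full price. As a sanity check on the numbers above:

```typescript
// Effective input cost per 1M tokens with prefix caching: cached tokens
// are billed at a discount (75% here), the rest at the full rate.
function effectiveInputCost(
  fullRatePerM: number, // $/1M tokens, uncached
  totalTokens: number,
  cachedTokens: number,
  discount = 0.75,
): number {
  const cachedShare = cachedTokens / totalTokens
  return fullRatePerM * (1 - discount * cachedShare)
}

// With ~5,400 of 7,900 tokens cached at $2.50/1M:
const cost = effectiveInputCost(2.5, 7900, 5400) // ≈ $1.22 per 1M tokens
```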

A subtlety: users on the same training approach (e.g., everyone doing “Kuba Method”) share the same Section 1 cache. So the first user of the day pays full price, but every subsequent user on the same approach gets the discount. With 6 supported approaches, the caches warm up fast.
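The assembly itself is trivial once the sections are ordered correctly — the only rule that matters is that the most stable content comes first, because any change in an early section invalidates the cache for everything after it. A sketch (function and parameter names are mine, not Arvo's):

```typescript
// Assemble the prompt in stability order: a change in an earlier section
// invalidates the cached prefix for everything that follows it.
function buildPrompt(
  staticSection: string,     // per training approach, stable for weeks
  semiStaticSection: string, // per mesocycle phase, changes weekly
  dynamicSection: string,    // per request, never cached
): string {
  return [staticSection, semiStaticSection, dynamicSection].join('\n\n')
}
```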

3. Constraint-based generation, not free-form

The biggest quality improvement came from telling the AI exactly what to produce, not asking it to figure it out. Instead of “generate a push workout,” the prompt says:

TARGET VOLUMES:
• chest: 12 sets
• shoulders: 6 sets
• triceps: 4 sets
TOTAL: 22 sets

EXERCISES REQUIRED: 5
SETS PER EXERCISE: 3 (from approach constraints)
→ 5 × 3 = 15 set capacity

VOLUME GAP: 22 - 15 = 7 sets
→ Recalculated: some exercises need 4-5 sets each

CONSTRAINT: Primary muscles count 1.0x, secondary 0.5x
toward volume targets. Validate ±20% per muscle.

This pre-calculation happens in TypeScript, not in the prompt. The AI receives exact constraints and just needs to pick exercises that satisfy them. I then validate the output—count the actual volume achieved per muscle group and check it's within 20% of the target. If validation fails, the AI gets specific feedback and retries (up to 3 attempts):

// base.agent.ts - validation retry loop (simplified)
let lastFeedback: string | undefined

for (let attempt = 1; attempt <= maxAttempts; attempt++) {
  const feedbackSection = lastFeedback
    ? `⚠️ PREVIOUS ATTEMPT ${attempt - 1} FAILED:
       ${lastFeedback}
       ✅ CORRECTIVE ACTIONS REQUIRED:
       - Fix the specific issues above
       - Double-check volume calculations`
    : ''

  const result = await this.complete(
    prompt + feedbackSection,
    targetLanguage,
    baseTimeout * (1 + (attempt - 1) * 0.5), // progressive timeout: 1x, 1.5x, 2x
  )

  const validation = await validationFn(result)
  if (validation.valid) return result
  lastFeedback = validation.feedback
}

throw new Error(`Validation failed after ${maxAttempts} attempts: ${lastFeedback}`)

In practice, the first attempt succeeds ~92% of the time. When it fails, it's usually a volume mismatch or an invalid rep range (the model sometimes outputs seconds instead of reps for time-based exercises). A retry with targeted feedback almost always fixes it.
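The validation side can be sketched directly from the rules stated above — primary muscles count 1.0x, secondary 0.5x, and each muscle must land within ±20% of its target. Type and function names here are illustrative:

```typescript
// Sketch of the post-generation volume check: tally achieved volume per
// muscle (primary 1.0x, secondary 0.5x) and enforce the ±20% tolerance.
type Exercise = { primary: string; secondary: string[]; sets: number }

function achievedVolume(exercises: Exercise[]): Record<string, number> {
  const volume: Record<string, number> = {}
  for (const ex of exercises) {
    volume[ex.primary] = (volume[ex.primary] ?? 0) + ex.sets
    for (const m of ex.secondary) {
      volume[m] = (volume[m] ?? 0) + ex.sets * 0.5
    }
  }
  return volume
}

function validateVolume(
  exercises: Exercise[],
  targets: Record<string, number>,
  tolerance = 0.2,
): { valid: boolean; feedback: string } {
  const achieved = achievedVolume(exercises)
  const issues: string[] = []
  for (const [muscle, target] of Object.entries(targets)) {
    const got = achieved[muscle] ?? 0
    if (Math.abs(got - target) > target * tolerance) {
      issues.push(`${muscle}: got ${got} sets, target ${target} (±20%)`)
    }
  }
  return { valid: issues.length === 0, feedback: issues.join('; ') }
}
```

The `feedback` string is exactly what gets fed back into the retry prompt, which is why retries converge so quickly: the model is told which muscle missed and by how much.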

What I Got Wrong

The $47/month reasoning chain disaster

OpenAI's Responses API lets you chain responses by passing a previous_response_id, preserving the model's chain-of-thought across calls. I thought this would be perfect for workout generation: the model could “remember” its reasoning from your previous workout and build on it.

In testing, it improved quality by about 4% on my evaluation benchmark. So I shipped it.

Two weeks later, I checked costs. Seven power users who trained daily had accumulated 200,000+ token contexts. Each workout generation for these users cost ~$3 instead of $0.04. One user alone was costing $47/month in API calls. The chain was growing by ~12,000 tokens per workout and never getting trimmed.

The fix was embarrassingly obvious: my prompt already contains all the context the model needs (user profile, recent exercises, periodization phase, injury insights, learned memories). The reasoning chain was adding zero new information, just multiplying the cost (up to ~75x for the heaviest users). I disabled it entirely and saw no quality degradation in production. The 4% improvement in my benchmark was noise.

RIR estimation is harder than I thought

RIR (Reps In Reserve) is the number of reps you could have done but didn't. It's the gold standard for auto-regulation in strength training. The problem: most people are terrible at estimating it. Beginners routinely report RIR 3 when they're actually at RIR 0 (true failure). Advanced lifters tend to underestimate.

My first version trusted user-reported RIR directly: if you said RIR 3, the system would keep your weight the same. But beginners would stagnate for weeks because they were unknowingly training at failure while reporting “3 reps in reserve.”

The current approach calibrates implicitly. If a user consistently reports RIR 2-3 but never increases weight or reps over 3+ weeks, the system flags potential RIR miscalibration and adjusts its progression logic to be more aggressive. It's not perfect, but it catches the most common failure mode.
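The stagnation heuristic reduces to two conditions over a few weeks of logs: the user keeps reporting headroom, but the load never moves. A sketch with illustrative thresholds (Arvo's actual windows and cutoffs may differ):

```typescript
// Sketch of the RIR-miscalibration heuristic: reported headroom (RIR >= 2)
// combined with a flat top set over 3 consecutive weeks suggests the user
// is underreporting proximity to failure. Thresholds are illustrative.
type WeeklyLog = { avgReportedRir: number; topSetKg: number }

function flagRirMiscalibration(weeks: WeeklyLog[]): boolean {
  if (weeks.length < 3) return false
  const recent = weeks.slice(-3)
  const claimsHeadroom = recent.every((w) => w.avgReportedRir >= 2)
  const first = recent[0].topSetKg
  const stagnant = recent.every(
    (w) => Math.abs(w.topSetKg - first) < first * 0.01, // <1% movement
  )
  return claimsHeadroom && stagnant
}
```

When the flag fires, the progression logic gets more aggressive rather than trusting the reported RIR at face value.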

Where It Stands

The system generates a complete, personalized workout in 30-90 seconds depending on complexity. Average cost per generation is ~$0.01 after caching and model routing (down from ~$0.04 before optimization). The agent count has grown to 31 files in production, though not all run on every request—most users trigger 3-5 agents per workout.

What I'm most proud of is the adaptation loop. After 2-3 weeks of logging workouts, the system has enough data to make genuinely personalized decisions: it knows your equipment, your preferences (learned from substitution patterns), your weak points, your injury history, and your current fatigue level. The prompts are still doing the heavy lifting, but the context they receive is deeply individual.

What I'm still not happy with: generation speed. 30-90 seconds is too long. I'm exploring pre-generation (create tomorrow's workout overnight) and more aggressive caching of common workout patterns. The UX currently masks the wait with progress indicators, but it's a band-aid.

Try It

This architecture powers Arvo, a training app I built for iOS and Android. If you train seriously and want to see what AI-adapted periodization feels like in practice, I'd genuinely love feedback. The pricing page explains the business model (freemium, starting at €4/month Pro).

If you have questions about the architecture, I'm happy to go deep in the comments. The biggest open problem I'm working on is making the system faster without sacrificing personalization quality—pre-generation with delta updates seems promising but introduces consistency challenges.


References: Helms et al. (2016) on RPE-based auto-regulation; Zourdos et al. (2016) on RIR as a tool for resistance training prescription; Schoenfeld et al. (2017) on volume dose-response for hypertrophy. The periodization model draws heavily from Renaissance Periodization's MEV/MAV/MRV framework.