We Cut AI Costs by 54% Moving to GPT-5.4-mini
A practical guide to migrating from GPT-5.4 to GPT-5.4-mini/nano while maintaining quality. Cost routing, prompt restructuring, and the tradeoffs we made.
How to reduce OpenAI API costs?
The three most impactful strategies: (1) route simple requests to cheaper models like GPT-5.4-mini/nano, (2) restructure prompts for prefix caching (static content first), and (3) reduce output tokens with structured JSON schemas. We cut costs by 54% combining all three.
TL;DR
- Migrated from GPT-5.4 to GPT-5.4-mini (primary) and GPT-5.4-nano (simple tasks), cutting average cost per request by 54%.
- Quality remained within 2% on our eval benchmark for structured tasks — the key is highly constrained prompts with clear output schemas.
- Prefix caching saves an additional 50-70% on input tokens by ordering prompts static → semi-static → dynamic.
- Cost routing sends simple requests (2-3 parameters, low ambiguity) to nano, complex requests to mini, and edge cases to the full model.
- Total monthly AI spend dropped from ~$1,200 to ~$550 with no user-visible quality regression.
The Starting Point: $1,200/Month and Growing
Arvo runs 19+ AI agents in production. Every workout generation triggers a chain of calls: the multi-agent periodization engine selects exercises, plans splits, generates insights, validates output, and logs progression data. Support chat handles freeform user questions with function-calling tools. Behind the scenes, smaller agents handle volume calculations, fatigue scoring, and exercise name normalization.
Before optimization, our monthly OpenAI bill sat at ~$1,200 and was growing at roughly 15% month-over-month as user count climbed. At that growth rate we'd cross $2,000/month within four months if we did nothing. Here's where the money was going:
- ExerciseSelector — $480/mo (40%). The heaviest agent: large system prompts with training methodology, muscle taxonomy, and periodization rules, called once per session in every workout generation.
- SplitPlanner — $240/mo (20%). Plans the weekly split structure. Called less frequently but with large context windows including full cycle history.
- InsightsGenerator — $180/mo (15%). Analyzes training logs to produce actionable feedback like “your chest volume has stagnated for 3 weeks.”
- SupportChat — $120/mo (10%). Freeform chat with function calling. Lower volume but longer conversations with multiple tool calls per session.
- Validation agents — $96/mo (8%). Post-generation checks: volume within range, no contraindicated exercises for flagged injuries, correct JSON schema.
- Other (progression, memory, normalization) — $84/mo (7%).
The wake-up call came when we profiled individual users. One power user — generating 2+ workouts daily, asking follow-up questions, and triggering full re-plans — was costing us $47/month in reasoning chains alone. At a $9.99/month subscription price, the unit economics were upside down. We needed a systematic approach to cost reduction that wouldn't compromise the training quality our users rely on.
Strategy 1: Model Routing
The core insight is simple: not every request needs the best model. A request like “pick 4 biceps exercises from this list of 12, respecting these constraints” is a structured selection task. The model doesn't need deep reasoning — it needs to follow rules and output valid JSON. A request like “design a full-body workout for someone with a rotator cuff injury, a hip impingement, and 3 weeks until a powerlifting meet” requires genuine multi-constraint reasoning.
We classify each workout generation request by complexity at the routing layer before any AI call happens:
```typescript
// workout-generator.service.ts
const isSimpleSession =
  targetMuscleCount > 0 && targetMuscleCount <= 2
  && !hasHighSeverityInsights
  && isExperienced
  && workoutType !== 'full_body';

const model = isSimpleSession ? 'gpt-5.4-mini' : 'gpt-5.4';
// gpt-5.4-mini: ~$0.004/request avg
// gpt-5.4:      ~$0.040/request avg
```

The criteria are deliberately conservative. We only route to the cheaper model when the task is unambiguous: few target muscles, no injury flags or high-severity insights that require careful reasoning, an experienced user whose profile is well-established, and not a full-body session (which requires balancing volume across many muscle groups).
In practice, roughly 45% of workout generations qualify as “simple” under these rules. That's a significant portion of traffic shifting from ~$0.04/request to ~$0.004/request — a 10x reduction on those calls. The key insight: for highly constrained prompts where the model is selecting from a fixed list and following explicit rules, cheaper models perform equivalently. There's no creative reasoning to degrade.
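The snippet above shows the two-way split inside workout generation; the smaller utility agents mentioned in the TL;DR get a nano tier as well. Here's a minimal sketch of what a three-tier router can look like, assuming a hypothetical `classifyComplexity` helper (the tiers and thresholds are illustrative, not our production code):

```typescript
// Hypothetical three-tier router; classifyComplexity and its
// thresholds are illustrative, not Arvo's production code.
type Complexity = 'trivial' | 'simple' | 'complex';

function classifyComplexity(req: {
  parameterCount: number;
  hasInjuryFlags: boolean;
  isFullBody: boolean;
}): Complexity {
  if (req.hasInjuryFlags || req.isFullBody) return 'complex';
  if (req.parameterCount <= 3) return 'trivial';
  return 'simple';
}

const MODEL_BY_COMPLEXITY: Record<Complexity, string> = {
  trivial: 'gpt-5.4-nano',  // normalization, simple lookups
  simple: 'gpt-5.4-mini',   // constrained selection tasks
  complex: 'gpt-5.4',       // multi-constraint reasoning
};
```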
Impact from model routing alone: ~$350/month in savings.
Strategy 2: Prompt Restructuring for Prefix Caching
OpenAI automatically caches identical prompt prefixes longer than 1,024 tokens at a 75% discount on input token costs. The catch: the caching is prefix-based, meaning it only works on the contiguous beginning of the prompt. One dynamic token early in the prompt breaks the cache for everything after it.
Before optimization, our prompts mixed static and dynamic content with no particular ordering. The training approach guidelines might appear after the user's profile. The output schema might be sandwiched between session-specific data. This meant virtually zero cache hits because the prefix diverged immediately.
We restructured every major agent's prompt into three strict layers: static content first, semi-static content second, dynamic content last.
```typescript
// exercise-selector.agent.ts
const prompt = [
  // STATIC: identical for all users on same approach (~4,200 tok)
  approachGuidelines,    // training methodology rules
  outputSchema,          // JSON output format
  muscleTaxonomy,        // exact muscle group keys
  advancedTechniques,    // drop sets, rest-pause, etc.

  // SEMI-STATIC: changes weekly (~1,200 tok)
  periodizationContext,  // "Week 3, Accumulation"
  caloricPhaseContext,   // "Bulk, +300 surplus"

  // DYNAMIC: changes every request (~2,500 tok)
  userProfile,           // age, experience, equipment
  sessionTarget,         // "Push A, chest 12 sets"
  recentExercises,       // avoid repetition
  activeInsights,        // injury flags
].join('\n\n');
```

The static block (~4,200 tokens) is identical for every user on the same training approach. Since Arvo supports a handful of approaches (hypertrophy, strength, powerbuilding, etc.), these prefixes are shared across thousands of requests. The semi-static block changes at most weekly — periodization phase and caloric context stay constant for 5–7 days per user, and many users share the same phase.
After restructuring, our cache hit rate climbed to ~68% of input tokens hitting cache across users on the same approach. On a prompt averaging ~7,900 input tokens, that means ~5,400 tokens are served at 75% discount.
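To make the arithmetic concrete, here's the back-of-the-envelope math on effective input savings (a sketch; all numbers come from the paragraph above):

```typescript
// Cache savings math using the figures reported above.
const inputTokens = 7_900;   // avg prompt size after restructuring
const cachedShare = 0.68;    // observed cache hit rate
const cacheDiscount = 0.75;  // discount on cached input tokens

const cachedTokens = inputTokens * cachedShare;            // ~5,372
const effectiveTokens =
  (inputTokens - cachedTokens) + cachedTokens * (1 - cacheDiscount);
// ~2,528 full-price + ~1,343 discounted-equivalent ≈ 3,871

const inputSavings = 1 - effectiveTokens / inputTokens;    // ~0.51
console.log(`effective input cost: -${Math.round(inputSavings * 100)}%`);
```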
We also compressed the individual prompt sections during this work. Caloric phase context went from ~1,200 tokens to ~400. Cycle fatigue context dropped from ~800 to ~400. Advanced techniques were reformatted from prose into a compact table. The compression alone reduced total prompt size by ~30%, and the reordering maximized caching on what remained.
Impact from prefix caching: ~$180/month in input cost savings.
Strategy 3: Structured Outputs to Reduce Token Waste
Before optimization, our ExerciseSelector would generate verbose rationales for every exercise choice — 100 to 200 words per exercise explaining why it was selected. Across 5–8 exercises per session, that's 500–1,600 tokens of output that users never see (the rationale is stored for debugging but not displayed). Output tokens are more expensive than input tokens, so this was pure waste.
We capped the rationale field at 20 words in the prompt instructions and enforced it through a Zod schema with structured JSON output:
```typescript
const exerciseSchema = z.object({
  name: z.string(),
  sets: z.number().int().min(1).max(8),
  reps: z.string(),                 // "8-12" or "30s"
  rir: z.number().int().min(0).max(5),
  restSeconds: z.number().int(),
  rationale: z.string().max(100),   // capped at ~20 words
  technique: z.enum([
    'standard', 'drop_set', 'rest_pause', 'myo_reps',
  ]).optional(),
});
```

The schema serves double duty: it reduces output tokens (the model can't ramble when the schema constrains it) and it eliminates parsing failures (no more regex extraction from freeform text). Structured outputs also let the model skip generating formatting tokens — no markdown headers, no bullet points, no "Here's your workout:" preambles.
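For reference, here's roughly how a Zod schema like this plugs into the OpenAI Node SDK's structured-output helper. A minimal sketch: it reuses `exerciseSchema` and `prompt` from the snippets above, and the model name is the one this article uses, not a guaranteed identifier.

```typescript
import OpenAI from 'openai';
import { zodResponseFormat } from 'openai/helpers/zod';
import { z } from 'zod';

const client = new OpenAI();

// Wrap the per-exercise schema in a top-level workout object.
const workoutSchema = z.object({ exercises: z.array(exerciseSchema) });

const completion = await client.beta.chat.completions.parse({
  model: 'gpt-5.4-mini',  // model name as used in this article
  messages: [{ role: 'user', content: prompt }],
  response_format: zodResponseFormat(workoutSchema, 'workout'),
});

// Typed, schema-validated output; no regex extraction needed.
const workout = completion.choices[0].message.parsed;
```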
Total output tokens dropped by ~35% per request across all agents where we applied structured schemas.
Impact from structured outputs: ~$120/month in savings.
The Migration Process
We didn't flip a switch. Before changing any production model, we built an eval benchmark: 200 representative workout generation requests covering the full range of complexity — simple single-muscle sessions, complex full-body workouts, sessions with injury constraints, deload weeks, and edge cases like “first workout ever” for new users.
Each request was scored on four dimensions by human raters:
- Exercise appropriateness — Are the selected exercises suitable for the target muscles, equipment, and experience level?
- Volume accuracy — Does total volume (sets × reps) match the prescribed target within tolerance?
- Periodization compliance — Does the workout respect the current mesocycle phase (accumulation, intensification, deload)?
- Injury respect — Are contraindicated movements correctly avoided for users with flagged conditions?
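A benchmark like this doesn't need heavy tooling. Here's a minimal sketch of how the cases and human scores above might be typed (the names are illustrative, not Arvo's actual code):

```typescript
// Illustrative eval-benchmark types; not Arvo's actual code.
type Dimension =
  | 'exercise_appropriateness'
  | 'volume_accuracy'
  | 'periodization_compliance'
  | 'injury_respect';

interface EvalCase {
  id: string;
  request: Record<string, unknown>;  // workout generation input
  tags: string[];                    // e.g. ['full_body', 'injury_flag']
}

interface RatedResult {
  caseId: string;
  model: string;                     // 'gpt-5.4' | 'gpt-5.4-mini'
  scores: Record<Dimension, number>; // human rating, e.g. 1-5
}

// Mean per-dimension delta between baseline and candidate runs.
function qualityDelta(
  baseline: RatedResult[],
  candidate: RatedResult[],
  dim: Dimension,
): number {
  const mean = (rs: RatedResult[]) =>
    rs.reduce((sum, r) => sum + r.scores[dim], 0) / rs.length;
  return mean(baseline) - mean(candidate);
}
```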
We ran the full benchmark against GPT-5.4 as baseline, then against GPT-5.4-mini. Results: a 2% quality delta on structured tasks (exercise selection, volume calculation) — within measurement noise. On ambiguous or creative tasks (support chat edge cases, novel exercise suggestions), the delta widened to 8%.
The rollout followed a staged approach: 10% of traffic to the new routing for one week, monitoring quality metrics and user feedback daily. No regressions detected, so we moved to 50% for another week. Then 100%. Rollback criteria were explicit: any single quality dimension regressing more than 5%, or more than 3 user complaints per day attributable to workout quality, would trigger an immediate revert.
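Encoding the rollback criteria as data rather than tribal knowledge makes the revert decision mechanical. A sketch, with the thresholds taken from the criteria above (the structure itself is illustrative):

```typescript
// Rollback thresholds from the staged rollout described above.
const ROLLBACK_CRITERIA = {
  maxQualityRegression: 0.05, // any dimension dropping >5% vs baseline
  maxDailyComplaints: 3,      // quality-attributed complaints per day
};

function shouldRollback(
  dimensionDeltas: number[],  // per-dimension regression vs baseline
  complaintsToday: number,
): boolean {
  return (
    dimensionDeltas.some((d) => d > ROLLBACK_CRITERIA.maxQualityRegression) ||
    complaintsToday > ROLLBACK_CRITERIA.maxDailyComplaints
  );
}
```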
We never triggered the rollback.
What We Lost (and What We Didn't)
Honesty matters more than marketing here. GPT-5.4-mini is not GPT-5.4 in a smaller package. There are real differences, and pretending otherwise would set you up for surprises.
Where mini struggles:
- Ambiguous instructions — When a support chat user asks something vague like “my shoulder hurts, what should I do?” the full model produces more nuanced, contextually aware responses. Mini tends toward generic advice.
- Creative naming and variety — When generating exercise names or suggesting novel movement patterns, mini draws from a narrower repertoire. Over many sessions, users on the cheaper model see slightly less exercise variety.
- Complex multi-constraint reasoning — Full-body sessions for users with 3+ injury flags require juggling many constraints simultaneously. Mini occasionally drops a constraint or produces suboptimal volume distribution across muscle groups.
Where mini excels:
- Structured selection from constrained options — “Pick 4 from this list of 12, satisfying these rules” — identical performance to the full model.
- Following exact JSON schemas — Zero parsing failures on structured output, same as the full model.
- Mathematical relationships — Volume calculations, set/rep math, rest period scaling — mini is precise.
We kept GPT-5.4 for SupportChat (too much ambiguity in freeform conversation to risk quality) and for complex ExerciseSelector calls (full-body sessions, users with critical insights). The 2% quality delta on structured tasks is real but invisible to users — it manifests as slightly less creative exercise variety, not as incorrect programming or missed injury constraints. No user has ever reported noticing it.
The Numbers: Before and After
Cost Breakdown: Before vs After
| Agent | Before (GPT-5.4) | After (Optimized) | Savings |
|---|---|---|---|
| ExerciseSelector | $480/mo | $210/mo | 56% |
| SplitPlanner | $240/mo | $110/mo | 54% |
| InsightsGenerator | $180/mo | $85/mo | 53% |
| SupportChat | $120/mo | $95/mo | 21% |
| Validation | $96/mo | $30/mo | 69% |
| Other | $84/mo | $20/mo | 76% |
| Total | $1,200/mo | $550/mo | 54% |
Per-user cost dropped from ~$0.04 to ~$0.018 per workout generation. For the power user who was costing us $47/month, the same usage pattern now costs ~$21 — still above the subscription price, but within a sustainable range when averaged across the user base. The overall unit economics flipped from concerning to healthy.
Recommendations for Your AI App
If you're running AI agents in production and the bill is growing faster than your revenue, here's what we'd recommend based on our experience:
- Profile first. Know which agents cost what before you optimize anything. We were surprised that ExerciseSelector was 40% of our bill — we'd assumed SupportChat was the biggest cost center because it handles the most visible user interactions.
- Start with model routing. It's the highest-impact, lowest-risk change. You don't need to modify prompts, schemas, or application logic — just add a routing layer that picks the model based on request complexity. Our 45% hit rate on “simple” classification was conservative; your mileage may be higher.
- Restructure prompts for caching even if you're not switching models. Prefix caching is free money. Reorder your prompts to put static content first, and you'll see immediate savings on input costs regardless of which model you use.
- Constrain outputs. Every token you don't generate is free. Use structured JSON schemas, cap string lengths, and eliminate verbose rationale fields that no user ever reads.
- Build an eval benchmark before migrating. You need objective quality measurement, not vibes. 200 representative requests with human-rated scores took us two days to build and saved us from shipping regressions we wouldn't have caught otherwise.
- Roll out gradually. 10% → 50% → 100% with clear rollback criteria at each stage. The cost of patience is low; the cost of a quality regression reaching all users is high.
- Keep the best model for genuinely ambiguous tasks. Freeform chat, creative generation, and complex multi-constraint reasoning are where model quality actually matters. Don't optimize the 10% of requests that need the most intelligence.
- Monitor continuously. Costs creep back up as features grow. We review our per-agent cost breakdown monthly and re-evaluate routing thresholds quarterly.
All costs referenced are based on OpenAI's published pricing as of March 2026. Your mileage will vary based on prompt length, output complexity, and usage patterns. See our developer docs for more on Arvo's architecture.