From GPT-4o to Structured Outputs: A Migration Playbook

A practical playbook for migrating AI features from free-form GPT-4o responses to structured outputs with newer models. Schema design, validation, retry patterns, and the quality tradeoffs we navigated.

Arvo Team
11 min read
June 2026
Engineering · AI · Migration

How do you migrate from GPT-4o to structured outputs?

Define strict JSON schemas (we use Zod), restructure prompts to match the schema's expectations, add output validation with typed retry loops, and roll out gradually (10% → 50% → 100%). The migration cut our output parsing errors from 12% to under 1% and reduced output token usage by 35% — but required careful schema design to avoid constraining the model's useful reasoning.

TL;DR

  • Migrating from free-form to structured outputs cut our parsing errors from 12% to under 1% and reduced output tokens by 35%.
  • Schema design is the hardest part: too strict and you lose useful model reasoning; too loose and you get garbage. Start strict, relax where needed.
  • The validation retry pattern — validate → generate specific feedback → retry with feedback — achieves 92% first-attempt success and 99.5% within 3 attempts.
  • Biggest gotcha: structured outputs can reduce creative quality. The model optimizes for schema compliance over output quality if the schema is overly constraining.
  • Zod → JSON Schema conversion is the cleanest DX for TypeScript projects. Define once, validate everywhere.

The Free-Form Problem

Before structured outputs, generating workout data with GPT-4o looked like this: send a detailed system prompt, get back a blob of text, and hope it contained valid JSON. We used regex to strip markdown fences, JSON.parse to deserialize, and a prayer to make sure the resulting object matched our expected shape.

It worked — 88% of the time. The other 12% was a graveyard of parsing failures: the model wrapping JSON in conversational text (“Sure! Here's your workout:”), returning arrays where we expected objects, omitting required fields, or inventing new ones. Each failure meant a retry — more latency, more tokens burned, and a worse user experience for someone standing in the gym waiting for their next set.

// The old way: pray and parse
const response = await openai.chat.completions.create({
  model: 'gpt-4o',
  messages: [{ role: 'system', content: prompt }],
});

const text = response.choices[0].message.content ?? '';
// Sometimes the model wraps JSON in ```json blocks
const cleaned = text.replace(/```json\n?/g, '').replace(/```\n?/g, '');
try {
  const data = JSON.parse(cleaned);
  // Hope it matches our expected shape...
} catch {
  // 12% of the time, we end up here
  retryCount++;
}

The 12% failure rate was bad enough on its own. But the downstream effects were worse: every retry doubled the latency for that user, inflated our OpenAI bill, and created cascading issues in our multi-agent pipeline where one agent's output feeds the next. A malformed exercise list from the ExerciseSelector meant the validation agent would reject the entire workout, triggering a full re-generation.

What Structured Outputs Actually Give You

OpenAI's structured outputs let you pass a JSON schema alongside your request. The model is constrained to only produce tokens that result in valid JSON matching that schema. No more parsing gymnastics — the response is always valid JSON in the exact shape you specified.

// The new way: schema-guaranteed output
const response = await openai.chat.completions.create({
  model: 'gpt-5-mini',
  messages: [{ role: 'system', content: prompt }],
  response_format: {
    type: 'json_schema',
    json_schema: {
      name: 'workout_exercises',
      strict: true,
      schema: exerciseJsonSchema,
    },
  },
});

// Always valid JSON, always matches schema
const data = JSON.parse(response.choices[0].message.content!);
// JSON.parse still returns `any`; run it through the Zod schema
// (e.g. workoutSchema.parse(data)) for compile-time types

The difference was immediate: parsing errors dropped from 12% to under 1%. The remaining failures are network timeouts and rate limits — not format issues. We no longer need the regex cleanup, the try/catch dance, or the retry-on-parse-error logic. The model simply cannot produce malformed output.

There's a subtle but important shift here: with free-form outputs, we were parsing text and hoping for structure. With structured outputs, we're parsing guaranteed JSON and validating semantics. The failure mode moved from “is this even JSON?” to “are these values reasonable?” — a much better place to be.

Schema Design: The Art of Constraint

The hardest part of structured outputs isn't the API call. It's designing the schema. Too strict and the model can't express useful nuance — it fills in technically valid but semantically empty values just to satisfy the schema. Too loose and you get the same garbage data you were trying to escape, just wrapped in valid JSON.

Here's the schema we landed on for exercise selection, defined in Zod and converted to JSON Schema:

const exerciseSchema = z.object({
  name: z.string().describe('Exercise name in English'),
  muscleGroup: z.enum(['chest', 'back', 'shoulders',
    'biceps', 'triceps', 'quads', 'hamstrings',
    'glutes', 'calves', 'abs', 'forearms']),
  sets: z.number().int().min(1).max(8),
  reps: z.string().describe('Rep range like "8-12" or duration like "30s"'),
  rir: z.number().int().min(0).max(5),
  restSeconds: z.number().int().min(30).max(300),
  rationale: z.string().max(100)
    .describe('Why this exercise was chosen, max 20 words'),
  technique: z.enum([
    'standard', 'drop_set', 'rest_pause',
    'myo_reps', 'superset'
  ]).optional(),
});

const workoutSchema = z.object({
  exercises: z.array(exerciseSchema).min(3).max(10),
  totalSets: z.number().int(),
  estimatedDuration: z.number().int()
    .describe('Minutes'),
  sessionNotes: z.string().max(200).optional(),
});

Every field in this schema represents a design decision we iterated on. The ones that matter most:

  • muscleGroup as enum vs. free string. Enum prevents the model from inventing categories (“Chest” vs. “chest” vs. “pecs” vs. “upper chest”). But it forces you to commit to a taxonomy. We chose a flat 11-value enum that maps directly to our database schema. If the model thinks an exercise hits multiple muscle groups, it picks the primary one — we handle secondary muscles separately.
  • rationale capped at 100 characters. Without this cap, the model generates 200+ word justifications for every exercise. Those extra tokens cost money and add no value to the end user. Capping at 100 chars (with a “max 20 words” instruction in the description) keeps the model concise. This single change cut output tokens by roughly 15% across all exercise selection calls.
  • technique as optional. Not every exercise needs an advanced technique. Making it required forced the model to output “standard” on 70% of exercises — meaningless noise. Making it optional lets the model omit it entirely when standard form is appropriate, which is most of the time.
  • reps as string, not number. This accommodates both “8-12” ranges and “30s” time-based sets. A number field would force us to split into min/max/unit fields, over-complicating the schema for a value that's ultimately displayed as-is to the user.
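The reps-as-string decision means any downstream code that needs numbers (volume analytics, timers) has to interpret the string. A minimal sketch of that interpretation; `parseReps` and the `RepSpec` shape are hypothetical helpers, not from Arvo's codebase:

```typescript
// Hypothetical helper: interpret the `reps` string as a rep range
// like "8-12", a single count like "10", or a duration like "30s".
type RepSpec =
  | { kind: 'range'; min: number; max: number }
  | { kind: 'count'; reps: number }
  | { kind: 'duration'; seconds: number };

function parseReps(reps: string): RepSpec | null {
  const range = reps.match(/^(\d+)-(\d+)$/);
  if (range) return { kind: 'range', min: Number(range[1]), max: Number(range[2]) };

  const duration = reps.match(/^(\d+)s$/);
  if (duration) return { kind: 'duration', seconds: Number(duration[1]) };

  const count = reps.match(/^(\d+)$/);
  if (count) return { kind: 'count', reps: Number(count[1]) };

  // Unknown format: surface it to semantic validation rather than guess.
  return null;
}
```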

The Three-Layer Validation Pattern

Structured outputs guarantee the shape of the data, but not its meaning. The model will always return valid JSON with the right types — but it might return 15 sets of bench press for a “light recovery” session. You need validation beyond the schema.

We use three layers:

Layer 1: Schema validation. Handled entirely by OpenAI's structured output mode with strict: true. The response always passes schema validation. We don't even check — the API guarantees it.

Layer 2: Semantic validation. Are the values reasonable? We check things like: is the total set count within ±20% of the target volume? Are rest periods appropriate for the exercise type (compound movements get 120-180s, isolation gets 60-90s)? Is the estimated duration plausible given the exercise count and rest periods?

Layer 3: Domain validation. Does the output satisfy business rules? No exercises that conflict with the user's injury flags. No duplicate exercises. Volume per muscle group matches the periodization phase requirements. Advanced techniques only appear when the user's experience level supports them.
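The Layer 2 checks described above can be sketched as a plain validator. The `Exercise` shape is simplified, and `isCompound` is an assumption (some upstream tag distinguishing compound from isolation movements); the thresholds are the ones stated in the text:

```typescript
interface Exercise {
  name: string;
  sets: number;
  restSeconds: number;
  isCompound: boolean; // assumed upstream tag, not part of the LLM schema
}

// Sketch of Layer 2: total sets within ±20% of target, rest periods
// appropriate for the movement type (compound 120-180s, isolation 60-90s).
function validateSemantics(
  exercises: Exercise[],
  targetSets: number,
): { valid: boolean; feedback: string } {
  const issues: string[] = [];

  const totalSets = exercises.reduce((sum, ex) => sum + ex.sets, 0);
  if (totalSets < targetSets * 0.8 || totalSets > targetSets * 1.2) {
    issues.push(`total sets ${totalSets}, target ${targetSets} (±20%)`);
  }

  for (const ex of exercises) {
    const [lo, hi] = ex.isCompound ? [120, 180] : [60, 90];
    if (ex.restSeconds < lo || ex.restSeconds > hi) {
      issues.push(`${ex.name}: rest ${ex.restSeconds}s outside ${lo}-${hi}s`);
    }
  }

  return { valid: issues.length === 0, feedback: issues.join('; ') };
}
```

The specific, sentence-shaped `feedback` string is what gets fed back to the model on retry, which is what makes the loop below effective.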

When validation fails, we don't just retry blindly. We feed the specific failure reason back into the prompt:

async function generateWithValidation<T>(
  prompt: string,
  schema: z.ZodType<T>,
  validate: (data: T) => { valid: boolean; feedback: string },
  maxAttempts = 3
): Promise<T> {
  let lastFeedback = '';

  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const fullPrompt = lastFeedback
      ? `${prompt}\n\n⚠️ PREVIOUS ATTEMPT FAILED:\n${lastFeedback}\nFix the specific issues above.`
      : prompt;

    const result = await callWithSchema(fullPrompt, schema);
    const validation = validate(result);

    if (validation.valid) return result;
    lastFeedback = validation.feedback;
  }

  throw new Error(`Failed after ${maxAttempts} attempts: ${lastFeedback}`);
}

The feedback is specific: not “invalid output” but “chest volume was 8 sets, target is 12-14 sets” or “barbell row conflicts with user's lower back injury flag.” This targeted feedback is what makes the retry loop effective. The model can fix exactly what's wrong without re-generating everything from scratch.

Success rates with this pattern: 92% first attempt, 99.1% second attempt, 99.5% within 3 attempts. The remaining 0.5% are genuinely ambiguous edge cases — conflicting constraints where the user's injury flags, periodization phase, and available equipment make it impossible to satisfy all rules simultaneously. Those get flagged for human review.

The Quality Tradeoff Nobody Warns You About

Here's the thing nobody tells you in the docs: structured outputs can reduce creative quality. When the model knows it must produce exactly { name: string, sets: number, ... }, it optimizes for compliance over insight.

In free-form mode, GPT-4o might volunteer something like: “Consider supersetting bicep curls with tricep pushdowns to save time — your session is running long and these muscle groups don't interfere.” In structured output mode, that insight has nowhere to go. The model fills in the sessionNotes field if you're lucky, but more often it just outputs the exercises without the contextual reasoning.

We found this tradeoff isn't uniform. For some tasks, structured output is strictly better. For others, it sacrifices exactly the kind of output you want.

Structured vs Free-Form Output Quality

| Aspect | Structured | Free-Form |
|---|---|---|
| Data accuracy | Guaranteed valid | May have format errors |
| Token efficiency | 35% fewer tokens | Verbose explanations |
| Parsing reliability | 99.5% success | 88% success |
| Creative suggestions | Limited by schema | Unconstrained |
| Debugging | Clear error feedback | Opaque failures |
| Cost | Lower (fewer tokens + retries) | Higher |

Our approach: use structured outputs for the core data — exercise selection, set/rep assignment, load calculations — where correctness matters more than creativity. For supplementary commentary like session insights or training tips, we either use an optional free-text field within the schema or make a separate unstructured call. Two API calls sounds wasteful, but the combined cost is still lower than the old free-form-with-retries approach because the structured call uses 35% fewer output tokens and almost never retries.

The Migration Checklist

If you're considering the same migration, here's the step-by-step playbook that worked for us:

  1. Audit your current parsing failures. What percentage of your LLM calls fail to parse? If it's under 2%, the migration might not be worth the effort. Ours was 12% — that made it a clear win.
  2. Define your schema in Zod (or your language's equivalent). Start strict — you can always relax constraints later. It's much harder to tighten a schema after your code depends on the loose version.
  3. Build the validation layer before migrating. Run your semantic and domain validators against your existing successful outputs. If your validators reject more than 5% of currently-working outputs, your constraints are too tight.
  4. Set up A/B testing. Route 10% of traffic to structured outputs and compare quality scores, latency, and token usage against the free-form baseline.
  5. Monitor output token usage. You should see a 25-40% reduction. If you don't, your schema probably has too many optional free-text fields that the model is filling verbosely.
  6. Roll out gradually: 10% → 50% → 100% over 3 weeks. Watch error rates and user feedback at each stage.
  7. Keep free-form as a fallback for the first month. If structured output fails (network timeout, schema rejection), fall back to the old parsing path. Remove it only after 30 days of stable structured output.
  8. Delete the old parsing code. Once you're confident, remove the regex cleanup, the try/catch retry logic, and the free-form fallback. Dead code is a maintenance burden and a false sense of safety.
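The percentage rollout in steps 4 and 6 needs deterministic bucketing, so a given user doesn't flip between code paths from one request to the next. A minimal sketch (hypothetical gate, not Arvo's actual implementation):

```typescript
import { createHash } from 'node:crypto';

// Hypothetical rollout gate: hash the user id into a stable 0-99 bucket,
// so the same user stays in the same cohort as the percentage grows
// from 10 to 50 to 100.
function inStructuredCohort(userId: string, rolloutPercent: number): boolean {
  const digest = createHash('sha256').update(userId).digest();
  const bucket = digest.readUInt32BE(0) % 100;
  return bucket < rolloutPercent;
}
```

Because buckets are stable, raising the percentage only ever moves users from the old path to the new one, never back, which keeps the A/B comparison in step 4 clean.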

All code examples are simplified from Arvo's production codebase. OpenAI API features referenced are current as of June 2026. See our developer docs for more on Arvo's architecture and the cost reduction post for the broader optimization strategy.