Generating Synthetic Data for LLM Evaluation

Synthetic data generation is not about creating random test cases. It's about systematically surfacing specific failure modes in your LLM application.

Start with Real Usage, Not Synthetic Data

Before generating any synthetic data, use your application yourself. Try different scenarios, edge cases, and realistic workflows. If you can't use it extensively, recruit 2-3 people to test it while you observe their interactions.

Generate Data to Test Specific Hypotheses

Create synthetic data only when you have a clear hypothesis about your application's failure modes. Synthetic data is most valuable for failure modes that:

  • Require systematic testing across many variations to understand the pattern
  • Occur infrequently in natural usage but have high impact when they do occur
  • Involve complex interactions between multiple system components

Structure Generation with Dimensions

When real user data is sparse, use structured generation rather than asking an LLM for "random queries."

Define 3-4 key dimensions that represent where your application is likely to fail. For a recipe bot:

  • Recipe Type: Main dish, dessert, snack, side dish
  • User Persona: Beginner cook, busy parent, fitness enthusiast, professional chef
  • Constraint Complexity: Single constraint, multiple constraints, conflicting constraints

Create tuples by combining one value from each dimension (a minimal sampling sketch follows these examples):

  • (Main dish, Beginner cook, Single constraint)
  • (Dessert, Busy parent, Multiple constraints)
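
To make the tuple step concrete, here is a minimal sketch in Python using the recipe-bot dimensions above. The dimension names, values, and sampling strategy are illustrative assumptions, not a prescribed implementation:

```python
import itertools
import random

# Dimensions from the recipe-bot example above. Substitute the
# failure dimensions of your own application.
DIMENSIONS = {
    "recipe_type": ["main dish", "dessert", "snack", "side dish"],
    "user_persona": ["beginner cook", "busy parent", "fitness enthusiast", "professional chef"],
    "constraint_complexity": ["single constraint", "multiple constraints", "conflicting constraints"],
}

def sample_tuples(n: int, seed: int = 0) -> list[tuple[str, str, str]]:
    """Sample n distinct (recipe_type, persona, complexity) tuples
    from the full cross product of the dimensions."""
    all_tuples = list(itertools.product(*DIMENSIONS.values()))
    rng = random.Random(seed)
    return rng.sample(all_tuples, k=min(n, len(all_tuples)))

if __name__ == "__main__":
    for t in sample_tuples(5):
        print(t)
```

Sampling from the cross product, rather than enumerating it exhaustively, keeps the tuple set small enough to prioritize by hand while still spreading coverage across all three dimensions.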

Generate queries from tuples using a second LLM call. This two-step process produces more diverse, realistic queries than single-step generation.
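A minimal sketch of that second call, assuming the OpenAI Python SDK (any LLM client works the same way); the prompt wording and model name are placeholders:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

PROMPT_TEMPLATE = """You are simulating a user of a recipe bot.
Write one realistic query from a {persona} asking for a {recipe_type}
with {complexity}. Return only the query text."""

def query_from_tuple(recipe_type: str, persona: str, complexity: str) -> str:
    """Second LLM call: turn one structured tuple into a natural-language query."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder; use whichever model you prefer
        messages=[{
            "role": "user",
            "content": PROMPT_TEMPLATE.format(
                persona=persona, recipe_type=recipe_type, complexity=complexity
            ),
        }],
        temperature=1.0,  # nonzero temperature encourages varied phrasing
    )
    return response.choices[0].message.content.strip()
```

Calling this several times per tuple with a nonzero temperature yields varied phrasings of the same underlying scenario, which is where the two-step approach gains its diversity.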

Scale Based on Iteration Needs

Start with around 100 synthetic examples to achieve sufficient coverage and approach theoretical saturation—the point where additional examples reveal few new failure modes. Focus on:

  • High-impact failure modes: Problems that affect core user workflows
  • Coverage gaps: Scenarios underrepresented in your real usage data
  • Systematic failure patterns: Issues that require testing across multiple variations

Generate more examples only if initial testing reveals additional failure modes that need deeper exploration.

Practical Implementation

  1. Use your application extensively to build intuition about failure modes
  2. Define 3-4 dimensions based on observed or anticipated failures
  3. Create 5-10 structured tuples covering your priority failure scenarios
  4. Generate natural language queries from each tuple using a separate LLM call
  5. Scale to more examples across your most important failure hypotheses (we suggest at least ~100)
  6. Test and iterate on the most critical failure modes first (see the end-to-end sketch below)
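
Putting steps 3–5 together, here is a minimal end-to-end sketch. It assumes `sample_tuples` and `query_from_tuple` from the earlier sketches (the `synth` module name is hypothetical), and `run_app` is a stub standing in for your own application entry point:

```python
import json

# Hypothetical module collecting the two helpers sketched earlier.
from synth import sample_tuples, query_from_tuple

def run_app(query: str) -> str:
    """Stub for your application; replace with a real call."""
    return "(stub response; wire this to your application)"

def build_eval_set(n_tuples: int = 10, queries_per_tuple: int = 10,
                   path: str = "synthetic_eval.jsonl") -> None:
    """Roughly 100 examples: 10 tuples x 10 varied queries each,
    with the app's response recorded alongside each query for review."""
    with open(path, "w") as f:
        for recipe_type, persona, complexity in sample_tuples(n_tuples):
            for _ in range(queries_per_tuple):
                query = query_from_tuple(recipe_type, persona, complexity)
                record = {
                    "tuple": [recipe_type, persona, complexity],
                    "query": query,
                    "response": run_app(query),
                }
                f.write(json.dumps(record) + "\n")

if __name__ == "__main__":
    build_eval_set()
```

Writing the tuple alongside each query and response makes it easy to group failures by dimension later, which is how systematic patterns surface.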

The goal is targeted failure discovery, not comprehensive test coverage. Generate synthetic data strategically to accelerate iteration on problems that matter for your users.
