winklerj / ai_evals_synthetic_data_hw.py
Created May 27, 2025 19:35 — forked from skylarbpayne/ai_evals_synthetic_data_hw.py
AI Evals Synthetic Data Homework 1
"""Script for generating queries for a recipe search engine.
This script can be used to generate synthetic queries for a recipe search engine.
Following best practices from Hamel and Shreya's AI Evals course, we:
- Generate a set of dimensions that can be used to generate queries; these are attributes that significantly change what the query is about or how it is written.
- For each set of attributes ("Dimensions") we generate a query that matches those attributes.
To ensure that the synthetic generation is better aligned, we first try handwriting the queries using the --manual flag.
This gives us labeled examples to use few shot in our synthetic generation.
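# The generation loop itself is not shown in this preview; what follows is a
# minimal sketch of the dimensions-then-query approach described in the docstring.
# The Dimensions fields, MANUAL_EXAMPLES, and build_query_prompt are illustrative
# assumptions, not the gist's actual code.
import json
import random
from dataclasses import asdict, dataclass


@dataclass
class Dimensions:
    """Attributes that significantly change what a query is about or how it is written."""
    cuisine: str              # e.g. "thai", "italian"
    dietary_restriction: str  # e.g. "vegan", "gluten-free", "none"
    phrasing: str             # e.g. "keyword", "full question", "vague"


# Handwritten (--manual) examples, used as few-shot demonstrations.
MANUAL_EXAMPLES = [
    (Dimensions("italian", "vegan", "full question"),
     "What's a good vegan lasagna recipe that doesn't use cashew cheese?"),
    (Dimensions("thai", "none", "keyword"),
     "quick thai basil chicken"),
]


def sample_dimensions() -> Dimensions:
    """Sample one combination of attributes to condition query generation on."""
    return Dimensions(
        cuisine=random.choice(["italian", "thai", "mexican", "indian"]),
        dietary_restriction=random.choice(["none", "vegan", "gluten-free"]),
        phrasing=random.choice(["keyword", "full question", "vague"]),
    )


def build_query_prompt(dims: Dimensions) -> str:
    """Build an LLM prompt from few-shot (dimensions -> query) pairs plus the new dimensions."""
    shots = "\n\n".join(
        f"Dimensions: {json.dumps(asdict(d))}\nQuery: {q}" for d, q in MANUAL_EXAMPLES
    )
    return (
        "Write one realistic recipe search query matching the given dimensions.\n\n"
        f"{shots}\n\nDimensions: {json.dumps(asdict(dims))}\nQuery:"
    )


if __name__ == "__main__":
    # The real script would send this prompt to an LLM; here we just print it.
    print(build_query_prompt(sample_dimensions()))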

Generating Synthetic Data for LLM Evaluation

Synthetic data generation is not about creating random test cases. It's about systematically surfacing specific failure modes in your LLM application.
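Concretely, each suspected failure mode can be written down as an explicit hypothesis and mapped to the dimension values most likely to trigger it, so generation targets that hypothesis rather than random behavior. A minimal sketch, reusing the same kind of dimension attributes as the recipe-search script above (the hypothesis names, dimension keys, and helper are illustrative assumptions, not from the course material):

from itertools import product

# Illustrative failure-mode hypotheses, each mapped to the dimension values
# expected to trigger it; both the hypotheses and the keys are assumptions.
HYPOTHESES = {
    "drops dietary restrictions on multi-constraint queries":
        {"dietary_restriction": ["vegan", "gluten-free"], "phrasing": ["full question"]},
    "misranks terse keyword queries":
        {"phrasing": ["keyword", "vague"]},
}


def targeted_dimension_sets(overrides: dict[str, list[str]]) -> list[dict[str, str]]:
    """Expand one hypothesis's dimension overrides into concrete attribute combinations."""
    keys = list(overrides)
    return [dict(zip(keys, combo)) for combo in product(*overrides.values())]


if __name__ == "__main__":
    for hypothesis, overrides in HYPOTHESES.items():
        for dims in targeted_dimension_sets(overrides):
            print(f"[{hypothesis}] generate a query with {dims}")

Generating and labeling queries per hypothesis like this turns each suspected failure mode into a slice you can measure separately, rather than one undifferentiated pool of synthetic test cases.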

Start with Real Usage, Not Synthetic Data

Before generating any synthetic data, use your application yourself. Try different scenarios, edge cases, and realistic workflows. If you can't use it extensively, recruit 2-3 people to test it while you observe their interactions.

Generate Data to Test Specific Hypotheses

<TITLE>

Problem Statement

Requirements

Functional Requirements

Stevey's Google Platforms Rant

I was at Amazon for about six and a half years, and now I've been at Google for that long. One thing that struck me immediately about the two companies -- an impression that has been reinforced almost daily -- is that Amazon does everything wrong, and Google does everything right. Sure, it's a sweeping generalization, but a surprisingly accurate one. It's pretty crazy. There are probably a hundred or even two hundred different ways you can compare the two companies, and Google is superior in all but three of them, if I recall correctly. I actually did a spreadsheet at one point but Legal wouldn't let me show it to anyone, even though recruiting loved it.

I mean, just to give you a very brief taste: Amazon's recruiting process is fundamentally flawed by having teams hire for themselves, so their hiring bar is incredibly inconsistent across teams, despite various efforts they've made to level it out. And their operations are a mess; they don't real