
Stop Guessing at Prompts

Your eval set doesn't just tell you how your model is doing. It's the objective function you need to make it better, automatically.

In my last piece I argued that evals are the real moat: a golden dataset that tells you whether your LLM output is good enough. But once teams have that dataset, most do something bizarre: they go back to tweaking prompts by hand.

"Let me add 'think step by step.' Let me rephrase paragraph three. Let me try telling it to be a doctor."

This is not engineering. This is tinkering. And there are better options now.

The optimization problem you already know

If you've ever fit a logistic regression, you understand prompt optimization. You have an objective function (your eval metric). You have parameters you want to tune (the prompt text). And you want to find the parameter values that maximize your objective.

With logistic regression, you use gradient descent. With prompts, you can't compute a gradient on text. But you can do something almost as good: let an LLM read the failures, reason about what went wrong, and propose a better prompt.

That's it. That's the entire field.
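To make that loop concrete, here is a toy sketch. It is not any library's implementation: `evaluate`, `optimize`, `toy_run`, and `toy_propose` are hypothetical names, the "model" is a keyword matcher, and in a real setup `propose` would be an LLM call that reads the failures and rewrites the prompt.

```python
from typing import Callable

Example = tuple[str, str]  # (input text, expected label)


def evaluate(prompt: str, run: Callable[[str, str], str], data: list[Example]):
    """Return (accuracy, failures) for one prompt on a dataset."""
    failures = []
    for text, expected in data:
        got = run(prompt, text)
        if got != expected:
            failures.append((text, expected, got))
    return 1 - len(failures) / len(data), failures


def optimize(prompt, run, propose, data, steps=5):
    """Greedy loop: propose a revision from the failures, keep it only if the score improves."""
    best_score, failures = evaluate(prompt, run, data)
    for _ in range(steps):
        if not failures:
            break
        candidate = propose(prompt, failures)
        score, candidate_failures = evaluate(candidate, run, data)
        if score > best_score:
            prompt, best_score, failures = candidate, score, candidate_failures
    return prompt, best_score


def toy_run(prompt: str, text: str) -> str:
    # Stand-in "model": answers "no" if any keyword listed in the prompt appears in the text.
    keywords = [k for k in prompt.split(";") if k]
    return "no" if any(k in text for k in keywords) else "yes"


def toy_propose(prompt: str, failures) -> str:
    # Stand-in "reflection": learn one keyword from the first missed "no" example.
    for text, expected, got in failures:
        if expected == "no" and got == "yes":
            return prompt + ";" + text.split()[0]
    return prompt


data = [("phone consult", "no"), ("lab result", "yes"), ("email consult", "no")]
best_prompt, best_score = optimize("", toy_run, toy_propose, data)
```

The objective (`evaluate`) and the parameters (`prompt`) are exactly the two pieces from the logistic regression analogy. Only the update rule is different.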

GEPA: the optimizer that won

GEPA (Genetic-Pareto optimization) has become the default prompt optimizer to reach for. It was published as an ICLR 2026 Oral and now lives inside DSPy as dspy.GEPA.

Why it won: the paper reports that it outperforms a reinforcement learning baseline (GRPO) by up to 20% while using up to 35 times fewer rollouts, and that it beats the previous best prompt optimizer (MIPROv2) by 13% on aggregate. Better results, way cheaper. That combination tends to end debates.

What makes GEPA different is the Pareto frontier. Most optimizers maintain one "best" prompt and try to improve it. That sounds reasonable until you realize your inputs are heterogeneous. In my declarability classifier at Reg4U, a clinical note about a phone consult looks nothing like a lab result or a mental health intake. A single prompt can't be equally good at all of them.

GEPA maintains a frontier of candidate prompts, each specializing in different input types. In every iteration, it picks a candidate, runs it on a sample, reads the execution traces, and uses an LLM to reason about why specific predictions failed. Then it proposes a targeted improvement. It's not random mutation. It's reflective optimization.
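A rough sketch of the frontier idea, under heavy simplification: keep every candidate that is best (or tied for best) on at least one example, then sample from that set in proportion to how many examples each one wins. The function names, prompt names, and score table are made up for illustration, and GEPA's real selection logic has more to it than this.

```python
import random


def pareto_frontier(scores: dict[str, list[float]]) -> dict[str, int]:
    """Map each frontier candidate to the number of examples it wins (ties count)."""
    n_examples = len(next(iter(scores.values())))
    wins: dict[str, int] = {}
    for i in range(n_examples):
        best = max(per_example[i] for per_example in scores.values())
        for name, per_example in scores.items():
            if per_example[i] == best:
                wins[name] = wins.get(name, 0) + 1
    return wins


def pick_candidate(wins: dict[str, int], rng: random.Random) -> str:
    """Sample the next candidate to mutate, weighted by how many examples it wins."""
    names = sorted(wins)
    return rng.choices(names, weights=[wins[n] for n in names], k=1)[0]


# Hypothetical per-example scores for three candidate prompts on four validation notes.
scores = {
    "prompt_a": [1.0, 1.0, 0.0, 0.0],  # strong on the first two notes
    "prompt_b": [0.0, 0.0, 1.0, 1.0],  # strong on the last two notes
    "prompt_c": [0.5, 0.0, 0.5, 0.0],  # never the best on any note
}
frontier = pareto_frontier(scores)
next_candidate = pick_candidate(frontier, random.Random(0))
```

Note that `prompt_a` and `prompt_b` have the same average score. A keep-the-single-best optimizer would throw one of them away, along with everything it learned about its half of the inputs.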

The other options

MIPROv2 (paper) is the alternative worth knowing. It searches over both instructions and few-shot demonstrations using Bayesian optimization. Choose MIPROv2 when examples matter. Clinical text classification, for instance, benefits from showing the model what correct output looks like. In practice, many teams run MIPROv2 first (fast, demo-driven) and then GEPA on top for the cases MIPROv2 misses.

TextGrad (paper) introduced the concept of "textual gradients": using an LLM to generate feedback that functions like a gradient. Conceptually elegant, but in head-to-heads it's been outclassed by GEPA on heterogeneous tasks. TextGrad still works when your inputs are uniform. For most real-world problems, they're not.

What actually matters in practice

Feed the optimizer text, not just numbers. This is the single biggest lever. GEPA's reflective model needs to understand why predictions fail. Don't just return a 0 or 1 from your eval metric. Return a short explanation: "predicted declarabel but the note says 'telefonisch overleg met collega,' which is a niet-declarabel marker." (In English: predicted billable, but the note mentions a phone consultation with a colleague, which marks it as non-billable.) That explanation flows into the reflection step and dramatically improves sample efficiency.
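Here is a minimal sketch of a metric that returns both a score and an explanation. `MetricResult`, `MARKERS`, and `metric_with_feedback` are hypothetical names, and the marker table is a stand-in for real domain rules. DSPy's GEPA documentation describes metrics that return a score together with feedback text, but check the current docs for the exact signature before wiring this in.

```python
from dataclasses import dataclass


@dataclass
class MetricResult:
    score: float
    feedback: str


# Hypothetical marker phrases and the label each one implies.
MARKERS = {"telefonisch overleg met collega": "niet-declarabel"}


def metric_with_feedback(note: str, expected: str, predicted: str) -> MetricResult:
    """Score a prediction and say why it failed, so a reflection step has something to read."""
    if predicted == expected:
        return MetricResult(1.0, f"Correct: {expected}.")
    reasons = [
        f"the note says '{phrase}', which is a {label} marker"
        for phrase, label in MARKERS.items()
        if phrase in note.lower() and label == expected
    ]
    why = "; ".join(reasons) or "no known marker matched, inspect the note manually"
    return MetricResult(0.0, f"Predicted {predicted} but expected {expected}: {why}.")


result = metric_with_feedback(
    "Telefonisch overleg met collega over medicatie.",
    expected="niet-declarabel",
    predicted="declarabel",
)
```

The score still drives selection. The feedback string is what the reflecting LLM actually learns from.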

Separate the what from the how. Define what your module should do separately from how it does it. When your specification and implementation are decoupled, swapping GEPA for MIPROv2, or re-optimizing when a new model drops, is a configuration change, not a rewrite.
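One way to picture the decoupling, as a plain-Python sketch: the spec holds the "what", and the strategy that turns it into a prompt is looked up from configuration. Everything here (`TaskSpec`, `OPTIMIZERS`, `build_prompt`, the two strategies) is hypothetical, and the strategies are trivial stand-ins for real optimizers. DSPy expresses the same split with signatures and modules.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class TaskSpec:
    """The 'what': intent, input, and allowed outputs. No prompt wording lives here."""
    description: str
    input_field: str
    labels: tuple[str, ...]


# The 'how': anything that turns a spec into a prompt.
Optimizer = Callable[[TaskSpec], str]


def baseline(spec: TaskSpec) -> str:
    labels = ", ".join(spec.labels)
    return f"{spec.description}\nRead the {spec.input_field} and answer with one of: {labels}."


def with_reasoning(spec: TaskSpec) -> str:
    return baseline(spec) + "\nExplain your reasoning before giving the final label."


OPTIMIZERS: dict[str, Optimizer] = {"baseline": baseline, "with_reasoning": with_reasoning}


def build_prompt(spec: TaskSpec, config: dict[str, str]) -> str:
    """Swapping strategies is a config change; the spec is untouched."""
    return OPTIMIZERS[config["optimizer"]](spec)


spec = TaskSpec(
    description="Classify whether a clinical note is billable.",
    input_field="clinical note",
    labels=("declarabel", "niet-declarabel"),
)
prompt_v1 = build_prompt(spec, {"optimizer": "baseline"})
prompt_v2 = build_prompt(spec, {"optimizer": "with_reasoning"})
```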

Watch your train/val/test split. Prompt optimizers overfit. GEPA's Pareto frontier mitigates this somewhat, but the failure mode where an optimizer memorizes quirks of your training set is real. It's especially easy to miss because your metric keeps going up during optimization. Hold out a true test set and only evaluate on it after you've committed to a prompt.
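A minimal sketch of a reproducible three-way split, assuming your examples fit in a list. The function name and the 60/20/20 proportions are illustrative choices, not a recommendation from any library.

```python
import random


def split_dataset(examples, seed=0, train_pct=60, val_pct=20):
    """Shuffle once with a fixed seed, then cut into train / val / test."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_train = len(shuffled) * train_pct // 100
    n_val = len(shuffled) * val_pct // 100
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # touch this only after committing to a prompt
    return train, val, test


examples = [f"note_{i}" for i in range(50)]
train_set, val_set, test_set = split_dataset(examples)
```

The fixed seed matters more than it looks: if the split changes between runs, test examples leak into optimization without anyone noticing.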

Don't over-optimize cheap models. Sometimes the answer is Claude Opus with a basic prompt, not Haiku with 50 hours of optimization. Run both as baselines before investing in optimization. The goal is the best outcome for the cost, not the most sophisticated prompt.

The takeaway

If you have evals (and after the last article, you should), you already have everything you need to stop guessing. Plug your eval set into an optimizer, let it run, and compare the result to your handwritten prompt on a held-out test set.

The prompt you spent hours crafting? The optimizer will probably beat it in twenty minutes. And unlike your intuition, it gets better every time you add data.
