← all writing

Evals are the product. Everything else is a side effect.

After two years shipping LLM features in production, the only artifact worth trusting is a well-designed eval suite. Here is how Cuecoder structures them.

Two years ago Cuecoder shipped its first LLM feature. A week of prompt tuning, a Friday demo, and on Monday a customer asked why it was confidently wrong. There was no answer. There were no tests. There was no model of what “good” looked like for the system.

That moment is when evals stopped being a thing to do after shipping and started being the product itself. Everything else — the prompt, the retrieval strategy, the model choice, the UX — is a knob you turn against the eval. If the eval is bad, every other decision is taste.

The eval is the spec

In traditional software, you have a spec and you have tests. The spec describes what the system should do; the tests describe what it does today. For AI systems, the eval suite collapses both into a single artifact. There is no other source of truth.

That sounds dramatic but it’s the only framing that has survived contact with production. Specs without evals are vibes. Code without evals is a slot machine.

“Specs without evals are vibes. Code without evals is a slot machine.”

Three layers in every production suite

Every production eval suite Cuecoder runs has the same three layers:

  • Unit evals — small, fast, deterministic. Run on every PR.
  • Trajectory evals — multi-step agent traces. Run nightly.
  • Drift evals — production samples graded against historical baselines. Run continuously.

What a unit eval looks like

from cue import eval, score

@eval("pricing-agent.classify")
def test_classify_simple(model):
    out = model("How much is the enterprise plan?")
    assert score.contains(out, "enterprise")
    assert score.no_hallucinated_price(out)
    assert score.latency(out) < 800

Notice there’s no exact-string assertion. The assertions are properties of the output — it mentions the right plan, it doesn’t invent a price, it’s fast enough. The eval encodes what “correct” means in a way that survives prompt edits and model swaps.

Where this gets you

Once the eval suite is the product, the workflow inverts. You don’t ship a prompt change to see if it feels better. You ship a prompt change to see what the eval delta is. The eval delta is the answer. Everything else is interpretation.

Cuecoder is building Axon precisely to make this loop tighter for other teams. But you don’t need Axon to start — you need a one-page Python file, a CSV of inputs, and 30 minutes of discipline. Do it before you ship. Do it especially when you don’t want to.


If this resonated, Cuecoder sends one essay like this every Tuesday. Subscribe to The Cue.

← back to writing Subscribe to The Cue →