There’s a point in every LLM project where you’ve squeezed everything out of the prompt. You’ve tried CoT, few-shot examples, XML tags, role-playing, constitutional constraints. The eval needle moved a little. Now it’s stuck.
This is the plateau. Most teams read it as a model limitation. It isn’t. It’s a retrieval limitation.
The prompt is just the container
A prompt is a vessel. What matters is what you put in it. The model’s knowledge is fixed — you can’t change it by rephrasing your instructions. What you can change is the context you ship with each call.
Context means: retrieved documents, tool outputs, structured state, dynamic memory. These are the actual levers. Prompt engineering is just shaping the container around them.
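To make the split between container and contents concrete, here is a minimal sketch of a per-call context object. The class name `CallContext`, its fields, and the section labels are all hypothetical, chosen only to mirror the four levers listed above; a real system would likely add token budgeting and truncation.

```python
from dataclasses import dataclass, field


@dataclass
class CallContext:
    """Everything shipped with one model call (hypothetical structure)."""
    instructions: str
    retrieved_docs: list[str] = field(default_factory=list)
    tool_outputs: dict[str, str] = field(default_factory=dict)
    state: dict[str, str] = field(default_factory=dict)
    memories: list[str] = field(default_factory=list)

    def render(self) -> str:
        """Assemble the prompt text; empty sections are omitted."""
        sections = [self.instructions]
        if self.retrieved_docs:
            docs = "\n".join(
                f"[{i}] {doc}" for i, doc in enumerate(self.retrieved_docs, 1)
            )
            sections.append("Retrieved documents:\n" + docs)
        if self.tool_outputs:
            outs = "\n".join(
                f"{name}: {out}" for name, out in self.tool_outputs.items()
            )
            sections.append("Tool outputs:\n" + outs)
        if self.state:
            pairs = "\n".join(f"{key} = {value}" for key, value in self.state.items())
            sections.append("State:\n" + pairs)
        if self.memories:
            sections.append("Memory:\n" + "\n".join(f"- {m}" for m in self.memories))
        return "\n\n".join(sections)


ctx = CallContext(
    instructions="Answer using only the material below.",
    retrieved_docs=["Refunds are processed within 5 business days."],
    state={"user_plan": "pro"},
)
prompt = ctx.render()
```

The instructions stay the same from call to call. Everything else in the object changes per request, and that is the part worth engineering.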
Retrieval is where the work is
The teams that ship the best LLM features have one thing in common: they have obsessed over retrieval quality. Not model size. Not prompt phrasing. Retrieval.
This means:
- Chunking strategy — how you split documents matters enormously. Semantic chunking, which splits on topic or section boundaries, usually beats fixed-size chunking, which can cut an idea in half mid-sentence.
- Hybrid search — dense + sparse retrieval captures both semantic and keyword matches. Dropping either loses coverage.
- Reranking — a cross-encoder reranker applied to the top-k results from hybrid search usually improves relevance. The cost is an extra model pass over each candidate, and it is usually worth it.
- Context assembly — the order and framing of retrieved chunks in the prompt is itself a design decision.
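As one illustration of the hybrid search and context assembly steps, the sketch below fuses a dense ranking and a sparse ranking with reciprocal rank fusion (RRF). RRF is a common fusion method, used here as an assumption rather than as the approach any particular team takes. The chunk IDs, the sample rankings, and the function names are hypothetical, and the reranking step is left out to keep the example self-contained.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked lists of chunk IDs; each list is ordered best-first.

    k=60 is a commonly used default that damps the influence of top ranks.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Sort by fused score, breaking ties by ID so the output is deterministic.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


def assemble_context(chunk_ids: list[str], chunks: dict[str, str], top_n: int) -> str:
    """Keep the top_n fused chunks, labelled with their IDs, best first."""
    return "\n\n".join(f"[{cid}] {chunks[cid]}" for cid in chunk_ids[:top_n])


# Hypothetical corpus and rankings; in practice these come from an
# embedding index (dense) and a keyword index such as BM25 (sparse).
chunks = {
    "a": "Rate limits are 100 requests per minute.",
    "b": "Error 429 means the rate limit was exceeded.",
    "c": "Billing is monthly.",
    "d": "Use exponential backoff when throttled.",
}
dense_ranking = ["d", "a", "b"]
sparse_ranking = ["b", "a", "c"]

fused = reciprocal_rank_fusion([dense_ranking, sparse_ranking])
context = assemble_context(fused, chunks, top_n=3)
print(fused)  # ['b', 'a', 'd', 'c']
```

Chunks that appear in both rankings ("a" and "b") rise above chunks that only one retriever found, which is the coverage argument for hybrid search in miniature. Placing the best chunk first is only one possible ordering; it is worth testing alternatives against your own evals.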
Structured tools > implicit knowledge
When you’re asking the model to know things rather than look them up, you’ve already lost. The model’s implicit knowledge has a training cutoff, is uneven across domains, and can’t be updated without retraining.
Tools are better. A get_current_price() tool is more reliable than asking the model to recall pricing. A search_documentation() tool is more reliable than hoping the model memorized your API docs.
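A minimal sketch of what that looks like in code, assuming a simple name-to-function registry. The tool names come from the paragraph above; their parameters, the price table, the documentation stub, and the shape of the tool-call dict are all hypothetical and do not follow any specific provider's tool-calling API.

```python
from typing import Callable

# Hypothetical price table standing in for a real pricing service.
_PRICES = {"pro": "29.00", "team": "99.00"}


def get_current_price(plan: str) -> str:
    """Look the price up at call time instead of relying on model recall."""
    if plan not in _PRICES:
        return f"unknown plan: {plan}"
    return _PRICES[plan]


def search_documentation(query: str) -> str:
    """Stub: a real version would query the retrieval layer."""
    docs = {"auth": "Send the API key in the Authorization header."}
    hits = [text for topic, text in docs.items() if topic in query.lower()]
    return hits[0] if hits else "no results"


TOOLS: dict[str, Callable[..., str]] = {
    "get_current_price": get_current_price,
    "search_documentation": search_documentation,
}


def dispatch(tool_call: dict) -> str:
    """Run a tool call of the assumed form {"name": ..., "arguments": {...}}."""
    name = tool_call["name"]
    if name not in TOOLS:
        return f"unknown tool: {name}"
    return TOOLS[name](**tool_call.get("arguments", {}))


result = dispatch({"name": "get_current_price", "arguments": {"plan": "pro"}})
print(result)  # 29.00
```

The model only has to decide which tool to call and with what arguments. The answer itself comes from a system you can update without touching the model.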
Dynamic memory
For agents that run across sessions, memory architecture is the design problem. Cuecoder’s current approach:
- Working memory — the current conversation context. Managed by the agent.
- Episodic memory — retrieved from a vector store, filtered by session and ranked by relevance.
- Semantic memory — long-term facts, distilled from episodic memory by a background job.
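The three tiers can be sketched in a few lines. This is an illustrative toy, not Cuecoder's implementation: the class and method names are hypothetical, a plain list stands in for the vector store, word overlap stands in for embedding similarity, the rule that working-memory overflow moves into episodic memory is an assumption, and the "key is value" pattern stands in for whatever the background job actually distills.

```python
from dataclasses import dataclass, field


@dataclass
class Episode:
    session_id: str
    text: str


@dataclass
class AgentMemory:
    working: list[str] = field(default_factory=list)        # current conversation
    episodic: list[Episode] = field(default_factory=list)   # stand-in for a vector store
    semantic: dict[str, str] = field(default_factory=dict)  # long-term facts
    max_working: int = 4

    def add_turn(self, session_id: str, text: str) -> None:
        """Append to working memory; overflow moves to episodic memory."""
        self.working.append(text)
        while len(self.working) > self.max_working:
            self.episodic.append(Episode(session_id, self.working.pop(0)))

    def recall(self, session_id: str, query: str, top_k: int = 2) -> list[str]:
        """Rank this session's episodes by word overlap with the query."""
        query_words = set(query.lower().split())
        scored = [
            (len(query_words & set(e.text.lower().split())), e.text)
            for e in self.episodic
            if e.session_id == session_id
        ]
        scored = [pair for pair in scored if pair[0] > 0]
        scored.sort(key=lambda pair: -pair[0])
        return [text for _, text in scored[:top_k]]

    def distill(self) -> None:
        """Background-job stand-in: promote 'key is value' episodes to facts."""
        for e in self.episodic:
            key, sep, value = e.text.partition(" is ")
            if sep:
                self.semantic[key.strip().lower()] = value.strip()


memory = AgentMemory(max_working=2)
for turn in ["deploy target is staging", "run the tests", "now open a PR"]:
    memory.add_turn("s1", turn)
memory.distill()
print(memory.semantic)  # {'deploy target': 'staging'}
```

Each tier has its own write path and its own read path, and none of them involves editing the system prompt.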
None of this is prompt engineering. It’s infrastructure.
The next time you’re tempted to rewrite your system prompt, ask: what would better retrieval do for this? Usually the answer is: a lot more.