Prompt engineering in production is fundamentally different from prompt engineering in a playground. In a playground, you iterate until the output looks right. In production, you need the output to be right every time, across thousands of input variations, with structured responses that downstream code can parse reliably. After integrating LLM features into eight client applications — from intelligent search to automated content moderation to document summarisation — we have developed a set of patterns that bridge the gap between demo and production.
The most important pattern is structured output enforcement. Never rely on the LLM to return well-formatted JSON by asking nicely in the prompt. Instead, use schema-based validation. With Claude, we use tool_use with a defined JSON schema that the model must conform to. With OpenAI, we use function calling or the structured outputs feature. The response is then validated against the schema using Zod on the server before any processing. If validation fails, the request is retried with a clarifying prompt that includes the validation error. This pattern has reduced our parse-failure rate from roughly 5% to under 0.1%.
Prompt versioning and testing are essential. We store prompts as versioned template files, not inline strings in application code. Each prompt has a test suite with representative inputs and expected outputs, run as part of our CI pipeline. The tests are not exact-match — they use semantic similarity scoring and structural assertions (does the output contain the required fields? is the sentiment classification one of the allowed values?) to accommodate the inherent variability of LLM outputs. When we update a prompt, the test suite tells us immediately if the change has regressed any edge cases.
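A prompt test case along these lines might look like the sketch below. The structural assertions are straightforward; for the semantic check, a real suite would call an embedding service, so the `tokenOverlap` function here is a deliberately crude stand-in for illustration only, and the 0.3 threshold is an arbitrary example value.

```typescript
interface PromptTestCase {
  input: string;
  expectedFields: string[];      // fields the output must contain
  allowedSentiments: string[];   // closed set for the classification field
  referenceOutput: string;       // used for the similarity check
}

// Crude similarity stand-in: fraction of shared tokens. A production
// suite would use embedding cosine similarity instead.
function tokenOverlap(a: string, b: string): number {
  const ta = new Set(a.toLowerCase().split(/\W+/).filter(Boolean));
  const tb = new Set(b.toLowerCase().split(/\W+/).filter(Boolean));
  if (ta.size === 0 || tb.size === 0) return 0;
  let shared = 0;
  for (const t of ta) if (tb.has(t)) shared++;
  return shared / Math.max(ta.size, tb.size);
}

// Returns a list of failure descriptions; empty means the case passed.
function runPromptTest(tc: PromptTestCase, modelOutput: string): string[] {
  const failures: string[] = [];
  let parsed: Record<string, unknown>;
  try {
    parsed = JSON.parse(modelOutput);
  } catch {
    return ["output is not valid JSON"];
  }
  // Structural assertions: required fields and allowed values.
  for (const field of tc.expectedFields) {
    if (!(field in parsed)) failures.push(`missing field: ${field}`);
  }
  if ("sentiment" in parsed && !tc.allowedSentiments.includes(parsed.sentiment as string)) {
    failures.push(`sentiment "${parsed.sentiment}" not in allowed set`);
  }
  // Semantic check: flag only if the output diverges far from the reference.
  if (typeof parsed.summary === "string" && tokenOverlap(parsed.summary, tc.referenceOutput) < 0.3) {
    failures.push("summary diverges too far from reference");
  }
  return failures;
}
```

Because the assertions report every failure rather than stopping at the first, a regressed prompt change shows all of its broken edge cases in one CI run.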
Cost management in production is a real concern. A naive implementation that sends the full document to the LLM for every request can burn through API credits quickly. We use a layered approach: fast, cheap operations (regex matching, keyword detection, embedding similarity) act as a first filter, and only requests that genuinely need LLM reasoning are sent to the model. For a content moderation system, this reduced our API costs by 80% — the LLM only evaluates the 15% of content that the rule-based filter cannot confidently classify. Combined with response caching for identical inputs and model routing (using Haiku for simple tasks and Opus for complex ones), we keep per-user LLM costs under 2 rupees per month.
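The layered pipeline can be sketched as below. Everything here is illustrative: the regex rules, the length-based routing heuristic, and the model names are placeholders, and `classifyWithModel` stands in for the actual provider call.

```typescript
type Verdict = "allow" | "block" | "uncertain";

// Response cache for identical inputs (a production system would
// bound this and likely key on a hash of the input).
const cache = new Map<string, Verdict>();

// Layer 1: cheap rule-based filter. Obvious cases never reach the LLM.
// These rules are toy examples, not a real moderation ruleset.
function ruleFilter(text: string): Verdict {
  if (/\b(buy now|free money)\b/i.test(text)) return "block";
  if (text.trim().length < 20) return "allow"; // short, benign content
  return "uncertain";
}

// Layer 3: model routing. Length is a crude proxy for task complexity;
// the model identifiers are placeholders.
function pickModel(text: string): "small-model" | "large-model" {
  return text.length > 2000 ? "large-model" : "small-model";
}

async function moderate(
  text: string,
  classifyWithModel: (model: string, text: string) => Promise<Verdict>,
): Promise<Verdict> {
  const fast = ruleFilter(text);
  if (fast !== "uncertain") return fast;        // layer 1: rules
  const cached = cache.get(text);
  if (cached !== undefined) return cached;       // layer 2: cache
  const verdict = await classifyWithModel(pickModel(text), text);
  cache.set(text, verdict);                      // layer 3: LLM, result cached
  return verdict;
}
```

The ordering matters: each layer is strictly cheaper than the one below it, so every request pays only for the cheapest layer that can confidently answer it.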