Prompt Management in Production
Platform
Prompts are config. Treat them like feature flags — versioned, rollbackable.
TPM
Prompt changes = product changes. Include in release notes and QA.
DevOps
Same CI/CD for prompts. Lint, test, deploy. No ad-hoc edits in production.
TL;DR
- Prompts are code. Version them in Git, test them, deploy via your normal pipeline.
- One typo can tank output quality. One "improvement" can double latency. Track changes.
- Monitor: latency, token usage, error rate, and — if possible — output quality (sampling, heuristics).
"Just update the prompt" is how production breaks. Here's how to do it right.
Why Prompts Need Discipline
- Fragility. "Summarize" vs "Summarise" can change behavior. Extra line breaks matter.
- Drift. Someone edits the staging prompt. Forgets to sync prod. Two systems, different behavior.
- Cost. Longer prompts = more input tokens = higher cost. Unchecked prompt growth bleeds money.
- No rollback. If a prompt change degrades quality, how do you revert? Hope you have a backup?
Prompts as Code
Store prompts in your repo. Not in the LLM provider's dashboard. Not in a random doc.
prompts/
  summarize_ticket.yaml   # or .json, .md — pick one
  suggest_labels.yaml
  faq_answer.yaml
Each file: the prompt template + metadata (model, temperature, max_tokens).
Use a template engine (Jinja, Mustache) to inject variables: {{context}}, {{question}}.
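A minimal loader sketch, assuming PyYAML and Jinja2 and the hypothetical prompts/ layout above; StrictUndefined makes a missing variable fail loudly instead of silently rendering an empty string.

import yaml
from jinja2 import Environment, StrictUndefined

def load_prompt(path: str) -> dict:
    # A prompt file is just data: template plus metadata (model, temperature, max_tokens).
    with open(path) as f:
        return yaml.safe_load(f)

def render_prompt(prompt: dict, **variables) -> str:
    # StrictUndefined raises on undefined variables -- catch that in tests, not in prod.
    env = Environment(undefined=StrictUndefined)
    return env.from_string(prompt["template"]).render(**variables)

prompt = load_prompt("prompts/summarize_ticket.yaml")
user_message = render_prompt(prompt, ticket_text="App crashes on login since the 2.3 update.")
# prompt["system"], prompt["model"], prompt["temperature"] feed the actual API call.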
Versioning
- Git. Every prompt change is a commit. You get history, diff, blame.
- Tag releases. "v1.2 of summarize_ticket" = a specific commit.
- Env-specific. Staging can point to main; prod to a tagged release. No surprises. One way to pin this is sketched below.
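A sketch of env-to-ref pinning using `git show`; the mapping, tag name, and env variable here are assumptions, not a prescribed layout.

import os
import subprocess

# Assumed mapping: staging tracks main, prod is pinned to a tagged release.
PROMPT_REFS = {"staging": "main", "prod": "summarize_ticket-v1.2"}

def load_prompt_at_ref(path: str, env: str) -> str:
    # `git show <ref>:<path>` prints the file exactly as it exists at that ref.
    ref = PROMPT_REFS[env]
    result = subprocess.run(
        ["git", "show", f"{ref}:{path}"],
        check=True, capture_output=True, text=True,
    )
    return result.stdout

content = load_prompt_at_ref("prompts/summarize_ticket.yaml", os.environ.get("APP_ENV", "staging"))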
Testing Prompts
- Unit test the template. Does it render? No undefined variables? No broken syntax?
- Regression suite. Golden set: 10–50 (input, expected_output) pairs. Run before deploy. If outputs drift, flag it (a minimal test is sketched below).
- A/B or shadow. New prompt runs in shadow; compare to current. Roll out only if metrics improve.
Golden sets are imperfect — LLMs are non-deterministic. Use temperature=0 for tests, or relax "expected" to "contains key phrases" / "satisfies a rubric."
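A single golden test, as a sketch with the relaxed "contains key phrases" assertion. `load_prompt` / `render_prompt` come from the loader sketched earlier, and `prompt_loader`, `llm_client`, and `call_llm` are hypothetical names for your own wrappers, not real libraries.

import pytest

from prompt_loader import load_prompt, render_prompt  # hypothetical module wrapping the loader above
from llm_client import call_llm  # hypothetical wrapper around your provider's SDK

GOLDEN = [
    # (ticket_text, phrases the summary must contain)
    ("Customer cannot log in after the 2.3 update and is threatening to churn.",
     ["log in", "2.3"]),
]

@pytest.mark.parametrize("ticket_text,required_phrases", GOLDEN)
def test_summarize_ticket_golden(ticket_text, required_phrases):
    prompt = load_prompt("prompts/summarize_ticket.yaml")
    output = call_llm(
        system=prompt["system"],
        user=render_prompt(prompt, ticket_text=ticket_text),
        model=prompt["model"],
        temperature=0,  # pin temperature so test runs are as repeatable as possible
    )
    # Relaxed assertion: key phrases rather than exact-match output.
    for phrase in required_phrases:
        assert phrase.lower() in output.lower()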
Monitoring
| Metric | Why |
|---|---|
| Latency (p50, p99) | Prompt length and model choice affect it. Spikes = problems |
| Token usage (in/out) | Cost control. Catch runaway prompts |
| Error rate | API failures, timeouts |
| Output length | Sudden change = prompt or model change |
| Quality (sampling) | Manual review of 1% of outputs. Expensive but catches subtle regressions |
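A per-call instrumentation sketch covering the first four metrics, assuming the OpenAI Python SDK's response shape (`usage.prompt_tokens`, `usage.completion_tokens`); swap in your provider's fields and ship the log records to whatever backs your dashboards.

import logging
import time

logger = logging.getLogger("llm")

def instrumented_call(client, *, model, messages, **kwargs):
    start = time.perf_counter()
    try:
        response = client.chat.completions.create(model=model, messages=messages, **kwargs)
    except Exception:
        # Feeds the error-rate metric: API failures, timeouts.
        logger.exception("llm_call_failed", extra={"model": model})
        raise
    latency_ms = (time.perf_counter() - start) * 1000
    logger.info("llm_call", extra={
        "model": model,
        "latency_ms": round(latency_ms, 1),
        "prompt_tokens": response.usage.prompt_tokens,           # input token cost
        "completion_tokens": response.usage.completion_tokens,   # output token cost
        "output_chars": len(response.choices[0].message.content or ""),  # sudden shifts flag changes
    })
    return response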
Rollback
If a prompt goes bad:
- Revert the commit. Redeploy.
- Or: feature flag to old prompt. Instant switch without a full deploy (sketched below).
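A sketch of the feature-flag path; `flags.get` stands in for whatever flag provider you use, and the version-to-path layout is an assumption.

def active_prompt_path(flags) -> str:
    # The flag holds the prompt version to serve; flipping it back is the rollback.
    version = flags.get("summarize_ticket_prompt_version", default="v1.2")
    return f"prompts/releases/{version}/summarize_ticket.yaml"

# prompt = load_prompt(active_prompt_path(flags))

This only works if both prompt versions ship in the deployed artifact; the flag chooses between files that are already there.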
# prompts/summarize_ticket.yaml
system: |
  You are a support ticket summarizer. Be concise. Output JSON only.
template: |
  Summarize this ticket in 2-3 sentences. Include: customer issue, priority signal.

  Ticket: {{ ticket_text }}
variables: [ticket_text]
model: gpt-4o-mini
temperature: 0
max_tokens: 200

Quick Check
A prompt change ships. Latency doubles. What should you have had in place?
Do This Next
- Audit. Where are your prompts today? Editor? Dashboard? Scattered in code?
- Extract one into a file. Add variables. Render it from code. Does it work?
- Add one golden test. Single (input, expected) pair. Run it in CI. Make it green.