Know which agent policy actually works.
Evals measure output quality. Decision Process measures business outcomes — resolution rate, escalation rate, override rate, and cost per task. Run controlled A/B tests between prompt versions, model variants, and routing rules, then read Bayesian results that tell you which policy to ship.
Experiment Templates
Ready-to-run experiments
Chain-of-Thought Prompt A/B
“A chain-of-thought system prompt reduces escalation rate and improves resolution rate.”
Conditions
- baseline_policy (control)
- cot_policy
Metrics
- Resolution Rate (%)
- Escalation Rate (%)
- Cost per Task (USD)
Model Version Comparison
“GPT-4o achieves a higher resolution rate than GPT-4 Turbo at acceptable latency.”
Conditions
- gpt4_turbo (control)
- gpt4o
Metrics
- Resolution Rate (%)
- Latency p95 (ms)
- Cost per Task (USD)
Escalation Routing Rule A/B
“Routing by confidence score reduces override rate vs. routing by category.”
Conditions
- category_routing (control)
- confidence_routing
Metrics
- Override Rate (%)
- Resolution Rate (%)
- Escalation Rate (%)
Temperature Multi-arm Comparison
“Temperature 0.3 produces the best balance of resolution rate and override rate.”
Conditions
- temp_0.0 (control)
- temp_0.3
- temp_0.7
Metrics
- Resolution Rate (%)
- Override Rate (%)
- User Satisfaction (1–5)
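Each template above has the same shape: a hypothesis, a list of conditions with one control, and a list of metrics. As a rough illustration only, that shape could be captured in a small Python structure. The class name `ExperimentTemplate`, its fields, and the metric identifiers are assumptions made for this sketch, not the product's actual schema.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class ExperimentTemplate:
    """Hypothetical container for one experiment template (not a real API)."""

    name: str
    hypothesis: str
    conditions: tuple[str, ...]  # by convention here, the first entry is the control
    metrics: tuple[str, ...]

    def __post_init__(self) -> None:
        # An A/B test needs a control plus at least one variant.
        if len(self.conditions) < 2:
            raise ValueError("need a control and at least one variant")
        if len(set(self.conditions)) != len(self.conditions):
            raise ValueError("condition names must be unique")

    @property
    def control(self) -> str:
        return self.conditions[0]

    @property
    def variants(self) -> tuple[str, ...]:
        return self.conditions[1:]


# The multi-arm temperature template, using assumed metric identifiers.
temperature_sweep = ExperimentTemplate(
    name="Temperature Multi-arm Comparison",
    hypothesis="Temperature 0.3 produces the best balance of "
               "resolution rate and override rate.",
    conditions=("temp_0.0", "temp_0.3", "temp_0.7"),
    metrics=("resolution_rate", "override_rate", "user_satisfaction"),
)
```

Treating the first condition as the control keeps two-arm and multi-arm templates in one structure; a real system might mark the control explicitly instead.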
Worked Example
Chain-of-thought vs. baseline prompt across 847 support agent sessions
A support team runs two system prompt variants in parallel. The baseline policy uses a concise instruction set. The chain-of-thought (CoT) policy adds explicit reasoning steps before action. Sessions are randomly assigned at the start of each conversation. Primary metric: resolution rate. Secondary: escalation rate and cost per task.
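One common way to assign sessions at the start of a conversation is to hash the session ID into a bucket, so the same session always lands in the same condition. The sketch below assumes string session IDs and an experiment-specific salt; the function name `assign_condition` and the salt value are hypothetical, and the source does not say how assignment is actually implemented.

```python
import hashlib
from typing import Sequence


def assign_condition(
    session_id: str,
    conditions: Sequence[str],
    salt: str = "cot-prompt-ab",  # hypothetical per-experiment salt
) -> str:
    """Deterministically map a session to one condition with equal weights."""
    if not conditions:
        raise ValueError("at least one condition is required")
    # Hash salt + session ID so assignments differ between experiments.
    digest = hashlib.sha256(f"{salt}:{session_id}".encode("utf-8")).digest()
    # Use the first 8 bytes as an integer and reduce it to a bucket index.
    bucket = int.from_bytes(digest[:8], "big") % len(conditions)
    return conditions[bucket]


arms = ["baseline_policy", "cot_policy"]
print(assign_condition("session-0001", arms))
```

Hashing is not the only option. Drawing a random arm once and storing it with the session gives the same guarantee, at the cost of an extra lookup.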
Results: resolution_rate (%)
- Baseline policy (control): mean 61.2%, 95% CI 57.8–64.6%
- CoT policy: mean 74.8%, 95% CI 71.7–77.9%
- P(CoT better than baseline) = 97%
The CoT policy raises resolution rate by 13.6 percentage points, from 61.2% to 74.8% (d = +0.71, large effect), with a 97% posterior probability that it outperforms the baseline. Escalation rate fell from 18.4% to 11.1%, a drop of 7.3 points: the CoT prompt resolves more sessions without handing off. Cost per task increased by $0.018 on average, a small trade-off given the resolution lift. Recommendation: ship the CoT policy.
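A P(better) figure for a rate metric can be estimated in several ways. One standard approach is a Beta-Binomial model: put a Beta(1, 1) prior on each arm's resolution rate, update it with the observed successes and failures, and sample both posteriors to count how often the variant wins. The sketch below assumes that model. The page does not state how Decision Process computes its posterior, and the counts in the usage lines are made up for illustration; they are not the per-arm counts from the worked example.

```python
import random


def prob_variant_better(
    control_successes: int,
    control_n: int,
    variant_successes: int,
    variant_n: int,
    draws: int = 20_000,
    seed: int = 0,
) -> float:
    """Monte Carlo estimate of P(variant rate > control rate).

    Assumes independent Beta(1, 1) priors on each arm's success rate.
    """
    rng = random.Random(seed)
    wins = 0
    for _ in range(draws):
        # Posterior for a rate is Beta(1 + successes, 1 + failures).
        p_control = rng.betavariate(
            1 + control_successes, 1 + control_n - control_successes
        )
        p_variant = rng.betavariate(
            1 + variant_successes, 1 + variant_n - variant_successes
        )
        if p_variant > p_control:
            wins += 1
    return wins / draws


# Illustrative counts only (hypothetical, not taken from the worked example).
p = prob_variant_better(
    control_successes=50, control_n=100,
    variant_successes=60, variant_n=100,
)
print(f"P(variant better) ~ {p:.2f}")
```

The estimate depends on both the gap between arms and the sample size, so the same percentage-point lift yields a higher probability as sessions accumulate.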
Run your first agent policy testing experiment
Private beta — tell us about your use case and we'll get you set up.