🤖 Agent Policy Testing

Know which agent policy actually works.

Evals measure output quality. Decision Process measures business outcomes — resolution rate, escalation rate, override rate, and cost per task. Run controlled A/B tests between prompt versions, model variants, and routing rules, then read Bayesian results that tell you which policy to ship.
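As a rough illustration of how these four outcome metrics could be derived from logged sessions, here is a minimal Python sketch. The `Session` record, its field names, and the `summarize` helper are hypothetical and not part of any Decision Process API. It also assumes each session has already been labeled as resolved, escalated, or overridden.

```python
from dataclasses import dataclass


@dataclass
class Session:
    # Hypothetical record shape; field names are assumptions for illustration.
    condition: str      # e.g. "baseline_policy" or "cot_policy"
    resolved: bool      # the agent resolved the task
    escalated: bool     # the session was handed off to a human
    overridden: bool    # a human overrode the agent's decision
    cost_usd: float     # total cost attributed to this session


def summarize(sessions):
    """Return per-condition outcome rates (as fractions) and mean cost per task."""
    groups = {}
    for s in sessions:
        groups.setdefault(s.condition, []).append(s)

    summary = {}
    for condition, rows in groups.items():
        n = len(rows)
        summary[condition] = {
            "sessions": n,
            "resolution_rate": sum(r.resolved for r in rows) / n,
            "escalation_rate": sum(r.escalated for r in rows) / n,
            "override_rate": sum(r.overridden for r in rows) / n,
            "cost_per_task": sum(r.cost_usd for r in rows) / n,
        }
    return summary
```

Multiplying the rate fields by 100 gives the percentage form used in the templates below.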

Experiment Templates

Ready-to-run experiments

Chain-of-Thought Prompt A/B

“A chain-of-thought system prompt reduces escalation rate and improves resolution rate.”

Conditions

baseline_policy (control)
cot_policy

Metrics

Resolution Rate (%)
Escalation Rate (%)
Cost per Task (USD)

Agent session
+13.6pp resolution rate

Model Version Comparison

“GPT-4o achieves higher resolution rate than GPT-4-turbo at acceptable latency.”

Conditions

gpt4_turbo (control)
gpt4o

Metrics

Resolution Rate (%)
Latency p95 (ms)
Cost per Task (USD)

Task run
−28% cost per task

Escalation Routing Rule A/B

“Routing by confidence score reduces override rate vs. routing by category.”

Conditions

category_routing (control)
confidence_routing

Metrics

Override Rate (%)
Resolution Rate (%)
Escalation Rate (%)

Agent session
−7.3pp escalation rate

Temperature Multi-arm Comparison

“Temperature 0.3 produces the best balance of resolution rate and override rate.”

Conditions

temp_0.0 (control)
temp_0.3
temp_0.7

Metrics

Resolution Rate (%)
Override Rate (%)
User Satisfaction (1–5)

Task run
+0.4 satisfaction pts

Worked Example

Chain-of-thought vs. baseline prompt across 847 support agent sessions

A support team runs two system prompt variants in parallel. The baseline policy uses a concise instruction set. The chain-of-thought (CoT) policy adds explicit reasoning steps before action. Sessions are randomly assigned at the start of each conversation. Primary metric: resolution rate. Secondary: escalation rate and cost per task.
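One common way to assign sessions randomly but reproducibly at conversation start is to hash the session ID into a bucket. The sketch below assumes that approach; the page does not say how assignment is actually implemented, and `assign_condition` and the experiment key are hypothetical names.

```python
import hashlib

CONDITIONS = ["baseline_policy", "cot_policy"]


def assign_condition(session_id: str, experiment: str = "cot_prompt_ab") -> str:
    """Map a session ID to one condition, stably and roughly uniformly.

    Hashing the experiment key together with the session ID means the same
    session always gets the same condition, and different experiments split
    traffic independently of each other.
    """
    digest = hashlib.sha256(f"{experiment}:{session_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % len(CONDITIONS)
    return CONDITIONS[bucket]


if __name__ == "__main__":
    # Illustrative session IDs only.
    for sid in ["session-001", "session-002", "session-003"]:
        print(sid, "->", assign_condition(sid))
```

Assigning once per conversation, rather than per message, keeps every turn of a session under the same policy.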

Results: resolution_rate (%)

Baseline policy (control)
mean: 61.2%
95% CI: 57.8–64.6

CoT policy
mean: 74.8%
95% CI: 71.7–77.9

P(better) = 97%

The CoT policy raises resolution rate by 13.6 percentage points, from 61.2% to 74.8% (d = +0.71, large effect), with a 97% posterior probability that it outperforms the baseline. Escalation rate fell from 18.4% to 11.1%, so the CoT policy resolves more sessions without handing off to a human. Cost per task rose by $0.018 on average, a small trade-off given the resolution lift. Recommendation: ship the CoT policy.
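A posterior probability such as P(better) can be estimated by placing a Beta posterior on each arm's resolution rate and sampling from both. The sketch below assumes a uniform Beta(1, 1) prior and binary resolved/not-resolved outcomes. It is not necessarily the model behind the figures above, and the counts in the usage line are made-up placeholders, since the per-arm session counts are not given here.

```python
import random


def prob_b_beats_a(success_a, n_a, success_b, n_b, draws=20000, seed=0):
    """Monte Carlo estimate of P(rate_b > rate_a) under Beta-Binomial posteriors.

    Each arm's rate gets a Beta(1 + successes, 1 + failures) posterior,
    which corresponds to a uniform Beta(1, 1) prior (an assumption).
    """
    rng = random.Random(seed)  # fixed seed so repeated runs agree
    wins = 0
    for _ in range(draws):
        rate_a = rng.betavariate(1 + success_a, 1 + n_a - success_a)
        rate_b = rng.betavariate(1 + success_b, 1 + n_b - success_b)
        wins += rate_b > rate_a
    return wins / draws


if __name__ == "__main__":
    # Placeholder counts: 30 of 50 resolved (control) vs. 38 of 50 (variant).
    print(prob_b_beats_a(30, 50, 38, 50))
```

The same function applies to escalation or override rate if "success" is redefined as the event of interest. For those metrics a lower rate is better, so the quantity to report would be one minus the returned value.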


Run your first agent policy testing experiment

Private beta — tell us about your use case and we'll get you set up.