🤖 Agent Policy Testing

Know which agent policy actually works.

Evals measure output quality. Decision Process measures business outcomes — resolution rate, escalation rate, override rate, and cost per task. Run controlled A/B tests between prompt versions, model variants, and routing rules, then read Bayesian results that tell you which policy to ship.
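As a rough illustration of how the four outcome metrics could be derived from logged sessions, here is a minimal sketch. The `Session` record, its field names, and the `summarize` helper are hypothetical and not part of the product; it also assumes one task per session, so cost per task is simply the mean session cost.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Session:
    """Hypothetical per-session log record; field names are assumptions."""
    resolved: bool     # the agent resolved the task
    escalated: bool    # the session was escalated
    overridden: bool   # a reviewer overrode the agent's action
    cost_usd: float    # total spend attributed to the session


def summarize(sessions: List[Session]) -> Dict[str, float]:
    """Return the outcome metrics as fractions (rates) and USD (cost)."""
    n = len(sessions)
    if n == 0:
        raise ValueError("need at least one session")
    return {
        "resolution_rate": sum(s.resolved for s in sessions) / n,
        "escalation_rate": sum(s.escalated for s in sessions) / n,
        "override_rate": sum(s.overridden for s in sessions) / n,
        # Assumes one task per session.
        "cost_per_task": sum(s.cost_usd for s in sessions) / n,
    }
```

Computing these per condition, rather than over all traffic, is what makes the A/B comparison possible.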

Experiment Templates

Ready-to-run experiments

Chain-of-Thought Prompt A/B

A chain-of-thought system prompt reduces escalation rate and improves resolution rate.

Conditions

  • baseline_policy (control)
  • cot_policy

Metrics

  • Resolution Rate (%)
  • Escalation Rate (%)
  • Cost per Task (USD)
Agent session · +13.6pp resolution rate

Model Version Comparison

GPT-4o achieves a higher resolution rate than GPT-4 Turbo at acceptable latency.

Conditions

  • gpt4_turbo (control)
  • gpt4o

Metrics

  • Resolution Rate (%)
  • Latency p95 (ms)
  • Cost per Task (USD)
Task run · −28% cost per task

Escalation Routing Rule A/B

Routing by confidence score reduces override rate vs. routing by category.

Conditions

  • category_routing (control)
  • confidence_routing

Metrics

  • Override Rate (%)
  • Resolution Rate (%)
  • Escalation Rate (%)
Agent session · −7.3pp escalation rate

Temperature Multi-arm Comparison

Temperature 0.3 produces the best balance of resolution rate and override rate.

Conditions

  • temp_0.0 (control)
  • temp_0.3
  • temp_0.7

Metrics

  • Resolution Rate (%)
  • Override Rate (%)
  • User Satisfaction (1–5)
Task run · +0.4 satisfaction pts

Worked Example

Chain-of-thought vs. baseline prompt across 847 support agent sessions

A support team runs two system prompt variants in parallel. The baseline policy uses a concise instruction set. The chain-of-thought (CoT) policy adds explicit reasoning steps before action. Sessions are randomly assigned at the start of each conversation. Primary metric: resolution rate. Secondary: escalation rate and cost per task.
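One way to assign each conversation to a condition at its start is sketched below. It uses deterministic hashing of the session ID as a stand-in for a random draw, so the same session always maps to the same policy. This is an assumption about implementation, not a description of how the product assigns sessions; `assign_condition` and the experiment key `cot_prompt_ab` are hypothetical names.

```python
import hashlib

# Condition names taken from the Chain-of-Thought Prompt A/B template.
CONDITIONS = ["baseline_policy", "cot_policy"]


def assign_condition(session_id: str, experiment: str = "cot_prompt_ab") -> str:
    """Map a session to a condition with a stable, roughly uniform hash.

    Salting with the experiment key (hypothetical) keeps assignments
    independent across experiments that share session IDs.
    """
    key = f"{experiment}:{session_id}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    bucket = int.from_bytes(digest[:8], "big") % len(CONDITIONS)
    return CONDITIONS[bucket]
```

Calling `assign_condition(session_id)` once when the conversation opens, and logging the result with the session, keeps every later turn on the same policy.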

Results: resolution_rate (%)

Baseline policy (control)

mean: 61.2%

95% CI: 57.8–64.6

CoT policy

mean: 74.8%

95% CI: 71.7–77.9

P(better) = 97%

The CoT policy improves resolution rate by 13.6 percentage points, from 61.2% to 74.8% (d = +0.71, a medium-to-large effect by Cohen's conventions), with a 97% posterior probability that it outperforms the baseline. Escalation rate fell from 18.4% to 11.1%, so the CoT prompt resolves more sessions without handing off. Cost per task increased by $0.018 on average, a small trade-off given the resolution lift. Recommendation: ship the CoT policy.
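A P(better) figure for a binary metric such as resolution rate can be estimated with a Beta-Binomial model. The sketch below assumes independent uniform Beta(1, 1) priors and Monte Carlo sampling; the product's actual model is not documented here, and `prob_better` is a hypothetical helper. The counts in the usage lines are illustrative only, since the per-arm session counts for this example are not published.

```python
import random


def prob_better(control_successes: int, control_n: int,
                treatment_successes: int, treatment_n: int,
                draws: int = 20000, seed: int = 0) -> float:
    """Estimate P(treatment rate > control rate).

    Assumes independent Beta(1, 1) priors on each arm's success rate,
    so each posterior is Beta(1 + successes, 1 + failures).
    """
    rng = random.Random(seed)
    a_c, b_c = 1 + control_successes, 1 + control_n - control_successes
    a_t, b_t = 1 + treatment_successes, 1 + treatment_n - treatment_successes
    wins = 0
    for _ in range(draws):
        # Draw one plausible rate per arm and compare them.
        if rng.betavariate(a_t, b_t) > rng.betavariate(a_c, b_c):
            wins += 1
    return wins / draws


# Illustrative counts only (not the data from this example).
p = prob_better(control_successes=61, control_n=100,
                treatment_successes=75, treatment_n=100)
print(f"P(better) = {p:.1%}")
```

Sampling keeps the sketch dependency-free; a closed-form or numerical-integration approach would give the same quantity without Monte Carlo noise.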

Run your first agent policy testing experiment

Private beta — tell us about your use case and we'll get you set up.