# Experiments
A/B test agent configurations to find what works best.
Experiments let you A/B test different agent configurations to measure which version performs better. Change a prompt, split traffic, and let data decide.
## Why experiment
Agent prompts are hard to get right. A change that seems better might:
- Increase token usage without improving quality
- Work well for one task type but poorly for others
- Help senior developers but confuse junior ones
Experiments give you statistical confidence before rolling out changes.
## Experiment lifecycle

### 1. Create
Define what you're testing:
```bash
crewkit experiments create rails-expert
```

Or create the experiment through the dashboard: Experiments > New Experiment.
Specify:
- Agent — Which agent to test
- Control — The current version
- Variant — Your proposed changes
- Traffic split — Percentage of sessions using the variant (default: 50%)
### 2. Run
Sessions are automatically assigned to control or variant based on the traffic split (see the assignment sketch after this list). crewkit tracks metrics for both groups:
- Session count
- Success rate
- Average cost
- Average turns
- Token usage
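The docs don't specify crewkit's exact assignment mechanism, but the usual approach for this kind of split is deterministic hash-based bucketing: hash the session ID together with the experiment name so a session always lands in the same group, and so assignments stay independent across experiments. A minimal sketch, assuming that scheme (the function and field names are illustrative, not crewkit's API):

```python
import hashlib

def assign_variant(session_id: str, experiment: str, traffic_split: float = 0.5) -> str:
    """Deterministically bucket a session into 'control' or 'variant'.

    Hashing session_id together with the experiment name keeps the
    assignment stable across requests and uncorrelated between
    experiments. traffic_split is the fraction routed to the variant.
    """
    digest = hashlib.sha256(f"{experiment}:{session_id}".encode()).hexdigest()
    # Map the first 8 hex chars to a uniform value in [0, 1).
    bucket = int(digest[:8], 16) / 16**8
    return "variant" if bucket < traffic_split else "control"

# A given session always gets the same assignment:
assert assign_variant("sess-123", "rails-expert") == assign_variant("sess-123", "rails-expert")
```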
### 3. Measure
View results in the dashboard or CLI:
```bash
crewkit experiments metrics swift-amber-falcon
```

crewkit calculates statistical significance (p-value) so you know when you have enough data to make a decision.
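The exact test crewkit runs isn't documented here, but for a difference in success rates the standard choice is a two-proportion z-test. A hedged sketch of how such a p-value can be computed (illustrative math, not crewkit's implementation):

```python
from math import erf, sqrt

def two_proportion_p_value(succ_a: int, n_a: int, succ_b: int, n_b: int) -> float:
    """Two-sided p-value for a difference in success rates.

    Pooled two-proportion z-test: under the null hypothesis both
    variants share a single underlying success rate.
    """
    p_a, p_b = succ_a / n_a, succ_b / n_b
    pooled = (succ_a + succ_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    # Two-sided tail probability from the standard normal CDF.
    return 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))

# e.g. control: 180/300 successful sessions, variant: 210/300
print(two_proportion_p_value(180, 300, 210, 300))  # ≈ 0.010
```

With a p-value that small you would typically treat the variant's higher success rate as real rather than noise; a common convention is to deploy only once p_value drops below 0.05.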
### 4. Decide
When results are significant:
```bash
crewkit experiments deploy swift-amber-falcon
```

This promotes the winning variant to the project or organization level. The experiment is archived.
## Metrics tracked
| Metric | Description |
|---|---|
| `session_count` | Number of sessions per variant |
| `success_rate` | Percentage of successful sessions |
| `avg_cost` | Mean cost per session |
| `avg_turns` | Mean conversation turns per session |
| `p95_cost` | 95th percentile cost per session |
| `p_value` | Statistical significance |
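As one illustration of how such per-variant aggregates can be derived from raw session records (the record fields here are hypothetical, not crewkit's data model):

```python
from statistics import mean, quantiles

def summarize(sessions: list[dict]) -> dict:
    """Aggregate raw session records into the metrics above.

    Each record is assumed to look like:
    {"success": bool, "cost": float, "turns": int}
    """
    costs = [s["cost"] for s in sessions]
    return {
        "session_count": len(sessions),
        "success_rate": mean(s["success"] for s in sessions),
        "avg_cost": mean(costs),
        "avg_turns": mean(s["turns"] for s in sessions),
        # quantiles(..., n=100) returns the 1st..99th percentiles.
        "p95_cost": quantiles(costs, n=100)[94],
    }
```

Note that `p95_cost` complements `avg_cost`: a variant can look cheaper on average while producing a heavier tail of expensive sessions.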
## Managing experiments
```bash
crewkit experiments list                          # List all experiments
crewkit experiments show swift-amber-falcon       # View details
crewkit experiments metrics swift-amber-falcon    # View metrics
crewkit experiments deploy swift-amber-falcon     # Deploy winner
```