# Experiments
A/B test agent configurations to find what works best.
Experiments let you A/B test different agent configurations to measure which version performs better. Change a prompt, split traffic, and let data decide.
## Why experiment
Agent prompts are hard to get right. A change that seems better might:
- Increase token usage without improving quality
- Work well for one task type but poorly for others
- Help senior developers but confuse junior ones
Experiments give you statistical confidence before rolling out changes.
## Experiment lifecycle

### 1. Create
Define what you're testing:
```bash
crewkit experiments create rails-expert
```

Or create the experiment through the dashboard: Experiments > New Experiment.
Specify:
- Agent — Which agent to test
- Control — The current version
- Variant — Your proposed changes
- Traffic split — Percentage of sessions using the variant (default: 50%)
### 2. Run
Sessions are automatically assigned to control or variant based on the traffic split (see the assignment sketch after this list). crewkit tracks metrics for both groups:
- Session count
- Success rate
- Average cost
- Average turns
- Token usage
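The docs don't specify crewkit's exact assignment mechanism, but the usual approach for this kind of split is deterministic hash-based bucketing: hash the session ID together with the experiment name so a session always lands in the same group, and so assignments stay independent across experiments. A minimal sketch, assuming that scheme (the function and field names are illustrative, not crewkit's API):

```python
import hashlib

def assign_variant(session_id: str, experiment: str, traffic_split: float = 0.5) -> str:
    """Deterministically bucket a session into 'control' or 'variant'.

    Hashing session_id together with the experiment name keeps the
    assignment stable across requests and uncorrelated between
    experiments. traffic_split is the fraction routed to the variant.
    """
    digest = hashlib.sha256(f"{experiment}:{session_id}".encode()).hexdigest()
    # Map the first 8 hex chars to a uniform value in [0, 1).
    bucket = int(digest[:8], 16) / 16**8
    return "variant" if bucket < traffic_split else "control"

# A given session always gets the same assignment:
assert assign_variant("sess-123", "rails-expert") == assign_variant("sess-123", "rails-expert")
```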
### 3. Measure
View results in the dashboard or CLI:
```bash
crewkit experiments metrics swift-amber-falcon
```

crewkit calculates statistical significance (p-value) so you know when you have enough data to make a decision.
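The exact test crewkit runs isn't documented here, but for a difference in success rates the standard choice is a two-proportion z-test. A hedged sketch of how such a p-value can be computed (illustrative math, not crewkit's implementation):

```python
from math import erf, sqrt

def two_proportion_p_value(succ_a: int, n_a: int, succ_b: int, n_b: int) -> float:
    """Two-sided p-value for a difference in success rates.

    Pooled two-proportion z-test: under the null hypothesis both
    variants share a single underlying success rate.
    """
    p_a, p_b = succ_a / n_a, succ_b / n_b
    pooled = (succ_a + succ_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    # Two-sided tail probability from the standard normal CDF.
    return 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))

# e.g. control: 180/300 successful sessions, variant: 210/300
print(two_proportion_p_value(180, 300, 210, 300))  # ≈ 0.010
```

With a p-value that small you would typically treat the variant's higher success rate as real rather than noise; a common convention is to deploy only once p_value drops below 0.05.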
### 4. Decide
When results are significant:
```bash
crewkit experiments deploy swift-amber-falcon
```

This promotes the winning variant to the project or organization level. The experiment is archived.
## Metrics tracked
| Metric | Description |
|---|---|
| `session_count` | Number of sessions per variant |
| `success_rate` | Percentage of successful sessions |
| `avg_cost` | Mean cost per session |
| `avg_turns` | Mean conversation turns per session |
| `p95_cost` | 95th percentile cost per session |
| `p_value` | Statistical significance |
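As one illustration of how such per-variant aggregates can be derived from raw session records (the record fields here are hypothetical, not crewkit's data model):

```python
from statistics import mean, quantiles

def summarize(sessions: list[dict]) -> dict:
    """Aggregate raw session records into the metrics above.

    Each record is assumed to look like:
    {"success": bool, "cost": float, "turns": int}
    """
    costs = [s["cost"] for s in sessions]
    return {
        "session_count": len(sessions),
        "success_rate": mean(s["success"] for s in sessions),
        "avg_cost": mean(costs),
        "avg_turns": mean(s["turns"] for s in sessions),
        # quantiles(..., n=100) returns the 1st..99th percentiles.
        "p95_cost": quantiles(costs, n=100)[94],
    }
```

Note that `p95_cost` complements `avg_cost`: a variant can look cheaper on average while producing a heavier tail of expensive sessions.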
## Managing experiments
```bash
crewkit experiments list                          # List all experiments
crewkit experiments show swift-amber-falcon       # View details
crewkit experiments metrics swift-amber-falcon    # View metrics
crewkit experiments deploy swift-amber-falcon     # Deploy winner
```