Experiments

A/B test agent configurations to find what works best.

Experiments let you A/B test different agent configurations to measure which version performs better. Change a prompt, split traffic, and let data decide.


Why experiment

Agent prompts are hard to get right. A change that seems better might:

  • Increase token usage without improving quality
  • Work well for one task type but poorly for others
  • Help senior developers but confuse junior ones

Experiments give you statistical confidence before rolling out changes.


Experiment lifecycle

1. Create

Define what you're testing:

crewkit experiments create rails-expert

Or through the dashboard: Experiments > New Experiment.

Specify:

  • Agent — Which agent to test
  • Control — The current version
  • Variant — Your proposed changes
  • Traffic split — Percentage of sessions using the variant (default: 50%)
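
All four values can plausibly be passed as flags instead of being prompted for. The flag names below are illustrative assumptions, not documented crewkit options; check the CLI's help output for what it actually accepts:

# Hypothetical flags, shown only to illustrate the four parameters above
crewkit experiments create rails-expert \
  --agent rails-expert \
  --control current \
  --variant ./rails-expert-v2.md \
  --traffic-split 50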

2. Run

Sessions are automatically assigned to control or variant based on the traffic split (one common assignment scheme is sketched after the list below). crewkit tracks metrics for both groups:

  • Session count
  • Success rate
  • Average cost
  • Average turns
  • Token usage
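
crewkit's exact bucketing scheme isn't specified on this page, but deterministic hash-based assignment is the usual approach for this kind of split. The Python sketch below is illustrative only: hashing the experiment and session IDs gives every session a stable bucket, so it always sees the same configuration.

import hashlib

def assign_group(experiment_id: str, session_id: str, traffic_split: int = 50) -> str:
    """Assign roughly `traffic_split` percent of sessions to the variant."""
    # Hash (experiment, session) so assignment is stable per session and
    # independent across experiments.
    digest = hashlib.sha256(f"{experiment_id}:{session_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) % 100  # pseudo-uniform value in 0-99
    return "variant" if bucket < traffic_split else "control"

print(assign_group("swift-amber-falcon", "session-0042"))  # "variant" or "control"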

3. Measure

View results in the dashboard or CLI:

crewkit experiments metrics swift-amber-falcon

crewkit calculates statistical significance (p-value) so you know when you have enough data to make a decision.
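
The exact test crewkit runs isn't documented here; for comparing success rates, a two-proportion z-test is a standard choice. The sketch below (with made-up counts, not real crewkit output) shows how such a p-value falls out of the two groups' success counts:

from math import erf, sqrt

def success_rate_p_value(ok_a: int, n_a: int, ok_b: int, n_b: int) -> float:
    """Two-sided two-proportion z-test: H0 says both groups share one rate."""
    p_pool = (ok_a + ok_b) / (n_a + n_b)  # pooled success rate under H0
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (ok_a / n_a - ok_b / n_b) / se
    # Two-sided tail probability under the standard normal CDF
    return 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))

# Made-up example: control 180/240 successful, variant 205/245 successful
print(success_rate_p_value(180, 240, 205, 245))  # ~0.018

Until the p-value drops below your chosen threshold (0.05 is conventional), treat the difference as noise and keep collecting sessions.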

4. Decide

When results are significant:

crewkit experiments deploy swift-amber-falcon

This promotes the winning variant's configuration to the project or organization level and archives the experiment.


Metrics tracked

Metric          Description
session_count   Number of sessions per variant
success_rate    Percentage of successful sessions
avg_cost        Mean cost per session
avg_turns       Mean conversation turns
p95_cost        95th percentile cost
p_value         Statistical significance

Managing experiments

crewkit experiments list                        # List all experiments
crewkit experiments show swift-amber-falcon     # View details
crewkit experiments metrics swift-amber-falcon  # View metrics
crewkit experiments deploy swift-amber-falcon   # Deploy winner

Next steps

  • Inheritance
  • Analytics
