Baselines
A baseline is a JSON file that records the expected behavioral patterns of an agent. It stores statistics per intent (an intent is a distinct execution pattern), computed over a rolling window of successful runs.
Baseline structure
{
  "agent_id": "my_agent",
  "updated_at": "2024-01-01T00:00:00Z",
  "intents": {
    "a1b2c3d4e5f6...": {
      "fingerprint_hash": "a1b2c3d4...",
      "accepted_fingerprints": ["a1b2c3d4...", "e5f6g7h8..."],
      "step_count": 5,
      "avg_tokens": 1500.0,
      "avg_latency": 250.0,
      "latency_bounds": {
        "min_ms": 150, "p5_ms": 160, "p50_ms": 230,
        "p95_ms": 400, "max_ms": 500, "avg_ms": 250
      },
      "sample_count": 10,
      "tool_call_counts": {"search": 2, "summarize": 1},
      "error_rate": 0.0
    }
  }
}
Generating baselines
spooled ci update-baseline \
  --from .spooled/traces/ \
  --out baselines/ \
  --min-runs 3
The --min-runs flag ensures each intent has enough samples for stable bounds. The baseline manager uses a rolling window (default: last 10 successful runs per intent, configurable via SPOOLED_BASELINE_WINDOW_SIZE, max 100).
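The interaction between the rolling window and the minimum run count can be sketched in Python. This is an illustration under stated assumptions, not Spooled's implementation: the helper names `window_size` and `latency_bounds` are hypothetical, clamping an oversized window setting to 100 is a guess at how the maximum is enforced, and the percentile method (`statistics.quantiles` with `method="inclusive"`) may differ from what the baseline manager uses.

```python
from collections import deque
from statistics import mean, quantiles

DEFAULT_WINDOW = 10  # documented default window size
MAX_WINDOW = 100     # documented maximum
MIN_RUNS = 3         # mirrors --min-runs 3 in the command above


def window_size(env):
    """Read the window size from an environment-like mapping.

    Assumption: values above the maximum are clamped rather than rejected.
    """
    raw = env.get("SPOOLED_BASELINE_WINDOW_SIZE", DEFAULT_WINDOW)
    return min(int(raw), MAX_WINDOW)


def latency_bounds(latencies_ms, min_runs=MIN_RUNS):
    """Summarize latencies, or return None when there are too few samples."""
    if len(latencies_ms) < min_runs:
        return None
    data = sorted(latencies_ms)
    # quantiles(n=100) returns 99 cut points; index k-1 holds percentile k.
    cuts = quantiles(data, n=100, method="inclusive")
    return {
        "min_ms": data[0],
        "p5_ms": cuts[4],
        "p50_ms": cuts[49],
        "p95_ms": cuts[94],
        "max_ms": data[-1],
        "avg_ms": mean(data),
    }


# A deque with maxlen drops the oldest run once the window is full.
window = deque(maxlen=window_size({}))
for latency in range(90, 210, 10):  # 12 successful runs: 90, 100, ..., 200
    window.append(latency)

bounds = latency_bounds(window)
```

With twelve runs and a window of ten, the two oldest latencies (90 and 100) fall out of the window before the bounds are computed. An intent with fewer than three runs yields `None` here, which stands in for "not enough samples to write bounds".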
Comparison statuses
| Status | Description | Action |
|---|---|---|
| match | Fingerprint exactly matches a baseline intent | Pass — expected behavior |
| accepted_variant | Matches an accepted fingerprint variant | Pass — previously approved change |
| variant | Similar (≥75%) but not exact match | Review — may need acceptance or fix |
| new | Structurally different from all intents | Review — new execution pattern |
| no_baseline | No baseline exists for this agent | Generate a baseline first |
| structural_match | Matches via IO-schema hash (rename-resilient) | Pass — same structure, different names |
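The decision order implied by the table can be sketched as follows. This is a hedged illustration, not Spooled's comparison logic: the `classify` function, the `ACTIONS` mapping, and the caller-supplied `similarity` callable are all hypothetical, and the order in which the checks run is an assumption. `structural_match` is left out of the classifier because the baseline structure shown earlier does not expose an IO-schema hash to compare against; it appears only in the action mapping.

```python
SIMILARITY_THRESHOLD = 0.75  # from the "variant" row: similar (>= 75%)

# Action column of the table, reduced to three outcomes.
ACTIONS = {
    "match": "pass",
    "accepted_variant": "pass",
    "structural_match": "pass",
    "variant": "review",
    "new": "review",
    "no_baseline": "generate_baseline",
}


def classify(fingerprint, baseline, similarity):
    """Return a comparison status for one run's fingerprint.

    baseline   -- parsed baseline dict, or None when no baseline file exists
    similarity -- hypothetical callable (fingerprint, intent) -> float in [0, 1]
    """
    if baseline is None or not baseline.get("intents"):
        return "no_baseline"
    intents = list(baseline["intents"].values())
    # Exact match against the primary fingerprint of any intent.
    if any(i["fingerprint_hash"] == fingerprint for i in intents):
        return "match"
    # Previously approved variants.
    if any(fingerprint in i.get("accepted_fingerprints", []) for i in intents):
        return "accepted_variant"
    # Close to an existing intent, but not identical.
    best = max(similarity(fingerprint, i) for i in intents)
    if best >= SIMILARITY_THRESHOLD:
        return "variant"
    return "new"


example_baseline = {
    "agent_id": "my_agent",
    "intents": {
        "intent-1": {
            "fingerprint_hash": "aaaa",
            "accepted_fingerprints": ["aaaa", "bbbb"],
        }
    },
}
```

A CI gate built on this would fail the job only for statuses that map to `review` or `generate_baseline`, for example `ACTIONS[classify(fp, baseline, similarity)] == "pass"`.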
Accepting variants
When a variant is intentional (e.g., after a prompt update), accept it:
spooled ci accept-variant \
  --intent a1b2c3d4 \
  --fingerprint e5f6g7h8 \
  --baseline baselines/my_agent.json \
  --reason "prompt update" \
  --by "andy"
This adds the fingerprint to the intent's accepted_fingerprints list with provenance metadata (who accepted, when, why).
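The effect on the baseline can be pictured with a small in-memory sketch. The source does not show how provenance is stored, so the `acceptance_log` field and its keys below are hypothetical, as is the `accept_variant` function itself; the real file layout may differ. The sketch also assumes the intent is looked up by its exact key.

```python
from datetime import datetime, timezone


def accept_variant(baseline, intent_id, fingerprint, reason, accepted_by, now=None):
    """Add a fingerprint to an intent's accepted list and record who approved it.

    Mutates and returns the baseline dict. Accepting the same fingerprint
    twice is a no-op.
    """
    intent = baseline["intents"][intent_id]
    accepted = intent.setdefault("accepted_fingerprints", [])
    if fingerprint in accepted:
        return baseline
    accepted.append(fingerprint)
    when = now or datetime.now(timezone.utc)
    # HYPOTHETICAL field: the source only says provenance records
    # who accepted, when, and why.
    intent.setdefault("acceptance_log", []).append(
        {
            "fingerprint": fingerprint,
            "accepted_by": accepted_by,
            "accepted_at": when.isoformat(),
            "reason": reason,
        }
    )
    return baseline


demo = {
    "intents": {
        "a1b2c3d4": {"fingerprint_hash": "a1b2c3d4", "accepted_fingerprints": ["a1b2c3d4"]}
    }
}
fixed_time = datetime(2024, 1, 1, tzinfo=timezone.utc)
accept_variant(demo, "a1b2c3d4", "e5f6g7h8", "prompt update", "andy", now=fixed_time)
```

After the call, `demo["intents"]["a1b2c3d4"]["accepted_fingerprints"]` contains both fingerprints, so a later run with fingerprint `e5f6g7h8` would compare as `accepted_variant` rather than `variant`.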
Git workflow
Baselines should be committed to git alongside your code:
- Review baseline changes in PRs — see exactly what behavioral patterns changed
- Diff baselines across branches
- Version behavior with the same rigor as code
- Roll back baselines with git revert
Spooled Score
The Spooled Score is a composite 0–100 stability metric computed from four weighted components:
| Component | Weight |
|---|---|
| Structural (fingerprint match) | 40% |
| Signal health (detected signals) | 25% |
| Metric stability (latency/tokens) | 20% |
| Trend consistency (last 10 comparisons) | 15% |
Grades: A (90–100), B (75–89), C (60–74), D (40–59), F (0–39).
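The weighting and grade bands can be worked through in a few lines of Python. This is a sketch under assumptions: each component is taken to be scored on its own 0–100 scale, the composite is rounded to the nearest integer before grading, and the names `spooled_score` and `grade` are illustrative rather than part of Spooled's API.

```python
# Weights from the table above, as integer percentages.
WEIGHTS = {
    "structural": 40,
    "signal_health": 25,
    "metric_stability": 20,
    "trend_consistency": 15,
}

# Lower bound of each grade band, highest first.
GRADE_FLOORS = [(90, "A"), (75, "B"), (60, "C"), (40, "D"), (0, "F")]


def spooled_score(components):
    """Combine four 0-100 component scores into one 0-100 composite."""
    missing = WEIGHTS.keys() - components.keys()
    if missing:
        raise ValueError(f"missing components: {sorted(missing)}")
    for name in WEIGHTS:
        if not 0 <= components[name] <= 100:
            raise ValueError(f"{name} must be between 0 and 100")
    weighted = sum(WEIGHTS[name] * components[name] for name in WEIGHTS)
    # Assumption: round to an integer so the score falls in exactly one band.
    return round(weighted / 100)


def grade(score):
    """Map a 0-100 score to its letter grade."""
    for floor, letter in GRADE_FLOORS:
        if score >= floor:
            return letter
    raise ValueError("score must be between 0 and 100")


example = {
    "structural": 100,
    "signal_health": 80,
    "metric_stability": 70,
    "trend_consistency": 60,
}
score = spooled_score(example)
```

For the example components the weighted sum is 40 + 20 + 14 + 9 = 83, which falls in the B band. Because the structural component carries 40% of the weight, an agent that scores zero on it can reach at most 60, a C, even with perfect scores elsewhere.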