Baselines

A baseline is a JSON file that records the expected behavioral patterns of an agent. For each intent (a distinct execution pattern), it stores statistics computed over a rolling window of successful runs.

Baseline structure

{
  "agent_id": "my_agent",
  "updated_at": "2024-01-01T00:00:00Z",
  "intents": {
    "a1b2c3d4e5f6...": {
      "fingerprint_hash": "a1b2c3d4...",
      "accepted_fingerprints": ["a1b2c3d4...", "e5f6g7h8..."],
      "step_count": 5,
      "avg_tokens": 1500.0,
      "avg_latency": 250.0,
      "latency_bounds": {
        "min_ms": 150, "p5_ms": 160, "p50_ms": 230,
        "p95_ms": 400, "max_ms": 500, "avg_ms": 250
      },
      "sample_count": 10,
      "tool_call_counts": {"search": 2, "summarize": 1},
      "error_rate": 0.0
    }
  }
}
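To make the structure concrete, here is a minimal sketch that reads a baseline and prints one summary row per intent. The summarize_baseline helper and the shortened intent and fingerprint names are hypothetical and only illustrate the fields shown above; they are not part of Spooled.

```python
import json

# Sample baseline mirroring the structure above (identifiers are illustrative).
BASELINE_JSON = """
{
  "agent_id": "my_agent",
  "updated_at": "2024-01-01T00:00:00Z",
  "intents": {
    "intent_a": {
      "fingerprint_hash": "fp_a",
      "accepted_fingerprints": ["fp_a", "fp_b"],
      "step_count": 5,
      "avg_tokens": 1500.0,
      "latency_bounds": {"min_ms": 150, "p5_ms": 160, "p50_ms": 230,
                         "p95_ms": 400, "max_ms": 500, "avg_ms": 250},
      "sample_count": 10,
      "tool_call_counts": {"search": 2, "summarize": 1},
      "error_rate": 0.0
    }
  }
}
"""


def summarize_baseline(baseline: dict) -> list:
    """Return one summary row per intent (hypothetical helper, not a Spooled API)."""
    rows = []
    for intent_id, stats in baseline.get("intents", {}).items():
        bounds = stats.get("latency_bounds", {})
        rows.append({
            "intent": intent_id,
            "samples": stats.get("sample_count", 0),
            "p50_ms": bounds.get("p50_ms"),
            "p95_ms": bounds.get("p95_ms"),
            "tools": sorted(stats.get("tool_call_counts", {})),
        })
    return rows


baseline = json.loads(BASELINE_JSON)
summary = summarize_baseline(baseline)
for row in summary:
    print(row)
```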

Generating baselines

spooled ci update-baseline \
    --from .spooled/traces/ \
    --out baselines/ \
    --min-runs 3

The --min-runs flag ensures each intent has enough samples for stable bounds. By default, the baseline manager keeps a rolling window of the last 10 successful runs per intent; set SPOOLED_BASELINE_WINDOW_SIZE to change the window size (maximum 100).
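The rolling window and the minimum-sample rule can be sketched roughly as follows. This is an illustration only: the latency_bounds function is hypothetical, and the percentile interpolation method is an assumption, since the method Spooled actually uses is not stated here.

```python
from collections import deque
from statistics import mean, quantiles

WINDOW_SIZE = 10  # documented default; SPOOLED_BASELINE_WINDOW_SIZE overrides it
MIN_RUNS = 3      # mirrors --min-runs 3 from the command above


def latency_bounds(latencies_ms):
    """Return latency bounds, or None when there are fewer than MIN_RUNS samples."""
    if len(latencies_ms) < MIN_RUNS:
        return None
    data = sorted(latencies_ms)
    # quantiles(n=100) yields 99 cut points; index k-1 holds the k-th percentile.
    # The "inclusive" method is an assumption, not Spooled's documented choice.
    pct = quantiles(data, n=100, method="inclusive")
    return {
        "min_ms": data[0],
        "p5_ms": pct[4],
        "p50_ms": pct[49],
        "p95_ms": pct[94],
        "max_ms": data[-1],
        "avg_ms": mean(data),
    }


# A deque with maxlen drops the oldest run once the window is full.
window = deque(maxlen=WINDOW_SIZE)
for latency in range(100, 1300, 100):  # 12 runs: 100, 200, ..., 1200
    window.append(latency)

bounds = latency_bounds(list(window))
print(len(window), bounds)
```

Only the 10 most recent runs contribute to the bounds, so the two oldest latencies (100 and 200) no longer influence the result.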

Comparison statuses

Status           | Description                                     | Action
-----------------|-------------------------------------------------|--------------------------------------
match            | Fingerprint exactly matches a baseline intent   | Pass: expected behavior
accepted_variant | Matches an accepted fingerprint variant         | Pass: previously approved change
variant          | Similar (≥75%) but not exact match              | Review: may need acceptance or fix
new              | Structurally different from all intents         | Review: new execution pattern
no_baseline      | No baseline exists for this agent               | Generate a baseline first
structural_match | Matches via IO-schema hash (rename-resilient)   | Pass: same structure, different names
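The statuses above suggest a decision order along the following lines. This sketch is a guess at the logic, not Spooled's implementation: the similarity measure (difflib's sequence ratio over step names), the per-intent "steps" field, and the order of checks are all assumptions, and structural_match is left out because the IO-schema hash is not described here.

```python
from difflib import SequenceMatcher

VARIANT_THRESHOLD = 0.75  # the documented "similar (>=75%)" cutoff


def similarity(steps_a, steps_b):
    """Assumed similarity measure: ratio of matching steps between two runs."""
    return SequenceMatcher(None, steps_a, steps_b).ratio()


def classify(run_fingerprint, run_steps, baseline):
    """Return a comparison status for one run (hypothetical helper)."""
    if not baseline or not baseline.get("intents"):
        return "no_baseline"
    intents = list(baseline["intents"].values())
    # Exact matches take priority over accepted variants.
    if any(run_fingerprint == i["fingerprint_hash"] for i in intents):
        return "match"
    if any(run_fingerprint in i.get("accepted_fingerprints", []) for i in intents):
        return "accepted_variant"
    # "steps" is a hypothetical field used only to compute similarity here.
    best = max(similarity(run_steps, i.get("steps", [])) for i in intents)
    return "variant" if best >= VARIANT_THRESHOLD else "new"


example_baseline = {
    "intents": {
        "intent_a": {
            "fingerprint_hash": "fp_a",
            "accepted_fingerprints": ["fp_a", "fp_b"],
            "steps": ["plan", "search", "search", "summarize"],
        }
    }
}

print(classify("fp_a", [], example_baseline))
print(classify("fp_b", [], example_baseline))
print(classify("fp_x", ["plan", "search", "summarize"], example_baseline))
print(classify("fp_y", ["translate", "email"], example_baseline))
print(classify("fp_a", [], None))
```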

Accepting variants

When a variant is intentional (e.g., after a prompt update), accept it:

spooled ci accept-variant \
    --intent a1b2c3d4 \
    --fingerprint e5f6g7h8 \
    --baseline baselines/my_agent.json \
    --reason "prompt update" \
    --by "andy"

This adds the fingerprint to the intent's accepted_fingerprints list with provenance metadata (who accepted, when, why).
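The effect on the baseline file might look roughly like the sketch below. The accept_variant function, the prefix lookup of the intent key, and the acceptance_log field with its key names are all hypothetical; the actual on-disk layout of the provenance metadata is not specified here.

```python
from datetime import datetime, timezone


def accept_variant(baseline, intent_prefix, fingerprint, reason, accepted_by):
    """Add a fingerprint to an intent's accepted list and record who, when, why."""
    # Assumption: the intent can be addressed by an unambiguous key prefix.
    keys = [k for k in baseline["intents"] if k.startswith(intent_prefix)]
    if len(keys) != 1:
        raise ValueError(f"expected exactly one intent for {intent_prefix!r}, found {len(keys)}")
    intent = baseline["intents"][keys[0]]
    accepted = intent.setdefault("accepted_fingerprints", [])
    if fingerprint not in accepted:
        accepted.append(fingerprint)
        # "acceptance_log" is a hypothetical field name for the provenance record.
        intent.setdefault("acceptance_log", []).append({
            "fingerprint": fingerprint,
            "accepted_by": accepted_by,
            "accepted_at": datetime.now(timezone.utc).isoformat(),
            "reason": reason,
        })
    return intent


demo_baseline = {
    "agent_id": "my_agent",
    "intents": {
        "a1b2c3d4e5f6": {
            "fingerprint_hash": "a1b2c3d4",
            "accepted_fingerprints": ["a1b2c3d4"],
        }
    },
}

updated = accept_variant(demo_baseline, "a1b2c3d4", "e5f6g7h8",
                         reason="prompt update", accepted_by="andy")
print(updated["accepted_fingerprints"])
```

Accepting the same fingerprint twice leaves the list unchanged, which keeps repeated runs of the command from producing duplicate entries in this sketch.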

Git workflow

Commit baselines to git alongside your code. Doing so lets you:

  • Review baseline changes in PRs — see exactly what behavioral patterns changed
  • Diff baselines across branches
  • Version behavior with the same rigor as code
  • Roll back baselines with git revert
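Because baselines are plain JSON, two versions (for example, one from each branch) can also be compared semantically rather than line by line. The diff_baselines helper below is a hypothetical illustration of that idea, not a Spooled command.

```python
def diff_baselines(old, new):
    """Summarize intent-level changes between two baseline dicts (hypothetical helper)."""
    old_intents = old.get("intents", {})
    new_intents = new.get("intents", {})
    newly_accepted = {}
    # For intents present in both versions, list fingerprints accepted since `old`.
    for key in sorted(set(old_intents) & set(new_intents)):
        before = set(old_intents[key].get("accepted_fingerprints", []))
        after = set(new_intents[key].get("accepted_fingerprints", []))
        extra = sorted(after - before)
        if extra:
            newly_accepted[key] = extra
    return {
        "added_intents": sorted(set(new_intents) - set(old_intents)),
        "removed_intents": sorted(set(old_intents) - set(new_intents)),
        "newly_accepted": newly_accepted,
    }


main_version = {"intents": {"intent_a": {"accepted_fingerprints": ["fp_a"]}}}
branch_version = {
    "intents": {
        "intent_a": {"accepted_fingerprints": ["fp_a", "fp_b"]},
        "intent_c": {"accepted_fingerprints": ["fp_c"]},
    }
}

changes = diff_baselines(main_version, branch_version)
print(changes)
```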

Spooled Score

The Spooled Score is a composite stability metric on a 0–100 scale, computed from four weighted components:

Component                               | Weight
----------------------------------------|-------
Structural (fingerprint match)          | 40%
Signal health (detected signals)        | 25%
Metric stability (latency/tokens)       | 20%
Trend consistency (last 10 comparisons) | 15%

Grades: A (90–100), B (75–89), C (60–74), D (40–59), F (0–39).
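As a worked example of the weighting, the sketch below combines four component scores into a single score and maps it to a grade. It assumes each component is itself scored on a 0–100 scale and that the composite is rounded to the nearest integer before grading; neither detail is stated above, and the function and key names are hypothetical.

```python
# Weights in percent, taken from the component table above.
WEIGHTS = {
    "structural": 40,
    "signal_health": 25,
    "metric_stability": 20,
    "trend_consistency": 15,
}

# Lower bound of each grade band, highest first.
GRADE_FLOORS = [(90, "A"), (75, "B"), (60, "C"), (40, "D"), (0, "F")]


def spooled_score(components):
    """Weighted average of component scores (each assumed to be on a 0-100 scale)."""
    return sum(WEIGHTS[name] * components[name] for name in WEIGHTS) / 100


def grade(score):
    """Map a score to a letter grade; rounding before grading is an assumption."""
    rounded = round(score)
    for floor, letter in GRADE_FLOORS:
        if rounded >= floor:
            return letter
    return "F"


example = {
    "structural": 100,
    "signal_health": 80,
    "metric_stability": 70,
    "trend_consistency": 60,
}

score = spooled_score(example)
print(score, grade(score))  # (40*100 + 25*80 + 20*70 + 15*60) / 100 = 83.0 -> B
```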