CI/CD Integration

Run behavioral testing on every pull request. Spooled compares agent execution traces against baselines and blocks merges on policy violations.

GitHub Action

```yaml
# .github/workflows/agent-ci.yml
name: Agent CI
on: [pull_request]

jobs:
  behavioral-ci:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: spooled-ai/action@v1
        with:
          license-key: ${{ secrets.SPOOLED_API_KEY }}
          baselines: .github/baselines
          test-command: python -m pytest tests/agents/
          policy: spooled-policy.yml
          blocking: true
          post-comment: true
```

Action inputs

| Input | Default | Description |
| --- | --- | --- |
| `license-key` | `""` | Pro license key. Only needed for backend features (`--push`). All comparison features work without it. |
| `baselines` | `.github/baselines` | Path to baseline directory or file |
| `test-command` | `""` | Command to generate traces (leave empty if traces already exist) |
| `trace-dir` | `.spooled/traces` | Directory containing trace JSONL files |
| `policy` | `""` | Path to policy YAML file |
| `blocking` | `true` | Fail the check on policy violations |
| `setup-python` | `true` | Set up Python automatically |
| `python-version` | `3.10` | Python version to install |
| `extra-deps` | `""` | Path to an additional requirements.txt |
| `post-comment` | `true` | Post or update a PR comment with the report |
| `push-report` | `false` | Push the CI report to the Spooled backend |
| `spooled-version` | `""` | Specific PyPI version to install |

Action outputs

| Output | Description |
| --- | --- |
| `result` | `PASS` or `FAIL` |
| `total` | Total traces analyzed |
| `passed` | Traces matching baseline |
| `new-behavior` | Traces with new fingerprints |
| `policy-failures` | Traces with policy violations |
| `report-path` | Path to the generated `report.md` |
| `summary-path` | Path to the generated `summary.json` |
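Downstream CI steps can consume `summary.json` directly. A minimal sketch of such a consumer, assuming the file mirrors the action outputs above (the field names are an assumption here, not a documented schema — check your generated report):

```python
import json

def gate_on_summary(path: str) -> bool:
    """Return True when the Spooled summary indicates a clean run.

    Assumes summary.json carries the same fields as the action outputs
    (result, policy-failures); verify against a real summary.json.
    """
    with open(path) as f:
        summary = json.load(f)
    return summary.get("result") == "PASS" and summary.get("policy-failures", 0) == 0
```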

CLI commands

For non-GitHub CI systems, use the CLI directly:

spooled ci run

Full orchestration: run tests → capture traces → compare → report.

```shell
spooled ci run \
    --suite tests/agents/ \
    --baseline baselines/ \
    --policy spooled-policy.yml \
    --out ci-report/
```

spooled ci compare

Compare a single trace against a baseline (free tier):

```shell
spooled ci compare .spooled/traces/agent-runid.jsonl \
    --baseline baselines/my_agent.json \
    --agent-id my_agent
```

spooled ci batch-compare

Compare all traces in a directory against baselines:

```shell
spooled ci batch-compare \
    --trace-dir .spooled/traces/ \
    --baseline baselines/ \
    --policy spooled-policy.yml
```

Exit codes

| Code | Meaning |
| --- | --- |
| `0` | Pass — all agents matched or no violations |
| `1` | Blocked — policy violation detected |
| `2` | Runtime error — configuration or execution failure |
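In a custom wrapper for other CI systems, these exit codes can be mapped to a pipeline decision. A minimal Python sketch, assuming you invoke the CLI yourself via `subprocess` (the status labels are illustrative):

```python
import subprocess

# Map Spooled's documented exit codes to pipeline outcomes.
EXIT_STATUS = {0: "pass", 1: "blocked", 2: "runtime-error"}

def run_batch_compare(args: list) -> str:
    """Run `spooled ci batch-compare` and classify the exit code.

    Unknown codes are treated as runtime errors so the pipeline fails closed.
    """
    proc = subprocess.run(["spooled", "ci", "batch-compare", *args])
    return EXIT_STATUS.get(proc.returncode, "runtime-error")
```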

Baseline workflow

The typical CI workflow:

  • First run — no baseline exists, all traces are `no_baseline`. Generate one.
  • Normal PR — traces match the baseline, CI passes.
  • Regression — a trace is a variant; the policy blocks the merge.
  • Intentional change — accept the variant, update the baseline, commit it to git.

Two-pass verification

Metric-sensitive signals (latency, tokens) can vary between runs. Use `--retries 2` for two-pass verification:

```shell
spooled ci run --suite tests/ --baseline baselines/ --retries 2
```

First pass detects structural and signal issues. If only metric-sensitive signals fired, the agent is rerun and metric signals are re-evaluated on the second pass.
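The rerun decision boils down to: if every fired signal is metric-sensitive, a second pass might clear it; any structural signal fails immediately. A hypothetical sketch of that logic (the signal names are illustrative, not Spooled's actual taxonomy):

```python
# Signals whose values legitimately vary between runs.
METRIC_SENSITIVE = {"latency", "tokens"}

def should_rerun(fired_signals: set) -> bool:
    """True when a second pass could clear the failure.

    Structural signals (e.g. a changed tool-call sequence) are deterministic,
    so rerunning cannot help; only purely metric-driven failures get a retry.
    """
    return bool(fired_signals) and fired_signals <= METRIC_SENSITIVE
```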

Tip
Commit your baselines to .github/baselines/ and review baseline changes in PRs — just like code changes.

CI integration patterns

AI agents are non-deterministic and expensive to run. There are five ways to integrate Spooled into CI, each with different cost and reliability tradeoffs.

Pattern 1: Real LLM calls in CI

The CI runner executes your agent with real LLM API calls on every PR. Traces are generated fresh and compared against baselines.

```yaml
# .github/workflows/spooled.yml
- name: Generate traces
  env:
    OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
  run: python ci_runner.py

- uses: spooled-ai/action@v1
  with:
    baselines: .github/baselines
```
Details

  • Cost per PR: $0.01–$5.00 depending on model and agent complexity
  • Catches behavioral drift? Yes — real LLM responses drive real execution paths
  • Flakiness risk: High — non-determinism, rate limits, network timeouts
  • Latency risk: High — baseline generated on Mac, CI on Linux = false positive
  • Best for: Prototyping, cheap models, getting started
Warning
If your baseline was generated locally and CI runs on a different machine, latency signals will fire as false positives. Spooled detects this and downgrades latency signals to info severity with an [env mismatch] annotation.

Pattern 2: Replay with mocks

Use spooled replay to generate a pytest file that replays a captured trace with mocked LLM responses.

```shell
spooled replay .spooled/traces/abc-123.jsonl
pytest replay_abc_123.py
```
Details

  • Cost per PR: $0 — no LLM calls
  • Catches behavioral drift? No — mocked responses lock in the baseline behavior
  • Best for: Code regression testing only (e.g., a tool function signature changed)
Warning
Replay does not catch behavioral drift. If the model starts making different decisions, mocks will hide it. Use replay for code integration testing, not behavioral testing.
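Conceptually, a generated replay test pins the LLM layer with a mock so only the surrounding code is exercised. A hand-written sketch of the idea (the toy agent, mock target, and canned response are hypothetical; the file `spooled replay` actually generates will differ):

```python
from unittest.mock import MagicMock

def run_agent(llm) -> str:
    """Toy agent: asks the LLM which tool to call, then 'calls' it."""
    decision = llm("Which tool should I use?")
    return f"called:{decision}"

def test_replay_with_mocked_llm():
    # Canned response captured from the original trace.
    llm = MagicMock(return_value="search_tool")
    assert run_agent(llm) == "called:search_tool"
    llm.assert_called_once()
```

Because the mock always returns the captured response, a test like this catches code-level regressions (a renamed tool, a changed signature) but never a model that starts deciding differently.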

Pattern 3: Hybrid

Run mocked replays on every PR for fast feedback. Schedule real LLM runs nightly or weekly to catch behavioral drift.

Details

  • Cost per PR: $0 (replay) + scheduled cost for nightly runs
  • Catches behavioral drift? Yes — but with delayed detection (nightly, not per-PR)
  • Best for: Cost-sensitive teams who accept delayed regression alerts
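One way to wire the hybrid pattern into a single workflow is to branch on the trigger: GitHub Actions sets `GITHUB_EVENT_NAME` to `pull_request` for PRs and `schedule` for cron runs. A minimal sketch of a runner that picks the mode (the mode names are illustrative):

```python
import os
from typing import Optional

def ci_mode(event_name: Optional[str] = None) -> str:
    """Pick mocked replay for PRs, real LLM calls for scheduled runs."""
    event = event_name or os.environ.get("GITHUB_EVENT_NAME", "")
    return "real" if event == "schedule" else "replay"
```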

Pattern 4: External trace generation (Pro)

Instrument your staging or production environment with spooled.init(). Traces flow to the Spooled backend continuously. At PR time, the action fetches recent traces and compares them against baselines — zero LLM calls in CI.

```yaml
# Coming soon — Pro feature
- uses: spooled-ai/action@v1
  with:
    fetch-from-backend: true
    agent-id: deal_agent
    since: 24h
    baselines: .github/baselines
```
Details

  • Cost per PR: $0 in CI — traces already captured in staging
  • Catches behavioral drift? Yes — using real production/staging behavior
  • Flakiness risk: Zero — no network calls to LLM providers in CI
  • Latency risk: Zero — traces come from the same environment as baselines
  • Best for: Production teams, expensive models, regulated industries
Note
This is the recommended approach for production deployments. Requires a Pro plan for backend trace ingest.

Pattern 5: Sampled real calls

Run your agent against a small, fixed set of test inputs (3–5 scenarios) with real LLM calls. Keeps costs low while still catching behavioral drift.

```python
# ci_runner.py — fixed test inputs
CI_COMPANIES = ["CloudMesh", "NovaPay", "GreenGrid Energy"]
CI_MODEL = "gpt-4o-mini"

for company in CI_COMPANIES:
    spooled.init(agent_id="deal_agent")
    run_agent(company, model=CI_MODEL)
    spooled.shutdown()
```
Details

  • Cost per PR: $0.005–$0.50 depending on model
  • Catches behavioral drift? Yes — real LLM responses, just fewer of them
  • Best for: Recommended default when Pattern 4 is not yet set up

Which pattern should I use?

  • Just getting started? Use Pattern 5 (sampled real calls) — cheapest real test.
  • Cost-sensitive team? Use Pattern 3 (hybrid) — replay on PR, real calls weekly.
  • Production deployment? Use Pattern 4 (external traces) — zero CI cost, real behavior.
  • Code regression only? Use Pattern 2 (replay) — fast, free, deterministic.