CI/CD Integration

Run behavioral testing on every pull request. Spooled compares agent execution traces against baselines and blocks merges on policy violations.

GitHub Action

```yaml
# .github/workflows/agent-ci.yml
name: Agent CI
on: [pull_request]

jobs:
  behavioral-ci:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: spooled-ai/action@v1
        with:
          license-key: ${{ secrets.SPOOLED_API_KEY }}
          baselines: .github/baselines
          test-command: python -m pytest tests/agents/
          policy: spooled-policy.yml
          blocking: true
          post-comment: true
```

Action inputs

| Input | Default | Description |
| --- | --- | --- |
| `license-key` | `""` | Pro license key. Only needed for backend features (`--push`). All comparison features work without it. |
| `baselines` | `.github/baselines` | Path to baseline directory or file |
| `test-command` | `""` | Command to generate traces (leave empty if traces already exist) |
| `trace-dir` | `.spooled/traces` | Directory containing trace JSONL files |
| `policy` | `""` | Path to policy YAML file |
| `blocking` | `true` | Fail the check on policy violations |
| `setup-python` | `true` | Set up Python automatically |
| `python-version` | `3.10` | Python version to install |
| `extra-deps` | `""` | Path to an additional requirements.txt |
| `post-comment` | `true` | Post or update a PR comment with the report |
| `push-report` | `false` | Push the CI report to the Spooled backend |
| `spooled-version` | `""` | Specific PyPI version to install |

Action outputs

| Output | Description |
| --- | --- |
| `result` | `PASS` or `FAIL` |
| `total` | Total traces analyzed |
| `passed` | Traces matching baseline |
| `new-behavior` | Traces with new fingerprints |
| `policy-failures` | Traces with policy violations |
| `report-path` | Path to the generated `report.md` |
| `summary-path` | Path to the generated `summary.json` |
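Downstream CI steps can consume `summary.json` directly. A minimal sketch of such a consumer, assuming the file mirrors the action outputs above (the field names are an assumption here, not a documented schema — check your generated report):

```python
import json

def gate_on_summary(path: str) -> bool:
    """Return True when the Spooled summary indicates a clean run.

    Assumes summary.json carries the same fields as the action outputs
    (result, policy-failures); verify against a real summary.json.
    """
    with open(path) as f:
        summary = json.load(f)
    return summary.get("result") == "PASS" and summary.get("policy-failures", 0) == 0
```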

CLI commands

For non-GitHub CI systems, use the CLI directly:

spooled ci run

Full orchestration: run tests → capture traces → compare → report.

```shell
spooled ci run \
    --suite tests/agents/ \
    --baseline baselines/ \
    --policy spooled-policy.yml \
    --out ci-report/
```

spooled ci compare

Compare a single trace against a baseline (free tier):

```shell
spooled ci compare .spooled/traces/agent-runid.jsonl \
    --baseline baselines/my_agent.json \
    --agent-id my_agent
```

spooled ci batch-compare

Compare all traces in a directory against baselines:

```shell
spooled ci batch-compare \
    --trace-dir .spooled/traces/ \
    --baseline baselines/ \
    --policy spooled-policy.yml
```

Exit codes

| Code | Meaning |
| --- | --- |
| `0` | Pass — all agents matched or no violations |
| `1` | Blocked — policy violation detected |
| `2` | Runtime error — configuration or execution failure |
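In a custom wrapper for other CI systems, these exit codes can be mapped to a pipeline decision. A minimal Python sketch, assuming you invoke the CLI yourself via `subprocess` (the status labels are illustrative):

```python
import subprocess

# Map Spooled's documented exit codes to pipeline outcomes.
EXIT_STATUS = {0: "pass", 1: "blocked", 2: "runtime-error"}

def run_batch_compare(args: list) -> str:
    """Run `spooled ci batch-compare` and classify the exit code.

    Unknown codes are treated as runtime errors so the pipeline fails closed.
    """
    proc = subprocess.run(["spooled", "ci", "batch-compare", *args])
    return EXIT_STATUS.get(proc.returncode, "runtime-error")
```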

Baseline workflow

The typical CI workflow:

  • First run — no baseline exists, all traces are `no_baseline`. Generate one.
  • Normal PR — traces match the baseline, CI passes.
  • Regression — a trace is a variant; the policy blocks the merge.
  • Intentional change — accept the variant, update the baseline, commit it to git.

Two-pass verification

Metric-sensitive signals (latency, tokens) can vary between runs. Use `--retries 2` for two-pass verification:

```shell
spooled ci run --suite tests/ --baseline baselines/ --retries 2
```

First pass detects structural and signal issues. If only metric-sensitive signals fired, the agent is rerun and metric signals are re-evaluated on the second pass.
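The rerun decision boils down to: if every fired signal is metric-sensitive, a second pass might clear it; any structural signal fails immediately. A hypothetical sketch of that logic (the signal names are illustrative, not Spooled's actual taxonomy):

```python
# Signals whose values legitimately vary between runs.
METRIC_SENSITIVE = {"latency", "tokens"}

def should_rerun(fired_signals: set) -> bool:
    """True when a second pass could clear the failure.

    Structural signals (e.g. a changed tool-call sequence) are deterministic,
    so rerunning cannot help; only purely metric-driven failures get a retry.
    """
    return bool(fired_signals) and fired_signals <= METRIC_SENSITIVE
```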

Tip
Commit your baselines to .github/baselines/ and review baseline changes in PRs — just like code changes.

CI integration patterns

AI agents are non-deterministic and expensive to run. There are five ways to integrate Spooled into CI, each with different cost and reliability tradeoffs.

Pattern 1: Real LLM calls in CI

The CI runner executes your agent with real LLM API calls on every PR. Traces are generated fresh and compared against baselines.

```yaml
# .github/workflows/spooled.yml
- name: Generate traces
  env:
    OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
  run: python ci_runner.py

- uses: spooled-ai/action@v1
  with:
    baselines: .github/baselines
```
Details

  • Cost per PR: $0.01–$5.00 depending on model and agent complexity
  • Catches behavioral drift? Yes — real LLM responses drive real execution paths
  • Flakiness risk: High — non-determinism, rate limits, network timeouts
  • Latency risk: High — baseline generated on Mac, CI on Linux = false positive
  • Best for: Prototyping, cheap models, getting started
Warning
If your baseline was generated locally and CI runs on a different machine, latency signals will fire as false positives. Spooled detects this and downgrades latency signals to info severity with an [env mismatch] annotation.

Pattern 2: Replay with mocks

Use spooled replay to generate a pytest file that replays a captured trace with mocked LLM responses.

```shell
spooled replay .spooled/traces/abc-123.jsonl
pytest replay_abc_123.py
```
Details

  • Cost per PR: $0 — no LLM calls
  • Catches behavioral drift? No — mocked responses lock in the baseline behavior
  • Best for: Code regression testing only (e.g., a tool function signature changed)
Warning
Replay does not catch behavioral drift. If the model starts making different decisions, mocks will hide it. Use replay for code integration testing, not behavioral testing.
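Conceptually, a generated replay test pins the LLM layer with a mock so only the surrounding code is exercised. A hand-written sketch of the idea (the toy agent, mock target, and canned response are hypothetical; the file `spooled replay` actually generates will differ):

```python
from unittest.mock import MagicMock

def run_agent(llm) -> str:
    """Toy agent: asks the LLM which tool to call, then 'calls' it."""
    decision = llm("Which tool should I use?")
    return f"called:{decision}"

def test_replay_with_mocked_llm():
    # Canned response captured from the original trace.
    llm = MagicMock(return_value="search_tool")
    assert run_agent(llm) == "called:search_tool"
    llm.assert_called_once()
```

Because the mock always returns the captured response, a test like this catches code-level regressions (a renamed tool, a changed signature) but never a model that starts deciding differently.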

Pattern 3: Hybrid

Run mocked replays on every PR for fast feedback. Schedule real LLM runs nightly or weekly to catch behavioral drift.

Details

  • Cost per PR: $0 (replay) + scheduled cost for nightly runs
  • Catches behavioral drift? Yes — but with delayed detection (nightly, not per-PR)
  • Best for: Cost-sensitive teams who accept delayed regression alerts
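One way to wire the hybrid pattern into a single workflow is to branch on the trigger: GitHub Actions sets `GITHUB_EVENT_NAME` to `pull_request` for PRs and `schedule` for cron runs. A minimal sketch of a runner that picks the mode (the mode names are illustrative):

```python
import os
from typing import Optional

def ci_mode(event_name: Optional[str] = None) -> str:
    """Pick mocked replay for PRs, real LLM calls for scheduled runs."""
    event = event_name or os.environ.get("GITHUB_EVENT_NAME", "")
    return "real" if event == "schedule" else "replay"
```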

Pattern 4: External trace generation (Pro)

Instrument your staging or production environment with spooled.init(). Traces flow to the Spooled backend continuously. At PR time, the action fetches recent traces and compares them against baselines — zero LLM calls in CI.

```yaml
# Coming soon — Pro feature
- uses: spooled-ai/action@v1
  with:
    fetch-from-backend: true
    agent-id: deal_agent
    since: 24h
    baselines: .github/baselines
```
Details

  • Cost per PR: $0 in CI — traces already captured in staging
  • Catches behavioral drift? Yes — using real production/staging behavior
  • Flakiness risk: Zero — no network calls to LLM providers in CI
  • Latency risk: Zero — traces come from the same environment as baselines
  • Best for: Production teams, expensive models, regulated industries
Note
This is the recommended approach for production deployments. Requires a Pro plan for backend trace ingest.

Pattern 5: Sampled real calls

Run your agent against a small, fixed set of test inputs (3–5 scenarios) with real LLM calls. Keeps costs low while still catching behavioral drift.

```python
# ci_runner.py — fixed test inputs
CI_COMPANIES = ["CloudMesh", "NovaPay", "GreenGrid Energy"]
CI_MODEL = "gpt-4o-mini"

for company in CI_COMPANIES:
    spooled.init(agent_id="deal_agent")
    run_agent(company, model=CI_MODEL)
    spooled.shutdown()
```
Details

  • Cost per PR: $0.005–$0.50 depending on model
  • Catches behavioral drift? Yes — real LLM responses, just fewer of them
  • Best for: Recommended default when Pattern 4 is not yet set up

Which pattern should I use?

  • Just getting started? Use Pattern 5 (sampled real calls) — cheapest real test.
  • Cost-sensitive team? Use Pattern 3 (hybrid) — replay on PR, real calls weekly.
  • Production deployment? Use Pattern 4 (external traces) — zero CI cost, real behavior.
  • Code regression only? Use Pattern 2 (replay) — fast, free, deterministic.