# CI/CD Integration
Run behavioral testing on every pull request. Spooled compares agent execution traces against baselines and blocks merges on policy violations.
## GitHub Action
```yaml
name: Agent CI
on: [pull_request]
jobs:
  behavioral-ci:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: spooled-ai/action@v1
        with:
          license-key: ${{ secrets.SPOOLED_API_KEY }}
          baselines: .github/baselines
          test-command: python -m pytest tests/agents/
          policy: spooled-policy.yml
          blocking: true
          post-comment: true
```
## Action inputs
| Input | Default | Description |
|---|---|---|
| license-key | "" | Pro license key. Only needed for backend features (`--push`). All comparison features work without it. |
| baselines | .github/baselines | Path to baseline directory or file |
| test-command | "" | Command to generate traces (leave empty if traces already exist) |
| trace-dir | .spooled/traces | Directory containing trace JSONL files |
| policy | "" | Path to policy YAML file |
| blocking | true | Fail the check on policy violations |
| setup-python | true | Set up Python automatically |
| python-version | 3.10 | Python version to install |
| extra-deps | "" | Path to additional requirements.txt |
| post-comment | true | Post or update a PR comment with the report |
| push-report | false | Push CI report to Spooled backend |
| spooled-version | "" | Specific PyPI version to install |
## Action outputs
| Output | Description |
|---|---|
| result | PASS or FAIL |
| total | Total traces analyzed |
| passed | Traces matching baseline |
| new-behavior | Traces with new fingerprints |
| policy-failures | Traces with policy violations |
| report-path | Path to generated report.md |
| summary-path | Path to generated summary.json |
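These outputs can feed later workflow steps. A sketch, assuming the action step is given `id: spooled` and using the standard `actions/upload-artifact` action to publish the report (both the step id and the artifact name are this example's choices, not requirements of Spooled):

```yaml
- uses: spooled-ai/action@v1
  id: spooled   # id is an assumption made for this example
  with:
    baselines: .github/baselines
- name: Upload behavioral report
  if: always()   # upload even when the check fails
  uses: actions/upload-artifact@v4
  with:
    name: spooled-report
    path: ${{ steps.spooled.outputs.report-path }}
```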
## CLI commands
For non-GitHub CI systems, use the CLI directly:
### `spooled ci run`

Full orchestration: run tests → capture traces → compare → report.

```shell
spooled ci run \
  --suite tests/agents/ \
  --baseline baselines/ \
  --policy spooled-policy.yml \
  --out ci-report/
```

### `spooled ci compare`
Compare a single trace against a baseline (free tier):
```shell
spooled ci compare .spooled/traces/agent-runid.jsonl \
  --baseline baselines/my_agent.json \
  --agent-id my_agent
```
### `spooled ci batch-compare`
Compare all traces in a directory against baselines:
```shell
spooled ci batch-compare \
  --trace-dir .spooled/traces/ \
  --baseline baselines/ \
  --policy spooled-policy.yml
```
## Exit codes
| Code | Meaning |
|---|---|
| 0 | Pass — all agents matched or no violations |
| 1 | Blocked — policy violation detected |
| 2 | Runtime error — configuration or execution failure |
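In a generic CI script, these exit codes can be consumed programmatically. A sketch wrapping `spooled ci batch-compare` with Python's `subprocess`; only the code-to-meaning mapping comes from the table above — the message strings and function names are this example's own:

```python
import subprocess

# Documented spooled exit codes (see the table above); message text is ours.
EXIT_MESSAGES = {
    0: "pass: all agents matched or no violations",
    1: "blocked: policy violation detected",
    2: "error: configuration or execution failure",
}

def describe_exit(code: int) -> str:
    """Translate a spooled exit code into a human-readable CI log line."""
    return EXIT_MESSAGES.get(code, f"unknown exit code: {code}")

def run_batch_compare() -> int:
    """Run the comparison and surface its exit code to the caller."""
    result = subprocess.run(
        ["spooled", "ci", "batch-compare",
         "--trace-dir", ".spooled/traces/",
         "--baseline", "baselines/",
         "--policy", "spooled-policy.yml"],
    )
    print(describe_exit(result.returncode))
    return result.returncode
```

A CI system that treats any nonzero exit as failure gets blocking behavior for free; the mapping is only needed for clearer log output.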
## Baseline workflow
The typical CI workflow:
- First run — no baseline exists; all traces are `no_baseline`. Generate one.
- Normal PR — traces match the baseline, CI passes.
- Regression — trace is a variant, policy blocks the merge.
- Intentional change — accept the variant, update the baseline, commit to git.
## Two-pass verification
Metric-sensitive signals (latency, tokens) can vary between runs. Use `--retries 2` for two-pass verification:

```shell
spooled ci run --suite tests/ --baseline baselines/ --retries 2
```
First pass detects structural and signal issues. If only metric-sensitive signals fired, the agent is rerun and metric signals are re-evaluated on the second pass.
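The retry decision can be pictured as follows. This is an illustrative sketch, not Spooled's actual implementation, and the signal names are invented for the example:

```python
# Hypothetical signal names; Spooled's real identifiers may differ.
METRIC_SENSITIVE = {"latency_ms", "token_count"}

def needs_second_pass(failed_signals: set[str]) -> bool:
    """A rerun is only useful when every failure could be run-to-run
    metric noise; any structural failure blocks on the first pass."""
    return bool(failed_signals) and failed_signals <= METRIC_SENSITIVE
```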
Commit baselines to `.github/baselines/` and review baseline changes in PRs — just like code changes.

# CI integration patterns
AI agents are non-deterministic and expensive to run. There are five ways to integrate Spooled into CI, each with different cost and reliability tradeoffs.
## Pattern 1: Real LLM calls in CI
The CI runner executes your agent with real LLM API calls on every PR. Traces are generated fresh and compared against baselines.
```yaml
# .github/workflows/spooled.yml
- name: Generate traces
  env:
    OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
  run: python ci_runner.py
- uses: spooled-ai/action@v1
  with:
    baselines: .github/baselines
```
| Tradeoff | Details |
|---|---|
| Cost per PR | $0.01–$5.00 depending on model and agent complexity |
| Catches behavioral drift? | Yes — real LLM responses drive real execution paths |
| Flakiness risk | High — non-determinism, rate limits, network timeouts |
| Latency risk | High — baseline generated on Mac, CI on Linux = false positive |
| Best for | Prototyping, cheap models, getting started |
Cross-environment latency differences are reported at `info` severity with an `[env mismatch]` annotation.

## Pattern 2: Replay with mocks
Use spooled replay to generate a pytest file that replays a captured trace with mocked LLM responses.
```shell
spooled replay .spooled/traces/abc-123.jsonl
pytest replay_abc_123.py
```
| Tradeoff | Details |
|---|---|
| Cost per PR | $0 — no LLM calls |
| Catches behavioral drift? | No — mocked responses lock in the baseline behavior |
| Best for | Code regression testing only (e.g., tool function signature changed) |
## Pattern 3: Hybrid
Run mocked replays on every PR for fast feedback. Schedule real LLM runs nightly or weekly to catch behavioral drift.
| Tradeoff | Details |
|---|---|
| Cost per PR | $0 (replay) + scheduled cost for nightly runs |
| Catches behavioral drift? | Yes — but with delayed detection (nightly, not per-PR) |
| Best for | Cost-sensitive teams who accept delayed regression alerts |
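The split can be expressed with two triggers in one workflow. In this sketch the cron time, the `tests/replays/` path, and the per-trigger step layout are illustrative assumptions, not Spooled requirements:

```yaml
# Sketch: replays run on every PR; real LLM trace generation only nightly.
on:
  pull_request:
  schedule:
    - cron: "0 2 * * *"   # 02:00 UTC nightly (illustrative)
jobs:
  behavioral-ci:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Replay with mocks (every PR)
        if: github.event_name == 'pull_request'
        run: pytest tests/replays/   # path is an assumption
      - name: Generate real traces (nightly only)
        if: github.event_name == 'schedule'
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
        run: python ci_runner.py
      - uses: spooled-ai/action@v1
        with:
          baselines: .github/baselines
```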
## Pattern 4: External trace generation (Pro)
Instrument your staging or production environment with spooled.init(). Traces flow to the Spooled backend continuously. At PR time, the action fetches recent traces and compares them against baselines — zero LLM calls in CI.
```yaml
# Coming soon — Pro feature
- uses: spooled-ai/action@v1
  with:
    fetch-from-backend: true
    agent-id: deal_agent
    since: 24h
    baselines: .github/baselines
```
| Tradeoff | Details |
|---|---|
| Cost per PR | $0 in CI — traces already captured in staging |
| Catches behavioral drift? | Yes — using real production/staging behavior |
| Flakiness risk | Zero — no network calls to LLM providers in CI |
| Latency risk | Zero — traces come from the same environment as baselines |
| Best for | Production teams, expensive models, regulated industries |
## Pattern 5: Sampled real calls
Run your agent against a small, fixed set of test inputs (3–5 scenarios) with real LLM calls. Keeps costs low while still catching behavioral drift.
```python
# ci_runner.py — fixed test inputs
import spooled

from my_agent import run_agent  # your agent entrypoint (placeholder module)

CI_COMPANIES = ["CloudMesh", "NovaPay", "GreenGrid Energy"]
CI_MODEL = "gpt-4o-mini"

for company in CI_COMPANIES:
    spooled.init(agent_id="deal_agent")
    run_agent(company, model=CI_MODEL)
    spooled.shutdown()
```
| Tradeoff | Details |
|---|---|
| Cost per PR | $0.005–$0.50 depending on model |
| Catches behavioral drift? | Yes — real LLM responses, just fewer of them |
| Best for | Recommended default when Pattern 4 is not yet set up |
## Which pattern should I use?
- Just getting started? Use Pattern 5 (sampled real calls) — cheapest real test.
- Cost-sensitive team? Use Pattern 3 (hybrid) — replay on PR, real calls weekly.
- Production deployment? Use Pattern 4 (external traces) — zero CI cost, real behavior.
- Code regression only? Use Pattern 2 (replay) — fast, free, deterministic.