Cost Signals
Spooled detects when a PR is about to make your agent runs significantly more expensive — before merge, attributed to the specific behavioral change that caused the spike. This page covers the mechanism, how to tune it, and where it shows up.
How it works
Every LLM call captured by Spooled's SDK includes structural usage data: prompt_tokens, completion_tokens, and cached_tokens (OpenAI), or input_tokens, output_tokens, cache_read_input_tokens, and cache_creation_input_tokens (Anthropic). The recorder reads the model name and these counts, looks up per-token pricing in spooled/pricing.py, and stamps a usd_cost field onto the interaction's structural metadata before the trace is persisted.
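A minimal sketch of the cost stamp described above, assuming a pricing table keyed by model name with per-million-token rates. The rate field names are borrowed from the override file format in the Pricing table section; `compute_usd_cost` and the `PRICING` dict are hypothetical stand-ins, not the contents of spooled/pricing.py, and the rates are illustrative. The sketch also assumes OpenAI-style usage, where cached tokens are counted inside prompt_tokens.

```python
from typing import Optional

# Illustrative per-million-token rates (assumed values, not Spooled's shipped table).
PRICING = {
    "gpt-4o": {
        "input_per_million": 2.50,
        "output_per_million": 10.00,
        "cached_input_per_million": 1.25,
    },
}


def compute_usd_cost(model: str, usage: dict) -> Optional[float]:
    """Return the USD cost of one LLM call, or None if the model is unpriced."""
    rates = PRICING.get(model)
    if rates is None:
        return None  # unknown model: no cost can be computed

    prompt = usage.get("prompt_tokens", 0)
    cached = usage.get("cached_tokens", 0)
    completion = usage.get("completion_tokens", 0)

    # Assumption: cached tokens are a subset of prompt_tokens, so only the
    # uncached remainder is billed at the full input rate.
    uncached = max(prompt - cached, 0)
    cached_rate = rates.get("cached_input_per_million", rates["input_per_million"])

    total = (
        uncached * rates["input_per_million"]
        + cached * cached_rate
        + completion * rates["output_per_million"]
    )
    return total / 1_000_000


usage = {"prompt_tokens": 1200, "cached_tokens": 200, "completion_tokens": 300}
cost = compute_usd_cost("gpt-4o", usage)  # 1000 uncached + 200 cached + 300 output
```

Returning None for an unpriced model, rather than 0.0, keeps "unknown cost" distinguishable from "free" further down the pipeline.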
At baseline-comparison time, the cost_spike signal aggregates usd_cost across every LLM call in a trace, looks up the intent-scoped baseline average from RunStatistics.avg_cost, and fires when the current run exceeds the threshold:
    current_total_usd > baseline_avg_cost_usd * (1 + threshold)

The default threshold is 0.30 (a 30% increase). It is tunable per project via the SPOOLED_COST_SPIKE_THRESHOLD env var or in spooled-policy.yml.
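The comparison can be sketched in a few lines of Python. `resolve_threshold` and `is_cost_spike` are hypothetical names used for illustration; the sketch reads only the environment variable and does not model spooled-policy.yml or any precedence between the two, which this page does not specify.

```python
import os

DEFAULT_THRESHOLD = 0.30  # 30% increase over the baseline average


def resolve_threshold(env: dict) -> float:
    """Read SPOOLED_COST_SPIKE_THRESHOLD, falling back to the default."""
    raw = env.get("SPOOLED_COST_SPIKE_THRESHOLD")
    return float(raw) if raw else DEFAULT_THRESHOLD


def is_cost_spike(current_total_usd: float,
                  baseline_avg_cost_usd: float,
                  threshold: float) -> bool:
    """Apply the rule: current > baseline * (1 + threshold)."""
    return current_total_usd > baseline_avg_cost_usd * (1 + threshold)


threshold = resolve_threshold(dict(os.environ))
fired = is_cost_spike(0.0055, 0.0035, threshold)
```

With the default threshold, a baseline of $0.0035 puts the trip point at $0.00455, so a $0.0055 run fires and a $0.0040 run does not.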
What gets captured
| Field | Source | Where it ends up |
|---|---|---|
| model | LLM call request | interaction.input.model |
| prompt_tokens / input_tokens | Provider usage response | interaction.output.usage |
| completion_tokens / output_tokens | Provider usage response | interaction.output.usage |
| cached_tokens / cache_read_input_tokens | Provider usage response (when caching is used) | interaction.output.usage |
| cache_creation_input_tokens | Anthropic prompt-caching write | interaction.output.usage |
| usd_cost | Computed: tokens × pricing table | interaction.metadata.usd_cost |
All of the above are structural metadata — no prompt content, no response content, no tool argument values. The usd_cost field is in DEFAULT_METADATA_KEYS so it passes the privacy enforcement layer in both lenient and strict mode.
Signal output
When cost_spike fires, the signal returns:
    {
      "detected": true,
      "severity": "warning",
      "intent_id": "<sha256>",
      "baseline_avg_cost_usd": 0.0033,
      "current_cost_usd": 0.0381,
      "absolute_delta_usd": 0.0348,
      "increase_percent": 1047.1,
      "threshold_percent": 30.0,
      "message": "Cost spike: $0.0381 this run vs. baseline $0.0033 (+1047.1%, +$0.0348)"
    }

The message field is what surfaces in the PR comment headline when cost_spike is the most significant signal that fired.
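As a rough illustration of how the derived fields relate to the two cost inputs, the sketch below rebuilds a payload of the same shape. `build_cost_spike_payload` is a hypothetical helper, and the rounding (four decimals for dollars, one for percentages) is inferred from the sample above rather than confirmed against Spooled's source.

```python
def build_cost_spike_payload(intent_id: str,
                             baseline_avg_cost_usd: float,
                             current_cost_usd: float,
                             threshold: float = 0.30) -> dict:
    """Derive delta, percentage, and message from baseline and current cost."""
    if baseline_avg_cost_usd <= 0:
        raise ValueError("baseline average must be positive")

    delta = current_cost_usd - baseline_avg_cost_usd
    increase_percent = delta / baseline_avg_cost_usd * 100

    return {
        "detected": current_cost_usd > baseline_avg_cost_usd * (1 + threshold),
        "severity": "warning",
        "intent_id": intent_id,
        "baseline_avg_cost_usd": baseline_avg_cost_usd,
        "current_cost_usd": current_cost_usd,
        "absolute_delta_usd": round(delta, 4),
        "increase_percent": round(increase_percent, 1),
        "threshold_percent": round(threshold * 100, 1),
        "message": (
            f"Cost spike: ${current_cost_usd:.4f} this run vs. baseline "
            f"${baseline_avg_cost_usd:.4f} "
            f"(+{increase_percent:.1f}%, +${delta:.4f})"
        ),
    }


payload = build_cost_spike_payload("<sha256>", 0.0035, 0.0055)
```

Because the inputs here are already rounded to four decimals, percentages computed this way can differ slightly from ones computed on unrounded per-call costs.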
How it appears in PR comments
Two surfaces:
When cost spike is the primary signal
If no new tools were added and no new behavior pattern was detected, the PR comment leads with the cost regression:
> [!CAUTION]
> ## 💸 Cost regression detected: agent run cost +1047.1%
>
> This PR costs **$0.0381** per run vs. baseline **$0.0033** (+$0.0348, +1047.1%)
>
> No new tools added in this PR — the cost rose from prompt, model, or iteration
> changes. Review the per-trace details below.
When cost spike compounds with a tool change
When new tools were added and the cost rose, the tool-change headline leads because it is more concrete, and the cost appears as a trailing line:
> [!CAUTION]
> ## 🚨 Merge blocked: agent now calls `issue_refund`
>
> This tool was **never observed in the baseline**. It appears in **2 of 5** traces
> in this PR (~40%).
>
> This PR costs **$0.0055** per run vs. baseline **$0.0035** (+$0.0020, +57.1%)
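The ordering rule behind the two surfaces can be sketched as a small function. `order_comment` is a hypothetical name, and the sketch assumes cost_spike has already fired and that no other signal competes for the headline; the real comment renderer may weigh more inputs than this.

```python
def order_comment(new_tools: list, cost_headline: str, cost_line: str) -> list:
    """Return comment lines in display order.

    A newly observed tool leads when present, with the cost as a trailing
    line; otherwise the cost regression is the headline.
    """
    if new_tools:
        tool_headline = f"## 🚨 Merge blocked: agent now calls `{new_tools[0]}`"
        return [tool_headline, cost_line]
    return [cost_headline, cost_line]


cost_line = (
    "This PR costs **$0.0055** per run vs. baseline **$0.0035** "
    "(+$0.0020, +57.1%)"
)
cost_headline = "## 💸 Cost regression detected: agent run cost +57.1%"

compound = order_comment(["issue_refund"], cost_headline, cost_line)
primary = order_comment([], cost_headline, cost_line)
```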
Pricing table
Default prices for OpenAI and Anthropic models (list pricing as of late 2025 / early 2026) ship in spooled/pricing.py. Caching rates are handled when the provider reports them. The table covers:
- OpenAI: gpt-4o, gpt-4o-mini, gpt-4-turbo, o1, o1-mini, gpt-3.5-turbo, including dated variants (e.g., gpt-4o-2024-08-06)
- Anthropic: claude-3-5-sonnet, claude-3-5-haiku, claude-3-opus, claude-3-haiku, plus Claude 4.x placeholders whose prices should be verified before you rely on them for procurement-grade figures
Override pricing at runtime (e.g., for contracted enterprise rates):
    # prices.json
    {
      "gpt-4o": {
        "input_per_million": 1.875,
        "output_per_million": 7.50,
        "cached_input_per_million": 0.9375
      }
    }

    # Then run agents with:
    SPOOLED_PRICING_OVERRIDE=./prices.json python my_agent.py
Override entries merge on top of defaults — only the models you list are replaced.
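The merge semantics can be sketched as follows, assuming the override file has the shape shown above. `load_pricing` and `DEFAULT_PRICING` are hypothetical names with illustrative rates; they stand in for whatever spooled/pricing.py does internally.

```python
import json
import os
from pathlib import Path
from typing import Optional

# Illustrative defaults (assumed values, not Spooled's shipped table).
DEFAULT_PRICING = {
    "gpt-4o": {
        "input_per_million": 2.50,
        "output_per_million": 10.00,
        "cached_input_per_million": 1.25,
    },
    "gpt-4o-mini": {
        "input_per_million": 0.15,
        "output_per_million": 0.60,
        "cached_input_per_million": 0.075,
    },
}


def load_pricing(override_path: Optional[str] = None) -> dict:
    """Return the defaults with any override entries layered on top."""
    # Copy so the module-level defaults are never mutated.
    pricing = {model: dict(rates) for model, rates in DEFAULT_PRICING.items()}
    if override_path:
        overrides = json.loads(Path(override_path).read_text())
        # Each listed model replaces its default entry; others are untouched.
        pricing.update(overrides)
    return pricing


pricing = load_pricing(os.environ.get("SPOOLED_PRICING_OVERRIDE"))
```

Under this reading, an override entry replaces the whole per-model record, so an entry should list every rate it needs rather than only the ones that changed.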
Tuning the threshold
The 30% default is intentionally lenient to avoid false positives during baseline stabilization. Recommended starting points:
| Use case | Threshold | Rationale |
|---|---|---|
| Dev branch CI | 0.50 (50%) | Tolerate cost variance; surface only large regressions |
| Main branch / pre-prod | 0.30 (30%) | Default; flags meaningful changes worth reviewing |
| Production deploy gate | 0.20 (20%) | Tight gate; any meaningful regression blocks |
| Cost-sensitive (high volume) | 0.10 (10%) | Catch even small per-call regressions before they multiply |
Set via env var or policy:
    # Env var (per-run)
    SPOOLED_COST_SPIKE_THRESHOLD=0.20

    # Or in spooled-policy.yml
    signals:
      cost_spike:
        threshold: 0.20
        severity: warning
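To see how the recommended thresholds behave on the same run, the sketch below checks one hypothetical cost increase against each row of the table. The profile names and the `profiles_that_fire` helper are invented for illustration.

```python
# Thresholds from the table above, keyed by made-up profile names.
RECOMMENDED = {
    "dev_branch_ci": 0.50,
    "main_pre_prod": 0.30,
    "production_gate": 0.20,
    "cost_sensitive": 0.10,
}


def profiles_that_fire(current_usd: float, baseline_usd: float) -> list:
    """List the profiles whose threshold the current run exceeds."""
    return [
        name
        for name, threshold in RECOMMENDED.items()
        if current_usd > baseline_usd * (1 + threshold)
    ]


# A run that costs 25% more than its baseline.
fired = profiles_that_fire(0.0125, 0.0100)
```

A 25% increase passes quietly on dev and main branches but trips the production gate and the cost-sensitive setting, which is the intended gradient.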
Reproducible cost-regression data
The spooled-test-flow repo contains a reproducible experiment harness measuring how common edits affect agent cost. Results are committed at COST_VALIDATION_2026-06-08.md with raw per-run data in experiments/results/. Run it yourself:
    git clone https://github.com/Haefner6/spooled-test-flow
    cd spooled-test-flow
    pip install spooled-ai openai python-dotenv
    echo "OPENAI_API_KEY=sk-..." > .env
    PYTHONPATH=. python experiments/run_cost_experiment.py --runs-per-variant 10
Estimated cost: ~$2–5 USD in API spend for the full N=10 × 2 companies × 6 cells matrix.
What this signal does not catch
- Cost from an unknown model. If the model name isn't in the pricing table (and no override is set), usd_cost is null for that interaction. The signal can't flag cost regressions on models it can't price.
- Cost from non-LLM operations. Vector DB queries, embedding API calls outside the SDK's hooked clients, and inference on self-hosted models all sit outside Spooled's capture surface.
- Cost variance from non-deterministic outputs. Even with seed=42, multi-turn tool-calling agents have natural variance across runs. The intent-scoped baseline absorbs this in the stdev; the threshold needs to be tuned to your variance, not to 0%.
- Pre-baseline runs (cold start). The signal silently declines when the intent has fewer than 1 baseline sample. Build a baseline first via spooled ci update-baseline.
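Two of these blind spots, unpriced models and cold starts, can be made concrete with a short sketch. `trace_cost` and `evaluate` are hypothetical helpers; skipping null costs in the sum is an assumption about how aggregation treats unpriced calls, not documented behavior.

```python
from typing import Optional, Tuple


def trace_cost(interactions: list) -> Tuple[float, int]:
    """Sum usd_cost over priced interactions and count the unpriced ones."""
    priced = [i["usd_cost"] for i in interactions if i.get("usd_cost") is not None]
    return sum(priced), len(interactions) - len(priced)


def evaluate(interactions: list,
             baseline_avg: float,
             baseline_samples: int,
             threshold: float = 0.30) -> Optional[bool]:
    """Return None on cold start, otherwise whether the trace is a spike."""
    if baseline_samples < 1:
        return None  # no baseline yet: the signal declines instead of guessing
    total, _unpriced = trace_cost(interactions)
    return total > baseline_avg * (1 + threshold)


trace = [
    {"usd_cost": 0.002},
    {"usd_cost": None},  # call to a model missing from the pricing table
    {"usd_cost": 0.001},
]
total, unpriced = trace_cost(trace)
```

Under this assumption, a PR that moves traffic to an unpriced model would appear cheaper rather than more expensive, which is why adding a pricing override for such models matters.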
Related signals
- token_usage_spike: same shape as cost_spike but measured in tokens. Useful when contract pricing means per-token cost varies but token volume is the operational metric. Default threshold 30%.
- latency_spike: fires on response-time regressions. Often correlates with cost (more iterations = more latency = more tokens).
- new_behavior_pattern: catches the structural change that caused the cost regression. Pair both for the full story: "the cost rose because the agent started calling X tool that it didn't before."