Cost Signals

Spooled detects when a PR is about to make your agent runs significantly more expensive — before merge, attributed to the specific behavioral change that caused the spike. This page covers the mechanism, how to tune it, and where it shows up.

Note

Cost detection, not optimization. Spooled catches cost regressions; it does not rewrite prompts, recommend caching strategies, or swap models for you. The value is pre-merge attribution: when a run gets significantly more expensive, you find out from the PR comment instead of the end-of-month invoice.

Warning

Scope: cost_spike compares a run against the baseline for the same execution shape, so it fires on cost drift that keeps the structure intact — prompt bloat, longer context, extra iterations on the same model and tools. A change that alters the structure itself (a model swap, an added or removed tool) produces a new fingerprint with no prior cost baseline to compare against; those are caught by the behavioral diff, not by cost_spike. Use the two together, not cost alone.

How it works

Every LLM call captured by Spooled's SDK includes structural usage data: prompt_tokens, completion_tokens, and cached_tokens (OpenAI), or input_tokens, output_tokens, cache_read_input_tokens, and cache_creation_input_tokens (Anthropic). The recorder reads the model name and these counts, looks up per-token pricing in spooled/pricing.py, and stamps a usd_cost field onto the interaction's structural metadata before the trace is persisted.
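A minimal sketch of the tokens-times-pricing step, not the actual contents of spooled/pricing.py. The function name `estimate_usd_cost`, the `PRICING` dict, and the prices in it are illustrative assumptions; the per-million field names are borrowed from the override format shown later on this page. It also assumes OpenAI-style usage, where cached tokens are counted inside prompt_tokens.

```python
from typing import Optional

# Placeholder per-million-token prices for illustration only; the real
# values live in spooled/pricing.py and may differ.
PRICING = {
    "gpt-4o": {
        "input_per_million": 2.50,
        "output_per_million": 10.00,
        "cached_input_per_million": 1.25,
    },
}


def estimate_usd_cost(model: str, usage: dict) -> Optional[float]:
    """Return an estimated USD cost, or None when the model has no price entry."""
    price = PRICING.get(model)
    if price is None:
        return None

    prompt = usage.get("prompt_tokens", 0)
    cached = usage.get("cached_tokens", 0)
    completion = usage.get("completion_tokens", 0)

    # Assumption: cached tokens are a subset of prompt tokens, so they are
    # billed at the cached rate and only the remainder at the input rate.
    uncached = max(prompt - cached, 0)
    cached_rate = price.get("cached_input_per_million", price["input_per_million"])

    total = (
        uncached * price["input_per_million"]
        + cached * cached_rate
        + completion * price["output_per_million"]
    )
    return total / 1_000_000
```

Returning None for an unpriced model, rather than 0.0, keeps "unknown cost" distinguishable from "free" when costs are aggregated later.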

At baseline-comparison time, the cost_spike signal aggregates usd_cost across every LLM call in a trace, looks up the intent-scoped baseline average from RunStatistics.avg_cost, and fires when the current run exceeds the threshold:

current_total_usd > baseline_avg_cost_usd * (1 + threshold)

The default threshold is 0.30 (a 30% increase). It is tunable per project via the SPOOLED_COST_SPIKE_THRESHOLD environment variable or in spooled-policy.yml.
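The comparison above can be sketched in a few lines. This is an illustration of the inequality only, assuming the per-call costs and the baseline average are already available; `check_cost_spike` and its return keys are hypothetical names, not Spooled's API.

```python
def check_cost_spike(call_costs, baseline_avg_cost_usd, threshold=0.30):
    """Apply current_total_usd > baseline_avg_cost_usd * (1 + threshold)."""
    if baseline_avg_cost_usd <= 0:
        raise ValueError("baseline average must be positive to compute a ratio")

    # Aggregate usd_cost across every LLM call in the trace.
    current_total_usd = sum(call_costs)
    limit = baseline_avg_cost_usd * (1 + threshold)
    increase_percent = (
        (current_total_usd - baseline_avg_cost_usd) / baseline_avg_cost_usd * 100
    )
    return {
        "detected": current_total_usd > limit,
        "current_cost_usd": current_total_usd,
        "increase_percent": round(increase_percent, 1),
    }
```

With a baseline of $0.0035 and the default threshold, the limit is $0.00455, so a run totaling $0.0055 fires while a run totaling $0.0040 does not.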

What gets captured

Field	Source	Where it ends up
model	LLM call request	interaction.input.model
prompt_tokens / input_tokens	Provider usage response	interaction.output.usage
completion_tokens / output_tokens	Provider usage response	interaction.output.usage
cached_tokens / cache_read_input_tokens	Provider usage response (when caching is used)	interaction.output.usage
cache_creation_input_tokens	Anthropic prompt-caching write	interaction.output.usage
usd_cost	Computed: tokens × pricing table	interaction.metadata.usd_cost

All of the above are structural metadata — no prompt content, no response content, no tool argument values. The usd_cost field is in DEFAULT_METADATA_KEYS so it passes the privacy enforcement layer in both lenient and strict mode.

Signal output

When cost_spike fires, the signal returns:

{
  "detected": true,
  "severity": "warning",
  "intent_id": "<sha256>",
  "baseline_avg_cost_usd": 0.0033,
  "current_cost_usd": 0.0381,
  "absolute_delta_usd": 0.0348,
  "increase_percent": 1047.1,
  "threshold_percent": 30.0,
  "message": "Cost spike: $0.0381 this run vs. baseline $0.0033 (+1047.1%, +$0.0348)"
}

The message field is what surfaces in the PR comment headline when cost_spike is the most significant signal that fired.
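As an illustration of how the message relates to the numeric fields, the sketch below rebuilds the string from the payload. The helper name `format_cost_spike_message` is hypothetical, and it is an assumption that the message is derived from these four fields in this way.

```python
def format_cost_spike_message(signal: dict) -> str:
    """Render a headline string from a cost_spike payload."""
    return (
        f"Cost spike: ${signal['current_cost_usd']:.4f} this run vs. "
        f"baseline ${signal['baseline_avg_cost_usd']:.4f} "
        f"(+{signal['increase_percent']:.1f}%, "
        f"+${signal['absolute_delta_usd']:.4f})"
    )


# Values copied from the example payload above.
example_signal = {
    "baseline_avg_cost_usd": 0.0033,
    "current_cost_usd": 0.0381,
    "absolute_delta_usd": 0.0348,
    "increase_percent": 1047.1,
}
print(format_cost_spike_message(example_signal))
```

Note that the dollar fields in the payload are rounded to four decimals, so recomputing the percentage from them will not necessarily reproduce increase_percent exactly; the sketch therefore reads the percentage from the payload instead of deriving it.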

How it appears in PR comments

Two surfaces:

When cost spike is the primary signal

If no new tools were added and no new behavior pattern was detected, the PR comment leads with the cost regression:

> [!CAUTION]
> ## 💸 Cost regression detected: agent run cost +1047.1%
>
> This PR costs **$0.0381** per run vs. baseline **$0.0033** (+$0.0348, +1047.1%)
>
> No new tools added in this PR — the cost rose from prompt or iteration changes
> on the same model and tools. Review the per-trace details below.

When cost spike compounds with a tool change

When new tools were added and the cost also rose, the tool-change headline leads because it is the more concrete finding, and the cost appears as a trailing line:

> [!CAUTION]
> ## 🚨 Merge blocked: agent now calls `issue_refund`
>
> This tool was **never observed in the baseline**. It appears in **2 of 5** traces
> in this PR (~40%).
>
> This PR costs **$0.0055** per run vs. baseline **$0.0035** (+$0.0020, +57.1%)

Pricing table

Default prices for OpenAI and Anthropic models (list pricing as of late 2025 / early 2026) ship in spooled/pricing.py. Caching rates are handled when the provider reports them. The table covers:

OpenAI: gpt-4o, gpt-4o-mini, gpt-4-turbo, o1, o1-mini, gpt-3.5-turbo, including dated variants (e.g., gpt-4o-2024-08-06)
Anthropic: claude-3-5-sonnet, claude-3-5-haiku, claude-3-opus, claude-3-haiku, plus Claude 4.x placeholder entries whose prices should be verified before you rely on them for procurement-grade figures

Override pricing at runtime (e.g., for contracted enterprise rates):

# prices.json
{
  "gpt-4o": {
    "input_per_million": 1.875,
    "output_per_million": 7.50,
    "cached_input_per_million": 0.9375
  }
}

# Then run agents with:
SPOOLED_PRICING_OVERRIDE=./prices.json python my_agent.py

Override entries merge on top of defaults — only the models you list are replaced.
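The merge behavior can be pictured with the sketch below. It is not Spooled's loader: `load_pricing` and `DEFAULT_PRICING` are hypothetical names, the default prices are placeholders, and it assumes an override replaces a model's whole entry rather than merging individual fields.

```python
import json
import os

# Placeholder defaults for illustration; the shipped table is in spooled/pricing.py.
DEFAULT_PRICING = {
    "gpt-4o": {
        "input_per_million": 2.50,
        "output_per_million": 10.00,
        "cached_input_per_million": 1.25,
    },
    "gpt-4o-mini": {
        "input_per_million": 0.15,
        "output_per_million": 0.60,
        "cached_input_per_million": 0.075,
    },
}


def load_pricing(defaults=DEFAULT_PRICING, env=os.environ):
    """Return defaults with any SPOOLED_PRICING_OVERRIDE entries layered on top."""
    merged = dict(defaults)  # shallow copy so the defaults are not mutated
    path = env.get("SPOOLED_PRICING_OVERRIDE")
    if path:
        with open(path, encoding="utf-8") as fh:
            overrides = json.load(fh)
        # Assumption: each listed model replaces the default entry wholesale.
        merged.update(overrides)
    return merged
```

With the prices.json above, gpt-4o would resolve to the contracted rates while gpt-4o-mini keeps its default entry.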

Tuning the threshold

The 30% default is intentionally lenient to avoid false positives during baseline stabilization. Recommended starting points:

Use case	Threshold	Rationale
Dev branch CI	0.50 (50%)	Tolerate cost variance; surface only large regressions
Main branch / pre-prod	0.30 (30%)	Default; flags meaningful changes worth reviewing
Production deploy gate	0.20 (20%)	Tight gate; any meaningful regression blocks
Cost-sensitive (high volume)	0.10 (10%)	Catch even small per-call regressions before they multiply

Set via env var or policy:

# Env var (per-run)
SPOOLED_COST_SPIKE_THRESHOLD=0.20

# Or in spooled-policy.yml
signals:
  cost_spike:
    threshold: 0.20
    severity: warning
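A rough sketch of resolving the effective threshold from the two sources. The precedence shown here (env var first, then policy, then the 0.30 default) is an assumption for illustration; this page does not state which source wins. `resolve_threshold` is a hypothetical helper, and the policy is passed in as an already-parsed dict to avoid depending on a YAML library.

```python
import os

DEFAULT_THRESHOLD = 0.30


def resolve_threshold(policy=None, env=None):
    """Pick the cost_spike threshold: env var, then policy, then default."""
    env = os.environ if env is None else env

    # Assumption: the per-run env var overrides the policy file.
    raw = env.get("SPOOLED_COST_SPIKE_THRESHOLD")
    if raw is None:
        signals = (policy or {}).get("signals", {})
        raw = signals.get("cost_spike", {}).get("threshold", DEFAULT_THRESHOLD)

    threshold = float(raw)  # env vars arrive as strings, e.g. "0.20"
    if threshold < 0:
        raise ValueError("threshold must be non-negative")
    return threshold
```

For example, `resolve_threshold(policy={"signals": {"cost_spike": {"threshold": 0.20}}}, env={})` yields 0.2 under these assumptions.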

Reproducible cost-regression data

The spooled-test-flow repo contains a reproducible experiment harness measuring how common edits affect agent cost. Results are committed at COST_VALIDATION_2026-06-08.md with raw per-run data in experiments/results/. Run it yourself:

git clone https://github.com/Haefner6/spooled-test-flow
cd spooled-test-flow
pip install spooled-ai openai python-dotenv
echo "OPENAI_API_KEY=sk-..." > .env
PYTHONPATH=. python experiments/run_cost_experiment.py --runs-per-variant 10

Estimated cost: ~$2–5 USD in API spend for the full N=10 × 2 companies × 6 cells matrix.

What this signal does not catch

Cost from an unknown model. If the model name isn't in the pricing table (and no override is set), usd_cost is null for that interaction. The signal can't flag cost regressions on models it can't price.
Cost from non-LLM operations. Vector DB queries, embedding API calls outside the SDK's hooked clients, and inference on self-hosted models all sit outside Spooled's capture surface.
Cost variance from non-deterministic outputs. Even with seed=42, multi-turn tool-calling agents have natural variance across runs. The intent-scoped baseline absorbs this in the stdev; the threshold needs to be tuned to your variance, not to 0%.
Pre-baseline runs (cold start). The signal silently declines to fire when the intent has no baseline samples yet. Build a baseline first via spooled ci update-baseline.
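Two of the gaps above, unpriced interactions and cold start, can be pictured as guards in front of the comparison. This is a sketch under assumptions, not Spooled's implementation: the helper names are hypothetical, and skipping null costs during aggregation (rather than aborting) is a guess at the behavior.

```python
def total_priced_cost(interactions):
    """Sum usd_cost over interactions, skipping those with a null cost.

    Returns (total_usd, unpriced_count) so callers can see how much of the
    trace was invisible to the cost signal.
    """
    priced = [i["usd_cost"] for i in interactions if i.get("usd_cost") is not None]
    unpriced_count = len(interactions) - len(priced)
    return sum(priced), unpriced_count


def has_baseline(baseline_sample_count, min_samples=1):
    """Cold-start guard: with fewer than min_samples there is nothing to compare."""
    return baseline_sample_count >= min_samples
```

Surfacing the unpriced count alongside the total makes it easier to notice when a trace is dominated by models missing from the pricing table.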

Related signals

token_usage_spike: same shape as cost_spike but measured in tokens. Useful when contract pricing means per-token cost varies but token volume is the operational metric. Default threshold 30%.
latency_spike: fires on response-time regressions. Often correlates with cost (more iterations = more latency = more tokens).
new_behavior_pattern: catches the structural change that caused the cost regression. Pair both for the full story: "the cost rose because the agent started calling X tool that it didn't before."