Drift Signals

Spooled detects behavioral drift signals automatically from execution structure, so there are no assertions to write.

Free-tier signals

new_behavior_pattern

Severity: info — Cold start: the fingerprint doesn't match any known intent in the baseline. Purely informational — the system is learning a new execution path. Not a blocker.

new_side_effects

Severity: medium — New tools detected that weren't in the baseline. Reports current_tools, baseline_tools, and the new_tools list.
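
As a rough illustration of the report described above, the new_tools list can be read as a set difference between the current run and the baseline. This is a minimal sketch, not Spooled's implementation: the function name and the plain-list inputs are assumptions, and only the three field names come from the text.

```python
def new_side_effects(current_tools, baseline_tools):
    """Hypothetical helper: build the three report fields from two tool lists."""
    current = set(current_tools)
    baseline = set(baseline_tools)
    return {
        "current_tools": sorted(current),
        "baseline_tools": sorted(baseline),
        # Tools seen in this run that the baseline never used.
        "new_tools": sorted(current - baseline),
    }


report = new_side_effects(["search", "send_email"], ["search"])
print(report["new_tools"])  # ['send_email']
```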

latency_spikes

Severity: high — Average latency increased beyond threshold, or max latency exceeds 2× the p95 envelope. Also triggers on absolute thresholds: avg > 5000ms or max > 30000ms.

| Env var | Default |
| --- | --- |
| SPOOLED_LATENCY_INCREASE_THRESHOLD | 50.0 (percent) |
| SPOOLED_LATENCY_MAX_MULTIPLIER | 2.0 |
| SPOOLED_LATENCY_AVG_ABSOLUTE_MS | 5000.0 |
| SPOOLED_LATENCY_MAX_ABSOLUTE_MS | 30000.0 |
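
The four conditions above can be sketched as one check that reads the documented variables and falls back to their defaults. The function name, its arguments, the reason labels, and the exact way the relative increase is computed are assumptions made for illustration; only the variable names and default values come from the table.

```python
import os


def _env_float(env, name, default):
    # Read a threshold from the environment mapping, falling back to the documented default.
    return float(env.get(name, default))


def latency_spike(avg_ms, max_ms, baseline_avg_ms, baseline_p95_ms, env=os.environ):
    """Hypothetical check: return the list of latency conditions that fired."""
    increase_pct = _env_float(env, "SPOOLED_LATENCY_INCREASE_THRESHOLD", 50.0)
    max_mult = _env_float(env, "SPOOLED_LATENCY_MAX_MULTIPLIER", 2.0)
    avg_abs = _env_float(env, "SPOOLED_LATENCY_AVG_ABSOLUTE_MS", 5000.0)
    max_abs = _env_float(env, "SPOOLED_LATENCY_MAX_ABSOLUTE_MS", 30000.0)

    reasons = []
    if baseline_avg_ms > 0:
        # Assumed formula: relative increase of the average, in percent.
        change_pct = (avg_ms - baseline_avg_ms) / baseline_avg_ms * 100.0
        if change_pct > increase_pct:
            reasons.append("avg_increase")
    if baseline_p95_ms > 0 and max_ms > max_mult * baseline_p95_ms:
        reasons.append("max_over_p95_envelope")
    if avg_ms > avg_abs:
        reasons.append("avg_absolute")
    if max_ms > max_abs:
        reasons.append("max_absolute")
    return reasons
```

With the defaults, an average that moves from 1000 ms to 1600 ms is a 60% increase and would trip only the relative-increase condition.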

retry_explosions

Severity: high (≥5) / medium (≥3) — Excessive consecutive retries of the same tool after errors. Detects genuine retry loops, not pagination or batch processing. Skips tools listed in SPOOLED_RETRY_EXEMPT_TOOLS.

| Env var | Default |
| --- | --- |
| SPOOLED_RETRY_EXPLOSION_THRESHOLD | 5.0 |
| SPOOLED_RETRY_WARNING_THRESHOLD | 3.0 |
| SPOOLED_RETRY_EXEMPT_TOOLS | "" (comma-separated) |
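
One way to picture the distinction between a retry loop and pagination is to count a call as a retry only when it repeats the same tool immediately after that tool failed. The sketch below assumes a trace of (tool_name, succeeded) pairs and hypothetical function names; it is not Spooled's detector, only an illustration of the thresholds and the exempt list.

```python
def parse_exempt(value):
    # Split a comma-separated SPOOLED_RETRY_EXEMPT_TOOLS value into a set of names.
    return {name.strip() for name in value.split(",") if name.strip()}


def retry_severity(calls, explosion=5, warning=3, exempt=()):
    """Hypothetical detector. calls: list of (tool_name, succeeded) in execution order."""
    worst = {}
    prev_tool, prev_ok, streak = None, True, 0
    for tool, ok in calls:
        # A retry is a repeat of the same tool right after that tool errored.
        if tool == prev_tool and not prev_ok:
            streak += 1
        else:
            streak = 0
        if streak and tool not in exempt:
            worst[tool] = max(worst.get(tool, 0), streak)
        prev_tool, prev_ok = tool, ok

    findings = {}
    for tool, count in worst.items():
        if count >= explosion:
            findings[tool] = "high"
        elif count >= warning:
            findings[tool] = "medium"
    return findings


print(retry_severity([("fetch", False)] * 6))      # {'fetch': 'high'}
print(retry_severity([("list_pages", True)] * 10))  # {}
```

Ten successful calls to the same tool produce no finding here because no call follows an error, which is the pagination case the description excludes.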

error_increases

Severity: high — Error rate increased significantly versus baseline. Catches agents that silently start failing more often.

tool_usage_changes

Severity: medium — Tool call count changed by more than threshold. Detects over-calling or under-calling.

| Env var | Default |
| --- | --- |
| SPOOLED_TOOL_USAGE_CHANGE_THRESHOLD | 50.0 (percent) |

token_usage_spike

Severity: high — Token consumption increased beyond threshold. Catches prompt bloat or model output changes.

| Env var | Default |
| --- | --- |
| SPOOLED_TOKEN_USAGE_SPIKE_THRESHOLD | 0.5 (50%) |
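
Note that the documented defaults use two different units: SPOOLED_TOOL_USAGE_CHANGE_THRESHOLD is given in percent (50.0), while SPOOLED_TOKEN_USAGE_SPIKE_THRESHOLD is given as a fraction (0.5). The sketch below shows how the same relative change would be compared against each. The function names and the relative-change formula are assumptions; the source does not state how Spooled computes the change.

```python
def relative_change(current, baseline):
    # Fractional change versus baseline; None when there is no baseline to compare to.
    if baseline == 0:
        return None
    return (current - baseline) / baseline


def tool_usage_changed(current_calls, baseline_calls, threshold_percent=50.0):
    """Hypothetical check: threshold in percent; over- and under-calling both count."""
    change = relative_change(current_calls, baseline_calls)
    return change is not None and abs(change) * 100.0 > threshold_percent


def token_usage_spiked(current_tokens, baseline_tokens, threshold_fraction=0.5):
    """Hypothetical check: threshold as a fraction; only increases count."""
    change = relative_change(current_tokens, baseline_tokens)
    return change is not None and change > threshold_fraction


print(tool_usage_changed(4, 10))       # True: call count dropped by 60%
print(token_usage_spiked(1400, 1000))  # False: 40% is below the 50% threshold
```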

component_latency_drift

Severity: medium — Latency increased for a specific component (e.g., llm:gpt-4, tool:search, http:api.example.com). Pinpoints the source of slowdowns.

| Env var | Default |
| --- | --- |
| SPOOLED_COMPONENT_LATENCY_DRIFT_THRESHOLD | 1.0 (100%) |

tool_overuse

Severity: medium — A tool is being called more than necessary. Detects redundant or circular invocations.

| Env var | Default |
| --- | --- |
| SPOOLED_TOOL_OVERUSE_THRESHOLD | 0.5 |

retrieval_regression

Severity: high — RAG retrieval quality degraded. Monitors functions listed in SPOOLED_RETRIEVAL_FUNCTIONS.

| Env var | Default |
| --- | --- |
| SPOOLED_RETRIEVAL_REGRESSION_THRESHOLD | 0.2 (20%) |
| SPOOLED_RETRIEVAL_FUNCTIONS | "retrieve,vector_search,search,query" |

content_filter_rate_change

Severity: medium — The rate of content_filter finish reasons changed versus baseline. Detects when your model starts filtering more (or fewer) requests — without reading what was filtered.

| Env var | Default |
| --- | --- |
| SPOOLED_CONTENT_FILTER_RATE_CHANGE_THRESHOLD | 0.1 (10%) |

sequence_drift

Severity: info — Tool execution order differs from baseline, but the tool set is the same (structural fingerprint matches). Purely informational — ordering variance is expected for LLM-based agents.

Content-blind signals (Pro)

Note: These signals require a Pro or Team plan. They read only structural patterns, never your content.

output_schema_drift

Severity: high/medium — Output field schema changed — fields added or removed from tool responses. Catches breaking API contract changes.

| Env var | Default |
| --- | --- |
| SPOOLED_OUTPUT_SCHEMA_FAIL_ON_REMOVED | "true" |
| SPOOLED_OUTPUT_SCHEMA_FAIL_ON_ADDED | "false" |
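
A content-blind schema comparison can be sketched by looking only at field names, never at values. The function below is hypothetical, and the mapping of the two flags onto a single "fails" result is an assumption inferred from the variable names. How Spooled assigns high versus medium severity is not stated in the source, so the sketch does not attempt it.

```python
def output_schema_drift(current_fields, baseline_fields,
                        fail_on_removed=True, fail_on_added=False):
    """Hypothetical comparison of two sets of output field names."""
    current, baseline = set(current_fields), set(baseline_fields)
    added = sorted(current - baseline)
    removed = sorted(baseline - current)
    # Assumed semantics: each flag decides whether that kind of change is a failure.
    fails = (bool(removed) and fail_on_removed) or (bool(added) and fail_on_added)
    return {
        "added": added,
        "removed": removed,
        "drifted": bool(added or removed),
        "fails": fails,
    }


# Only the keys of a tool response are inspected, not its values.
baseline_response = {"id": 1, "title": "a", "url": "x"}
current_response = {"id": 2, "title": "b"}
result = output_schema_drift(current_response.keys(), baseline_response.keys())
print(result["removed"], result["fails"])  # ['url'] True
```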

Metric-sensitive signals

These signals are non-deterministic (latency/tokens vary between runs). In 2-pass CI mode (--retries 2), they are skipped on the first pass and only evaluated on the rerun:

METRIC_SENSITIVE_SIGNALS = {
    "latency_spikes",
    "token_usage_spike",
    "component_latency_drift",
}
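
The two-pass behaviour described above amounts to filtering this set out of the first pass. A minimal sketch follows, assuming a hypothetical signals_to_evaluate helper and 1-based pass numbers; how Spooled wires this into its CLI is not shown in the source.

```python
METRIC_SENSITIVE_SIGNALS = {
    "latency_spikes",
    "token_usage_spike",
    "component_latency_drift",
}


def signals_to_evaluate(all_signals, pass_number, two_pass=True):
    """Hypothetical helper: pick which signals to check on a given pass."""
    if two_pass and pass_number == 1:
        # First pass: skip signals whose values vary naturally between runs.
        return [s for s in all_signals if s not in METRIC_SENSITIVE_SIGNALS]
    # Rerun (or single-pass mode): evaluate everything.
    return list(all_signals)


detected = ["new_side_effects", "latency_spikes", "retry_explosions"]
print(signals_to_evaluate(detected, pass_number=1))
# ['new_side_effects', 'retry_explosions']
```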

Spooled Score

A composite 0–100 stability metric with four weighted components:

| Component | Weight | Description |
| --- | --- | --- |
| Structural | 40% | Fingerprint similarity to baseline (match=100, variant=similarity×100, new=0) |
| Signal | 25% | Health based on detected signals (100 minus penalties: high=25, medium=12, low=5) |
| Metric | 20% | Stability of latency and token usage |
| Trend | 15% | Consistency of recent comparison history (last 10 data points) |

Grades: A (90–100), B (75–89), C (60–74), D (40–59), F (0–39).

Confidence: high (≥5 trend points), medium (2–4), low (<2 or cold start).
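
To make the weighting concrete, the composite can be worked through as a weighted sum of four 0–100 component scores. The sketch below uses the documented weights, penalties, grade bands, and confidence rules. The function names, the clamping of signal health at 0, the zero penalty for info-level signals, and the treatment of fractional scores at grade boundaries are assumptions.

```python
WEIGHTS = {"structural": 40, "signal": 25, "metric": 20, "trend": 15}  # percent
PENALTIES = {"high": 25, "medium": 12, "low": 5}


def signal_health(severities):
    # 100 minus the summed penalties; info-level signals carry no penalty here (assumption).
    penalty = sum(PENALTIES.get(s, 0) for s in severities)
    return max(0, 100 - penalty)  # clamped at 0 (assumption)


def spooled_score(structural, signal, metric, trend):
    parts = {"structural": structural, "signal": signal, "metric": metric, "trend": trend}
    return sum(WEIGHTS[name] * parts[name] for name in WEIGHTS) / 100


def grade(score):
    # Lower bounds of the documented bands; fractional scores fall into the band below.
    for letter, floor in (("A", 90), ("B", 75), ("C", 60), ("D", 40)):
        if score >= floor:
            return letter
    return "F"


def confidence(trend_points, cold_start=False):
    if cold_start or trend_points < 2:
        return "low"
    return "high" if trend_points >= 5 else "medium"


health = signal_health(["high", "medium"])  # 100 - 25 - 12 = 63
score = spooled_score(structural=100, signal=health, metric=80, trend=90)
print(score, grade(score), confidence(trend_points=3))  # 85.25 B medium
```

In this example an exact fingerprint match contributes 40 points, while one high and one medium signal reduce the Signal component to 63, which contributes 15.75 points.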