# Drift Signals
Spooled detects behavioral drift signals automatically from execution structure. No assertions to write.
## Free-tier signals
### new_behavior_pattern
Severity: info — Cold start: the fingerprint doesn't match any known intent in the baseline. Purely informational — the system is learning a new execution path. Not a blocker.
### new_side_effects
Severity: medium — New tools detected that weren't in the baseline. Reports current_tools, baseline_tools, and the new_tools list.
### latency_spikes
Severity: high — Average latency increased beyond threshold, or max latency exceeds 2× the p95 envelope. Also triggers on absolute thresholds: avg > 5000ms or max > 30000ms.
| Env var | Default |
|---|---|
| SPOOLED_LATENCY_INCREASE_THRESHOLD | 50.0 (percent) |
| SPOOLED_LATENCY_MAX_MULTIPLIER | 2.0 |
| SPOOLED_LATENCY_AVG_ABSOLUTE_MS | 5000.0 |
| SPOOLED_LATENCY_MAX_ABSOLUTE_MS | 30000.0 |
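The three trigger conditions above can be sketched as follows. This is an illustrative reconstruction from the documented thresholds, not Spooled's actual implementation; the function name and parameters are hypothetical.

```python
import os

def latency_spike(avg_ms, max_ms, baseline_avg_ms, baseline_p95_ms):
    """Illustrative sketch of the latency_spikes conditions (assumed semantics)."""
    increase_pct = float(os.getenv("SPOOLED_LATENCY_INCREASE_THRESHOLD", "50.0"))
    max_mult = float(os.getenv("SPOOLED_LATENCY_MAX_MULTIPLIER", "2.0"))
    avg_abs = float(os.getenv("SPOOLED_LATENCY_AVG_ABSOLUTE_MS", "5000.0"))
    max_abs = float(os.getenv("SPOOLED_LATENCY_MAX_ABSOLUTE_MS", "30000.0"))

    # Relative: average latency increased beyond the percent threshold.
    relative = baseline_avg_ms > 0 and \
        (avg_ms - baseline_avg_ms) / baseline_avg_ms * 100 > increase_pct
    # Envelope: max latency exceeds the multiplier times the p95 envelope.
    envelope = max_ms > max_mult * baseline_p95_ms
    # Absolute: either metric crosses its hard ceiling.
    absolute = avg_ms > avg_abs or max_ms > max_abs
    return relative or envelope or absolute
```

Any one of the three conditions firing is enough to raise the signal.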
### retry_explosions
Severity: high (≥5) / medium (≥3) — Excessive consecutive retries of the same tool after errors. Detects genuine retry loops, not pagination or batch processing. Skips tools listed in SPOOLED_RETRY_EXEMPT_TOOLS.
| Env var | Default |
|---|---|
| SPOOLED_RETRY_EXPLOSION_THRESHOLD | 5.0 |
| SPOOLED_RETRY_WARNING_THRESHOLD | 3.0 |
| SPOOLED_RETRY_EXEMPT_TOOLS | "" (comma-separated) |
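A minimal sketch of the consecutive-retry counting described above, assuming the detector tracks the longest run of errored calls to the same tool. The function name and the `(tool, errored)` input shape are illustrative, not Spooled's API.

```python
import os

def retry_explosion(calls):
    """Return a severity for the longest run of consecutive errored
    calls to the same tool. calls: list of (tool_name, is_error)."""
    high = int(float(os.getenv("SPOOLED_RETRY_EXPLOSION_THRESHOLD", "5.0")))
    medium = int(float(os.getenv("SPOOLED_RETRY_WARNING_THRESHOLD", "3.0")))
    exempt = {t for t in os.getenv("SPOOLED_RETRY_EXEMPT_TOOLS", "").split(",") if t}

    longest, run, prev = 0, 0, None
    for tool, is_error in calls:
        if is_error and tool == prev and tool not in exempt:
            run += 1  # same tool failing again: the retry run continues
        else:
            run = 1 if (is_error and tool not in exempt) else 0
        prev = tool
        longest = max(longest, run)

    if longest >= high:
        return "high"
    if longest >= medium:
        return "medium"
    return None
```

Because only *consecutive* errored calls to the *same* tool count, alternating tools (pagination, batch fan-out) never accumulate a run.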
### error_increases
Severity: high — Error rate increased significantly versus baseline. Catches agents that silently start failing more often.
### tool_usage_changes
Severity: medium — Tool call count changed by more than threshold. Detects over-calling or under-calling.
| Env var | Default |
|---|---|
| SPOOLED_TOOL_USAGE_CHANGE_THRESHOLD | 50.0 (percent) |
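The percent-change comparison is straightforward; a hypothetical sketch, with an assumed convention for a zero-call baseline:

```python
def usage_change_pct(current_calls, baseline_calls):
    """Absolute percent change in tool call count vs. baseline."""
    if baseline_calls == 0:
        # Assumed convention: any calls against an empty baseline count as drift.
        return float("inf") if current_calls else 0.0
    return abs(current_calls - baseline_calls) / baseline_calls * 100
```

With the default 50% threshold, 9 calls against a baseline of 5 (an 80% change) fires the signal, while 6 against 5 (20%) does not. The `abs()` means both over-calling and under-calling are caught.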
### token_usage_spike
Severity: high — Token consumption increased beyond threshold. Catches prompt bloat or model output changes.
| Env var | Default |
|---|---|
| SPOOLED_TOKEN_USAGE_SPIKE_THRESHOLD | 0.5 (50%) |
### component_latency_drift
Severity: medium — Latency increased for a specific component (e.g., llm:gpt-4, tool:search, http:api.example.com). Pinpoints the source of slowdowns.
| Env var | Default |
|---|---|
| SPOOLED_COMPONENT_LATENCY_DRIFT_THRESHOLD | 1.0 (100%) |
### tool_overuse
Severity: medium — A tool is being called more than necessary. Detects redundant or circular invocations.
| Env var | Default |
|---|---|
| SPOOLED_TOOL_OVERUSE_THRESHOLD | 0.5 |
### retrieval_regression
Severity: high — RAG retrieval quality degraded. Monitors functions listed in SPOOLED_RETRIEVAL_FUNCTIONS.
| Env var | Default |
|---|---|
| SPOOLED_RETRIEVAL_REGRESSION_THRESHOLD | 0.2 (20%) |
| SPOOLED_RETRIEVAL_FUNCTIONS | "retrieve,vector_search,search,query" |
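One plausible shape of this check, assuming a 0–1 retrieval-quality score where higher is better (the doc does not specify the metric, so treat the scoring semantics as an assumption):

```python
import os

def retrieval_regression(current_score, baseline_score):
    """Illustrative sketch: flag when a retrieval-quality score drops
    by more than the threshold fraction relative to baseline."""
    threshold = float(os.getenv("SPOOLED_RETRIEVAL_REGRESSION_THRESHOLD", "0.2"))
    if baseline_score <= 0:
        return False  # no meaningful baseline to regress from
    return (baseline_score - current_score) / baseline_score > threshold

def monitored_retrieval_functions():
    """Parse the comma-separated SPOOLED_RETRIEVAL_FUNCTIONS list."""
    raw = os.getenv("SPOOLED_RETRIEVAL_FUNCTIONS",
                    "retrieve,vector_search,search,query")
    return {name.strip() for name in raw.split(",") if name.strip()}
```

Only calls to functions in the monitored set contribute to the quality score.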
### content_filter_rate_change
Severity: medium — The rate of content_filter finish reasons changed versus baseline. Detects when your model starts filtering more (or fewer) requests — without reading what was filtered.
| Env var | Default |
|---|---|
| SPOOLED_CONTENT_FILTER_RATE_CHANGE_THRESHOLD | 0.1 (10%) |
### sequence_drift
Severity: info — Tool execution order differs from baseline, but the tool set is the same (structural fingerprint matches). Purely informational — ordering variance is expected for LLM-based agents.
## Content-blind signals (Pro)
### output_schema_drift
Severity: high/medium — Output field schema changed — fields added or removed from tool responses. Catches breaking API contract changes.
| Env var | Default |
|---|---|
| SPOOLED_OUTPUT_SCHEMA_FAIL_ON_REMOVED | "true" |
| SPOOLED_OUTPUT_SCHEMA_FAIL_ON_ADDED | "false" |
## Metric-sensitive signals
These signals are non-deterministic (latency and token usage vary between runs). In 2-pass CI mode (`--retries 2`), they are skipped on the first pass and evaluated only on the rerun:
```python
METRIC_SENSITIVE_SIGNALS = {
    "latency_spikes",
    "token_usage_spike",
    "component_latency_drift",
}
```

## Spooled Score
A composite 0–100 stability metric with four weighted components:
| Component | Weight | Description |
|---|---|---|
| Structural | 40% | Fingerprint similarity to baseline (match=100, variant=similarity×100, new=0) |
| Signal | 25% | Health based on detected signals (100 minus penalties: high=-25, medium=-12, low=-5) |
| Metric | 20% | Stability of latency and token usage |
| Trend | 15% | Consistency of recent comparison history (last 10 data points) |
Grades: A (90–100), B (75–89), C (60–74), D (40–59), F (0–39).
Confidence: high (≥5 trend points), medium (2–4), low (<2 or cold start).
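The weighting, grading, and confidence rules above can be sketched as follows; this is an illustrative reconstruction from the documented values, not Spooled's implementation, and the function names are hypothetical.

```python
def spooled_score(structural, signal, metric, trend):
    """Combine four 0-100 component scores with the documented weights,
    returning the composite score and its letter grade."""
    score = 0.40 * structural + 0.25 * signal + 0.20 * metric + 0.15 * trend
    if score >= 90:
        grade = "A"
    elif score >= 75:
        grade = "B"
    elif score >= 60:
        grade = "C"
    elif score >= 40:
        grade = "D"
    else:
        grade = "F"
    return round(score, 1), grade

def trend_confidence(trend_points):
    """Map the number of trend data points to a confidence label."""
    if trend_points >= 5:
        return "high"
    if trend_points >= 2:
        return "medium"
    return "low"  # fewer than 2 points, or a cold start
```

Because Structural carries 40% of the weight, a run whose fingerprint doesn't match the baseline (structural = 0) caps out at a C even if every other component is perfect.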