# Behavioral Fingerprinting
A behavioral fingerprint is a structural hash of an agent's execution pattern. It captures what the agent did — not what it said.
## What's in a fingerprint
The fingerprint is computed from:
- Interaction type sequence — the ordered list of `LLM_CALL`, `TOOL_CALL`, `HTTP_REQUEST`, `OTHER`
- Tool name sequence — the ordered list of tools/models/endpoints called
- Step count — total number of interactions
These are hashed into a SHA-256 hex string. The hash version is v2 (tracked via `_FINGERPRINT_HASH_VERSION` to prevent silent invalidation across SDK updates).
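The computation above can be sketched as follows. This is a minimal illustration, assuming interactions are `(type, target)` pairs — the SDK's actual internal representation may differ:

```python
import hashlib

_FINGERPRINT_HASH_VERSION = "v2"  # version prefix, as described above

def fingerprint(interactions):
    """Hash an ordered list of (interaction_type, target) pairs.

    Example input: [("LLM_CALL", "gpt-4"), ("TOOL_CALL", "search")]
    """
    parts = [_FINGERPRINT_HASH_VERSION]
    parts += [f"{kind}:{target}" for kind, target in interactions]
    raw = "|".join(parts)
    # Truncate the SHA-256 digest to 16 hex characters, per the format above.
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()[:16]
```

Because the version string is part of the hashed input, any change to the format changes every hash, so stale baselines fail loudly rather than matching by accident.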
## Hashing modes
Set via `SPOOLED_FINGERPRINT_MODE` or the `fingerprint_mode` parameter.
### `sequence` (default)
Order-sensitive. The hash is computed from the ordered sequence of interactions:
"v2|LLM_CALL:gpt-4|TOOL_CALL:search|TOOL_CALL:summarize|..." → SHA-256 → truncated to 16 hex characters
Interactions in the same `parallel_group` are sorted before hashing for determinism.
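The parallel-group normalization can be sketched like this. The step representation and field names here are illustrative, not the SDK's:

```python
from itertools import groupby

def normalize_parallel_groups(steps):
    """Sort concurrent steps within each parallel group.

    `steps` is a list of dicts with "parallel_group" (None for sequential
    steps), "kind", and "target" keys — hypothetical names for illustration.
    """
    out = []
    for group, members in groupby(steps, key=lambda s: s["parallel_group"]):
        members = list(members)
        if group is not None:
            # Concurrent steps may complete in any order; sorting makes the
            # hashed sequence deterministic across runs.
            members.sort(key=lambda s: (s["kind"], s["target"]))
        out.extend(members)
    return out
```

Without this step, two identical runs whose parallel tool calls finished in a different order would produce different `sequence`-mode hashes.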
### `structural`
Order-insensitive. Hashes the sorted unique set of interaction types and targets. Useful for ReAct-style agents where the loop count varies but the tool set is stable.
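A sketch of the order-insensitive variant, under the same assumption as above that interactions are `(type, target)` pairs:

```python
import hashlib

def structural_fingerprint(interactions):
    """Hash the sorted unique set of interaction types and targets.

    Loop count and ordering are deliberately ignored, so a ReAct agent
    that runs its loop twice hashes the same as one that runs it once.
    """
    unique = sorted({f"{kind}:{target}" for kind, target in interactions})
    raw = "v2|" + "|".join(unique)
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()[:16]
```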
## Intents
An intent is a distinct behavioral pattern identified by its fingerprint hash. A single agent can have multiple intents — for example, a customer support agent might have different tool sequences for returns vs. technical issues vs. order status.
Baselines store statistics per intent, so comparison is always intent-scoped.
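One way to picture intent-scoped baselines is a store keyed by fingerprint hash. This is purely an illustration of the data model, not Spooled's actual storage:

```python
from collections import defaultdict

class BaselineStore:
    """Illustrative per-intent statistics, keyed by fingerprint hash."""

    def __init__(self):
        self.stats = defaultdict(lambda: {"runs": 0, "total_steps": 0})

    def record(self, fingerprint_hash, step_count):
        s = self.stats[fingerprint_hash]
        s["runs"] += 1
        s["total_steps"] += step_count

    def mean_steps(self, fingerprint_hash):
        # Comparison is intent-scoped: stats for the "returns" intent never
        # mix with stats for the "order status" intent.
        s = self.stats[fingerprint_hash]
        return s["total_steps"] / s["runs"] if s["runs"] else None
```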
## Similarity scoring
When a fingerprint doesn't exactly match, Spooled computes detailed similarity across three dimensions:
| Dimension | Method | Description |
|---|---|---|
| `tool_jaccard` | Jaccard similarity | Overlap of tool sets (ignoring order) |
| `sequence_lcs` | Longest common subsequence | Preserved ordering of the tool sequence |
| `error_jaccard` | Jaccard similarity | Overlap of error patterns |
The overall similarity is a weighted composite of these three scores. If the composite is ≥ 75%, the trace is classified as a variant of the closest baseline intent; below 75%, it's classified as new.
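The three dimensions can be sketched as follows. The weights here are illustrative — the docs above specify only that the composite is weighted and that the threshold is 75%:

```python
def jaccard(a, b):
    """Set overlap: |A ∩ B| / |A ∪ B|; two empty sets count as identical."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def lcs_ratio(a, b):
    """Longest common subsequence length, normalized by the longer sequence."""
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            dp[i + 1][j + 1] = (dp[i][j] + 1 if a[i] == b[j]
                                else max(dp[i][j + 1], dp[i + 1][j]))
    return dp[m][n] / max(m, n, 1)

def overall_similarity(trace, baseline, weights=(0.4, 0.4, 0.2)):
    """Weighted composite of tool_jaccard, sequence_lcs, and error_jaccard.

    `trace` and `baseline` are dicts with "tools" and "errors" lists —
    hypothetical shapes chosen for this sketch.
    """
    scores = (
        jaccard(trace["tools"], baseline["tools"]),      # tool_jaccard
        lcs_ratio(trace["tools"], baseline["tools"]),    # sequence_lcs
        jaccard(trace["errors"], baseline["errors"]),    # error_jaccard
    )
    return sum(w * s for w, s in zip(weights, scores))
```

A trace identical to the baseline scores 1.0 and is a variant; a fully disjoint trace scores 0.0 and is classified as new.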
## What fingerprinting catches
- Tools added or removed from the sequence
- Tool order changes (in sequence mode)
- Model swaps (e.g., `gpt-4` → `gpt-4o-mini`)
- Pipeline steps added or removed
- Extra LLM calls or tool calls
## What it doesn't catch
- Semantic quality of outputs (use evals for that)
- Model refusals (requires output content analysis)
- Retrieval relevance (see the `retrieval_regression` signal)
- Prompt injection (out of scope; use guardrails)
Fingerprinting is complementary to output evals. It catches structural regressions that evals miss (and vice versa).