Behavioral Fingerprinting

A behavioral fingerprint is a structural hash of an agent's execution pattern. It captures what the agent did — not what it said.

What's in a fingerprint

The fingerprint is computed from:

Interaction type sequence: the ordered list of interaction types, each one of LLM_CALL, TOOL_CALL, HTTP_REQUEST, or OTHER
Tool name sequence: the ordered list of targets called (tools, models, or endpoints)
Step count: the total number of interactions in the trace

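These are serialized into a single string and hashed with SHA-256. The fingerprint is the resulting hex digest, which the sequence-mode example below shows truncated to 16 hex characters. The hash version is v2. It is tracked via _FINGERPRINT_HASH_VERSION so that a change to the hashing scheme in an SDK update does not silently invalidate existing fingerprints.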

Hashing modes

Set via SPOOLED_FINGERPRINT_MODE or the fingerprint_mode parameter.
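
As a rough illustration of how the two settings might interact, the sketch below resolves the mode from an explicit argument first and the environment variable second. The helper name resolve_fingerprint_mode and the precedence order are assumptions made for this example; the text above does not say which setting wins, and this is not the SDK's own code.

```python
import os
from typing import Optional

# The two modes described in this section.
VALID_MODES = ("sequence", "structural")


def resolve_fingerprint_mode(fingerprint_mode: Optional[str] = None) -> str:
    """Hypothetical helper: pick the hashing mode.

    Assumed precedence: explicit parameter, then SPOOLED_FINGERPRINT_MODE,
    then the documented default ("sequence").
    """
    mode = fingerprint_mode or os.environ.get("SPOOLED_FINGERPRINT_MODE") or "sequence"
    mode = mode.strip().lower()
    if mode not in VALID_MODES:
        raise ValueError(f"unknown fingerprint mode: {mode!r}")
    return mode
```

Under these assumptions, exporting SPOOLED_FINGERPRINT_MODE=structural changes the mode for every call that does not pass fingerprint_mode explicitly.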

sequence (default)

Order-sensitive. The hash is computed from the ordered sequence of interactions:

"v2|LLM_CALL:gpt-4|TOOL_CALL:search|TOOL_CALL:summarize|..."
→ SHA-256 → truncated to 16 hex characters

Interactions in the same parallel_group can complete in any order, so they are sorted before hashing to keep the fingerprint deterministic.
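
The sequence-mode steps can be sketched as follows. This is a minimal illustration, not the SDK's implementation: the Interaction class, its field names, and the assumption that members of a parallel group are adjacent in the trace are all hypothetical. The source example elides the tail of the payload, so anything that might follow the last token (such as an explicit step count) is left out here.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional

FINGERPRINT_HASH_VERSION = "v2"  # version prefix taken from the example above


@dataclass(frozen=True)
class Interaction:
    """Hypothetical record for one step of a trace."""
    kind: str                             # "LLM_CALL", "TOOL_CALL", ...
    target: str                           # tool, model, or endpoint name
    parallel_group: Optional[int] = None  # None means the step ran on its own


def sequence_fingerprint(interactions: list[Interaction]) -> str:
    """Order-sensitive fingerprint: 16 hex characters of a SHA-256 digest."""
    tokens: list[str] = []
    i = 0
    while i < len(interactions):
        group = interactions[i].parallel_group
        j = i + 1
        # Extend the run over adjacent steps that share a parallel group.
        if group is not None:
            while j < len(interactions) and interactions[j].parallel_group == group:
                j += 1
        run = [f"{step.kind}:{step.target}" for step in interactions[i:j]]
        if group is not None:
            run.sort()  # parallel steps have no meaningful order
        tokens.extend(run)
        i = j
    payload = "|".join([FINGERPRINT_HASH_VERSION, *tokens])
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]
```

With this sketch, two traces that differ only in the completion order of steps inside one parallel group produce the same fingerprint, while swapping a model name (for example gpt-4 for gpt-4o-mini) or reordering sequential steps produces a different one.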

structural

Order-insensitive. Hashes the sorted unique set of interaction types and targets. Useful for ReAct-style agents where the loop count varies but the tool set is stable.
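
A minimal sketch of structural mode, assuming it reuses the token format, version prefix, and 16-character truncation shown for sequence mode, and that the step count is left out of the payload. None of these details are confirmed above for structural mode, and the function name is hypothetical.

```python
import hashlib


def structural_fingerprint(steps: list[tuple[str, str]]) -> str:
    """Order-insensitive fingerprint over (interaction_type, target) pairs."""
    # Deduplicate, then sort, so neither repetition nor order affects the hash.
    unique_tokens = sorted({f"{kind}:{target}" for kind, target in steps})
    payload = "|".join(["v2", *unique_tokens])  # "v2" prefix is an assumption here
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]
```

Under these assumptions, a ReAct loop that calls search twice and one that calls it five times hash to the same value, while a run that introduces a new tool does not.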

Intents

An intent is a distinct behavioral pattern identified by its fingerprint hash. A single agent can have multiple intents — for example, a customer support agent might have different tool sequences for returns vs. technical issues vs. order status.

Baselines store statistics per intent, so comparison is always intent-scoped.
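
To make intent scoping concrete, the sketch below groups traces by fingerprint and keeps separate statistics for each group. The statistics a baseline actually stores are not listed above, so the duration metric, the field names, and the build_baselines helper are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def build_baselines(traces: list[tuple[str, float]]) -> dict[str, dict[str, float]]:
    """Group (fingerprint, duration_ms) pairs into per-intent statistics."""
    by_intent: dict[str, list[float]] = defaultdict(list)
    for fingerprint, duration_ms in traces:
        by_intent[fingerprint].append(duration_ms)
    # One statistics record per intent; a new trace is compared only against
    # the record whose fingerprint it matches.
    return {
        fingerprint: {"count": len(durations), "mean_duration_ms": mean(durations)}
        for fingerprint, durations in by_intent.items()
    }
```

Keeping the records separate means a slow but legitimate intent (say, returns handling) does not skew the baseline for a fast one (say, order status).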

Similarity scoring

When a fingerprint doesn't exactly match any baseline intent, Spooled computes a similarity score along three dimensions:

Dimension	Method	Description
tool_jaccard	Jaccard similarity	Overlap of tool sets (ignoring order)
sequence_lcs	Longest common subsequence	Preserved ordering of tool sequence
error_jaccard	Jaccard similarity	Overlap of error patterns

The overall similarity is a weighted composite of the three scores. At 75% or above, the trace is classified as a variant of the closest baseline intent. Below 75%, it is classified as a new intent.
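
The three dimensions and the 75% cutoff can be sketched as below. The weights are placeholders chosen for illustration, since the actual weighting is not given above. Normalizing the LCS length by the longer sequence and treating two empty sets as fully similar are likewise assumptions.

```python
def jaccard(a: set[str], b: set[str]) -> float:
    """Size of the intersection over size of the union (1.0 if both are empty)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def lcs_ratio(a: list[str], b: list[str]) -> float:
    """Longest common subsequence length, normalized by the longer sequence."""
    if not a and not b:
        return 1.0
    prev = [0] * (len(b) + 1)  # LCS lengths for the previous row of the DP table
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1] / max(len(a), len(b))


# Hypothetical weights; only the 0.75 threshold comes from the text above.
WEIGHTS = {"tool_jaccard": 0.4, "sequence_lcs": 0.4, "error_jaccard": 0.2}
VARIANT_THRESHOLD = 0.75


def classify(trace_tools: list[str], baseline_tools: list[str],
             trace_errors: set[str], baseline_errors: set[str]) -> tuple[str, float]:
    """Return ("variant" or "new", overall similarity) for one baseline intent."""
    scores = {
        "tool_jaccard": jaccard(set(trace_tools), set(baseline_tools)),
        "sequence_lcs": lcs_ratio(trace_tools, baseline_tools),
        "error_jaccard": jaccard(trace_errors, baseline_errors),
    }
    overall = sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)
    return ("variant" if overall >= VARIANT_THRESHOLD else "new"), overall
```

Under these placeholder weights, a trace that appends one tool to a three-tool baseline scores 0.75 on both tool dimensions and 1.0 on errors, for an overall similarity of about 0.8, which lands on the variant side of the threshold. In practice the comparison would run against every baseline intent and keep the closest one.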

What fingerprinting catches

Tools added or removed from the sequence
Tool order changes (in sequence mode)
Model swaps (e.g., gpt-4 → gpt-4o-mini)
Pipeline steps added or removed
Extra LLM calls or tool calls

What it doesn't catch

Semantic quality of outputs (use evals for that)
Model refusals (requires output content analysis)
Retrieval relevance (see retrieval_regression signal)
Prompt injection (out of scope — use guardrails)

Fingerprinting is complementary to output evals. It catches structural regressions that evals miss, and evals catch output-quality regressions that leave the execution pattern unchanged.