Behavioral Fingerprinting

A behavioral fingerprint is a structural hash of an agent's execution pattern. It captures what the agent did — not what it said.

What's in a fingerprint

The fingerprint is computed from:

  • Interaction type sequence — the ordered list of LLM_CALL, TOOL_CALL, HTTP_REQUEST, OTHER
  • Tool name sequence — the ordered list of tools/models/endpoints called
  • Step count — total number of interactions

These inputs are hashed with SHA-256 and stored as a hex string. The hash format is versioned (currently v2, tracked via _FINGERPRINT_HASH_VERSION) so that SDK updates cannot silently invalidate stored fingerprints.

Hashing modes

Set the mode via SPOOLED_FINGERPRINT_MODE or the fingerprint_mode parameter.

sequence (default)

Order-sensitive. The hash is computed from the ordered sequence of interactions:

"v2|LLM_CALL:gpt-4|TOOL_CALL:search|TOOL_CALL:summarize|..."
  → SHA-256 → truncated to 16 hex characters

Interactions in the same parallel_group are sorted before hashing for determinism.
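
The sequence-mode steps above can be sketched roughly as follows. This is a minimal illustration, not the SDK's implementation: the Interaction class, its field names, and the sequence_fingerprint function are hypothetical, and the payload layout is inferred only from the example string shown above. The sketch also assumes that members of a parallel_group appear consecutively, and it leaves the step count implicit in the number of tokens.

```python
import hashlib
from dataclasses import dataclass
from itertools import groupby
from typing import Optional

# Mirrors the "v2" prefix in the example payload; the constant name is illustrative.
FINGERPRINT_HASH_VERSION = "v2"


@dataclass(frozen=True)
class Interaction:
    """Hypothetical record of one step in a trace."""
    kind: str                             # LLM_CALL, TOOL_CALL, HTTP_REQUEST or OTHER
    target: str                           # tool, model or endpoint name
    parallel_group: Optional[int] = None  # None means the step did not run in parallel


def sequence_fingerprint(interactions: list[Interaction]) -> str:
    """Order-sensitive hash; steps inside one parallel group are sorted first."""
    tokens: list[str] = []
    # groupby yields runs of consecutive steps that share a parallel_group value.
    for group, run in groupby(interactions, key=lambda i: i.parallel_group):
        run_tokens = [f"{i.kind}:{i.target}" for i in run]
        if group is not None:
            # Completion order inside a parallel group is arbitrary, so sort it.
            run_tokens.sort()
        tokens.extend(run_tokens)
    payload = "|".join([FINGERPRINT_HASH_VERSION, *tokens])
    # SHA-256, truncated to 16 hex characters as described above.
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]
```

With this sketch, two traces that differ only in the completion order of steps within one parallel group produce the same fingerprint, while swapping two sequential steps produces a different one.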

structural

Order-insensitive. Hashes the sorted unique set of interaction types and targets. Useful for ReAct-style agents where the loop count varies but the tool set is stable.
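
A rough sketch of the structural mode, under stated assumptions: the structural_fingerprint function is hypothetical, and the version prefix, "|" separator, and 16-character truncation are borrowed from the sequence-mode example rather than confirmed for this mode. Because the input is reduced to a sorted set, both ordering and repetition drop out of the hash.

```python
import hashlib


def structural_fingerprint(steps: list[tuple[str, str]]) -> str:
    """Order-insensitive hash over the unique (type, target) pairs in a trace."""
    # A set removes repeated steps; sorting makes the payload deterministic.
    tokens = sorted({f"{kind}:{target}" for kind, target in steps})
    # Assumption: same "v2|..." payload shape and truncation as sequence mode.
    payload = "|".join(["v2", *tokens])
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]


# A ReAct-style loop that runs twice and one that runs three times
# use the same tool set, so they share a structural fingerprint.
two_loops = [("LLM_CALL", "gpt-4"), ("TOOL_CALL", "search")] * 2
three_loops = [("LLM_CALL", "gpt-4"), ("TOOL_CALL", "search")] * 3
same_intent = structural_fingerprint(two_loops) == structural_fingerprint(three_loops)
```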

Intents

An intent is a distinct behavioral pattern identified by its fingerprint hash. A single agent can have multiple intents — for example, a customer support agent might have different tool sequences for returns vs. technical issues vs. order status.

Baselines store statistics per intent, so comparison is always intent-scoped.
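
One way to picture intent-scoped baselines is a store keyed by fingerprint hash. The sketch below is purely illustrative: BaselineStore, its methods, and the latency metric are hypothetical stand-ins, since the actual statistics kept per intent are not listed here.

```python
from collections import defaultdict
from statistics import mean
from typing import Optional


class BaselineStore:
    """Hypothetical baseline store: one bucket of samples per intent."""

    def __init__(self) -> None:
        # Maps fingerprint hash -> list of observed samples for that intent.
        self._samples: dict[str, list[float]] = defaultdict(list)

    def record(self, fingerprint: str, latency_ms: float) -> None:
        """Add one observation to the intent identified by the fingerprint."""
        self._samples[fingerprint].append(latency_ms)

    def mean_latency(self, fingerprint: str) -> Optional[float]:
        """Return the mean for this intent only, or None if it is unknown."""
        samples = self._samples.get(fingerprint)
        return mean(samples) if samples else None
```

A slow "returns" trace is then compared only against other "returns" traces, never against the faster "order status" intent.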

Similarity scoring

When a fingerprint doesn't exactly match, Spooled computes detailed similarity across three dimensions:

Dimension       Method                       Description
tool_jaccard    Jaccard similarity           Overlap of tool sets (ignoring order)
sequence_lcs    Longest common subsequence   Preserved ordering of tool sequence
error_jaccard   Jaccard similarity           Overlap of error patterns

The overall similarity is a weighted composite of these three scores. If the composite is at least 75%, the trace is classified as a variant of the closest baseline intent. Below 75%, it is classified as a new intent.
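
The three dimensions and the 75% cutoff can be sketched as below. Treat this as an illustration under assumptions: the weights are invented for the example (the real weighting is not given here), the LCS score is normalized by the longer sequence, two empty sets are treated as identical, and all function names are hypothetical.

```python
def jaccard(a: set[str], b: set[str]) -> float:
    """Size of the intersection over size of the union."""
    if not a and not b:
        return 1.0  # assumption: two empty sets count as a perfect match
    return len(a & b) / len(a | b)


def lcs_ratio(a: list[str], b: list[str]) -> float:
    """Longest common subsequence length, normalized by the longer input."""
    if not a and not b:
        return 1.0
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            # Extend the diagonal on a match, otherwise carry the best neighbor.
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1] / max(len(a), len(b))


# Hypothetical weights; only the 0.75 threshold comes from the text above.
WEIGHTS = {"tool_jaccard": 0.4, "sequence_lcs": 0.4, "error_jaccard": 0.2}
VARIANT_THRESHOLD = 0.75


def classify(trace_tools: list[str], trace_errors: set[str],
             base_tools: list[str], base_errors: set[str]) -> tuple[float, str]:
    """Score a trace against one baseline intent and label it."""
    scores = {
        "tool_jaccard": jaccard(set(trace_tools), set(base_tools)),
        "sequence_lcs": lcs_ratio(trace_tools, base_tools),
        "error_jaccard": jaccard(trace_errors, base_errors),
    }
    overall = sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)
    return overall, "variant" if overall >= VARIANT_THRESHOLD else "new"
```

In practice the trace would be scored against every baseline intent and the highest composite would decide which intent is "closest".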

What fingerprinting catches

  • Tools added or removed from the sequence
  • Tool order changes (in sequence mode)
  • Model swaps (e.g., gpt-4 → gpt-4o-mini)
  • Pipeline steps added or removed
  • Extra LLM calls or tool calls

What it doesn't catch

  • Semantic quality of outputs (use evals for that)
  • Model refusals (requires output content analysis)
  • Retrieval relevance (see retrieval_regression signal)
  • Prompt injection (out of scope — use guardrails)

Fingerprinting is complementary to output evals. It catches structural regressions that evals miss (and vice versa).