[Mitigation] Content provenance taint graph for prompt injection defense #2

@zeroaltitude

Proposal

A content provenance tracking system ("taint graph") that traces the origin and trust level of every piece of content flowing through an agent turn. Instead of relying solely on XML markers (<<<EXTERNAL_UNTRUSTED_CONTENT>>>) that can be escaped or stripped, this approach tracks data flow structurally through a directed acyclic graph (DAG) and enforces policies based on content lineage.

As a side benefit, the graph can be inspected and visualized, which makes each turn easier to audit.

The Problem It Solves

OpenClaw currently wraps external content with XML markers and a security notice. This is a good first step, but has structural limitations:

  1. Markers can be escaped (T-EVADE-002) — if an attacker crafts content that breaks out of the XML wrapper, the content becomes indistinguishable from trusted content
  2. All content has equal standing once in the context window — there's no way to ask "did this turn's input originate from an external source?" at tool-call time
  3. No policy enforcement — even when content IS marked as external, the agent can still execute shell commands, send messages, or fetch URLs based on it
  4. Multi-hop taint is invisible — if external content gets stored in memory (T-PERSIST-005) and retrieved later, the external origin is lost

How It Works

Trust Taxonomy (6 levels)

| Level | Source | Example |
| --- | --- | --- |
| system | Platform itself | System prompt, runtime config |
| owner | Verified owner messages | Direct messages from owner IDs |
| local | Local workspace files | TOOLS.md, MEMORY.md, local scripts |
| shared | Shared but semi-trusted | Vestige memories, shared skills |
| external | Outside content, potentially adversarial | web_fetch results, emails, calendar |
| untrusted | Likely adversarial | Non-owner channel messages, webhook payloads |

Defining how content gets assigned to these levels is in scope. One approach currently being explored is a trained classifier that outputs a score and a confidence interval.
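To make the taxonomy concrete, here is a minimal sketch of the six levels as an ordered enum. The `combine` rule (least trusted input wins), the `SOURCE_TRUST` mapping, and the fail-closed default in `classify` are assumptions for illustration, not part of the proof-of-concept.

```python
from enum import IntEnum


class Trust(IntEnum):
    """The six levels from the table above; a higher value means less trusted."""
    SYSTEM = 0
    OWNER = 1
    LOCAL = 2
    SHARED = 3
    EXTERNAL = 4
    UNTRUSTED = 5


def combine(*levels: Trust) -> Trust:
    """Return the least trusted level among the inputs (assumed rule)."""
    return max(levels, default=Trust.SYSTEM)


# Hypothetical mapping from a content source to its trust level.
SOURCE_TRUST = {
    "system_prompt": Trust.SYSTEM,
    "owner_message": Trust.OWNER,
    "workspace_file": Trust.LOCAL,
    "shared_memory": Trust.SHARED,
    "web_fetch": Trust.EXTERNAL,
    "webhook": Trust.UNTRUSTED,
}


def classify(source: str) -> Trust:
    """Look up a source; unknown sources get the least trusted level."""
    return SOURCE_TRUST.get(source, Trust.UNTRUSTED)
```

A static lookup like this would be the simplest classifier; the trained-classifier idea would replace `classify` while keeping the same ordering.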

Per-Turn Provenance DAG

Each agent turn builds a directed graph tracking how content flows:

user_message (owner) → llm_call → tool_call(web_fetch) → tool_result (external) → llm_call → tool_call(exec)
                                                                                              ↑
                                                                              taint: external → POLICY CHECK

When the agent tries to call exec after processing web_fetch results, the graph shows the turn is tainted by external content. A policy engine can then block, restrict, or require confirmation.
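The turn above could be modeled roughly as follows. This is a sketch, not the proof-of-concept code: the `Node` class is hypothetical, and it assumes that derived nodes (LLM calls, tool calls) introduce no content of their own, so they start at `system` and inherit taint from their parents.

```python
from dataclasses import dataclass, field

# Ordered from most to least trusted, as in the taxonomy above.
LEVELS = ["system", "owner", "local", "shared", "external", "untrusted"]


@dataclass
class Node:
    kind: str        # e.g. "user_message", "llm_call", "tool_call(exec)"
    own_level: str   # trust level of content introduced at this node
    parents: list = field(default_factory=list)

    def taint(self) -> str:
        """Least trusted level among this node and all of its ancestors."""
        worst = LEVELS.index(self.own_level)
        for parent in self.parents:
            worst = max(worst, LEVELS.index(parent.taint()))
        return LEVELS[worst]


# The turn from the diagram above, built edge by edge.
msg = Node("user_message", "owner")
llm1 = Node("llm_call", "system", [msg])
fetch = Node("tool_call(web_fetch)", "system", [llm1])
result = Node("tool_result", "external", [fetch])
llm2 = Node("llm_call", "system", [result])
exec_call = Node("tool_call(exec)", "system", [llm2])
```

Here `fetch.taint()` is `"owner"` while `exec_call.taint()` is `"external"`, which is the value the policy check would see. The recursive walk is fine for a per-turn graph; a real implementation would probably cache taint on each node.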

Policy Engine

Declarative policies that reference taint levels:

policies:
  - name: no-exec-when-external
    when:
      tool: exec
      taintLevel: [external, untrusted]
    action: deny
    message: "Shell execution blocked: turn contains external content"

  - name: no-send-when-untrusted
    when:
      tool: [message, sessions_send]
      taintLevel: [untrusted]
    action: deny

  - name: confirm-exec-when-shared
    when:
      tool: exec
      taintLevel: [shared]
    action: confirm  # user can !approve to override per-session

Four policy actions: allow, deny, restrict (limit tool arguments), and confirm (block until the user approves; an approval holds for the session).
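A minimal sketch of how such policies might be evaluated. The policies are the three from the YAML above, written as Python dicts with `tool` normalized to a list. First-match-wins, default-allow when nothing matches, and the `approved` set for session approvals are all assumptions; `restrict` is left out for brevity.

```python
POLICIES = [
    {"name": "no-exec-when-external", "tool": ["exec"],
     "taintLevel": ["external", "untrusted"], "action": "deny",
     "message": "Shell execution blocked: turn contains external content"},
    {"name": "no-send-when-untrusted", "tool": ["message", "sessions_send"],
     "taintLevel": ["untrusted"], "action": "deny"},
    {"name": "confirm-exec-when-shared", "tool": ["exec"],
     "taintLevel": ["shared"], "action": "confirm"},
]


def evaluate(tool: str, taint_level: str, approved: frozenset = frozenset()) -> dict:
    """Return the decision for a tool call given the turn's taint level.

    `approved` holds names of confirm-policies the user has already
    approved for this session (hypothetical mechanism).
    """
    for policy in POLICIES:
        if tool in policy["tool"] and taint_level in policy["taintLevel"]:
            action = policy["action"]
            if action == "confirm" and policy["name"] in approved:
                action = "allow"
            return {"action": action, "policy": policy["name"],
                    "message": policy.get("message")}
    # No policy matched: allow by default (assumption).
    return {"action": "allow", "policy": None, "message": None}
```

For example, `evaluate("exec", "external")` yields a `deny` with the configured message, while `evaluate("exec", "shared")` yields `confirm` until the policy name appears in `approved`.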

Threats Addressed

This mitigation directly addresses or significantly reduces risk for 8 threats in the current model, and partially addresses 4 more:

Directly Addresses

| Threat | How |
| --- | --- |
| T-EXEC-001 (Direct Prompt Injection) | External/untrusted messages get lower trust; policy restricts tool access on tainted turns |
| T-EXEC-002 (Indirect Prompt Injection) | web_fetch and email content is tagged external; taint propagates through the turn DAG |
| T-EXEC-004 (Exec Approval Bypass) | Policy enforcement is structural, not LLM-dependent: the model can't talk its way past a taint check |
| T-ACCESS-006 (Prompt Injection via Channel) | Non-owner channel messages get the untrusted trust level automatically |
| T-EXFIL-001 (Data Theft via web_fetch) | Policy blocks outbound requests on untrusted-tainted turns |
| T-EXFIL-002 (Unauthorized Message Sending) | Policy blocks message sending on untrusted-tainted turns |
| T-IMPACT-001 (Unauthorized Command Execution) | Core use case: exec is blocked when the turn is tainted by external/untrusted content |
| T-EVADE-002 (Content Wrapper Escape) | Taint tracking is structural (DAG edges), not text-based (XML markers), so escaping markers doesn't affect provenance |

Partially Addresses

| Threat | How |
| --- | --- |
| T-PERSIST-005 (Memory Poisoning) | Taint metadata could persist with memories, flagging externally-originated content on retrieval |
| T-EVADE-001 (Moderation Bypass) | Structural tracking doesn't depend on regex pattern matching |
| T-EVADE-004 (Staged Payload Delivery) | Turn-level recursion tracking plus cross-turn taint propagation could detect multi-step attacks |
| T-DISC-003 (System Prompt Extraction) | A before_response_emit hook could detect system prompt content in responses on tainted turns |
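For the memory poisoning case (T-PERSIST-005), persisting taint with memories might look like the sketch below. `MemoryStore` and `MemoryEntry` are hypothetical and not an existing OpenClaw API; the point is only that the origin level is stored with the text and re-applied to any later turn that reads it.

```python
from dataclasses import dataclass

# Ordered from most to least trusted, as in the taxonomy above.
LEVELS = ["system", "owner", "local", "shared", "external", "untrusted"]


@dataclass(frozen=True)
class MemoryEntry:
    text: str
    origin_level: str  # taint of the turn that wrote this memory


def least_trusted(*levels: str) -> str:
    """Pick the level that sits furthest down the LEVELS ordering."""
    return max(levels, key=LEVELS.index)


class MemoryStore:
    """Hypothetical store that keeps provenance next to each memory."""

    def __init__(self) -> None:
        self._entries = []

    def write(self, text: str, turn_taint: str) -> None:
        self._entries.append(MemoryEntry(text, turn_taint))

    def recall(self, current_taint: str) -> tuple:
        """Return all memory texts plus the turn taint after reading them."""
        levels = [entry.origin_level for entry in self._entries]
        texts = [entry.text for entry in self._entries]
        return texts, least_trusted(current_taint, *levels)
```

A memory written during an `external`-tainted turn raises a later `owner` turn to `external` as soon as it is recalled, so the multi-hop origin is no longer lost.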

Implementation Approach

This could be implemented as a plugin using OpenClaw's existing hook system, with a few additional hooks:

Existing hooks used: before_tool_call (policy enforcement), after_tool_call (taint propagation from tool results), message_received (initial trust classification), message_sending (outbound gate)

New hooks needed:

  • before_llm_call — full context visible, can see all sources feeding into the call
  • after_llm_call — filter/block tool calls before execution based on taint
  • context_assembled — source census before LLM call
  • before_response_emit — final gate on outbound responses
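To show how the hooks could divide the work, here is a language-neutral sketch in Python. The method names mirror the hook names above, but the signatures, the `TaintPlugin` class, and the `BLOCKED_WHEN_TAINTED` set are assumptions made for illustration; they do not describe OpenClaw's actual plugin API.

```python
from collections import Counter

# Ordered from most to least trusted, as in the taxonomy above.
LEVELS = ["system", "owner", "local", "shared", "external", "untrusted"]
BLOCKED_WHEN_TAINTED = {"exec"}  # assumed for this example


class TaintPlugin:
    """Hypothetical plugin that tracks one taint level per turn."""

    def __init__(self) -> None:
        self.turn_taint = "system"
        self.census = Counter()

    def _raise_taint(self, level: str) -> None:
        # Taint only ever moves toward less trusted within a turn.
        if LEVELS.index(level) > LEVELS.index(self.turn_taint):
            self.turn_taint = level

    # Existing hooks
    def message_received(self, sender_is_owner: bool) -> None:
        self._raise_taint("owner" if sender_is_owner else "untrusted")

    def after_tool_call(self, tool: str, result_level: str) -> None:
        self._raise_taint(result_level)

    def before_tool_call(self, tool: str) -> bool:
        """Return True if the call may proceed."""
        tainted = self.turn_taint in ("external", "untrusted")
        return not (tool in BLOCKED_WHEN_TAINTED and tainted)

    # Proposed hooks
    def context_assembled(self, source_levels: list) -> None:
        """Source census: count how many context items come from each level."""
        self.census = Counter(source_levels)
        for level in source_levels:
            self._raise_taint(level)

    def after_llm_call(self, requested_tools: list) -> list:
        """Drop requested tool calls that the current taint does not allow."""
        return [tool for tool in requested_tools if self.before_tool_call(tool)]
```

With an owner message followed by a `web_fetch` result reported as `external`, `after_llm_call(["exec", "web_fetch"])` would return only `["web_fetch"]`.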

We've built a proof-of-concept plugin implementing this approach. Happy to share the code or collaborate on upstreaming the hook extensions needed.

Why Plugin-Based

A plugin makes sense as a first implementation because it lets us experiment with ideas and test them empirically before they become part of the core. The ultimate goal is still to bring the best of these security ideas into the main code.

For early investigation:

  • No core data model changes required
  • Users can opt in/out and tune policies
  • Different security postures for different deployments (strict for production, audit-only for development)
  • Community can iterate on policies without core releases
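The difference between a strict and an audit-only posture could be as small as the sketch below. The mode names and the `enforce` function are hypothetical; the idea is that audit-only records what would have been blocked without changing behavior.

```python
MODES = ("strict", "audit-only")  # assumed mode names


def enforce(decision: str, mode: str, audit_log: list) -> str:
    """Apply a policy decision under a deployment mode.

    In "strict" mode the decision is returned unchanged. In "audit-only"
    mode every non-allow decision is logged and then downgraded to "allow".
    """
    if mode not in MODES:
        raise ValueError(f"unknown mode: {mode}")
    if decision == "allow":
        return "allow"
    audit_log.append(decision)  # record what the policy wanted to do
    return decision if mode == "strict" else "allow"
```

Running in audit-only mode first would give a record of which policies fire in normal use before anyone turns enforcement on.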

Related

  • Anthropic's Claude Code sandboxing takes a container-based approach (complementary)
  • PR #9271 (zero-trust secure gateway) handles credential isolation (complementary — taint graph handles content flow, secrets proxy handles credential access)
  • OWASP Agentic AI Top 10 identifies prompt injection and tool misuse as top risks
