Description
Proposal
A content provenance tracking system ("taint graph") that traces the origin and trust level of every piece of content flowing through an agent turn. Instead of relying solely on XML markers (<<<EXTERNAL_UNTRUSTED_CONTENT>>>) that can be escaped or stripped, this approach tracks data flow structurally through a directed acyclic graph (DAG) and enforces policies based on content lineage.
As a side benefit, being able to track and visualize the interactions in the graph increases auditability and visibility.
The Problem It Solves
OpenClaw currently wraps external content with XML markers and a security notice. This is a good first step, but has structural limitations:
- Markers can be escaped (T-EVADE-002) — if an attacker crafts content that breaks out of the XML wrapper, the content becomes indistinguishable from trusted content
- All content has equal standing once in the context window — there's no way to ask "did this turn's input originate from an external source?" at tool-call time
- No policy enforcement — even when content IS marked as external, the agent can still execute shell commands, send messages, or fetch URLs based on it
- Multi-hop taint is invisible — if external content gets stored in memory (T-PERSIST-005) and retrieved later, the external origin is lost
How It Works
Trust Taxonomy (6 levels)
| Level | Source | Example |
|---|---|---|
| system | Platform itself | System prompt, runtime config |
| owner | Verified owner messages | Direct messages from owner IDs |
| local | Local workspace files | TOOLS.md, MEMORY.md, local scripts |
| shared | Shared but semi-trusted | Vestige memories, shared skills |
| external | Outside content, potentially adversarial | web_fetch results, emails, calendar |
| untrusted | Likely adversarial | Non-owner channel messages, webhook payloads |
Defining the scheme that assigns these levels to content is in scope. One option currently being explored is a trained classifier that outputs a trust score with a confidence interval.
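To make the taxonomy concrete, here is a minimal sketch of the six levels as an ordered enum. The numeric ordering, the "least trusted input wins" combination rule, and the `SOURCE_TRUST` mapping are assumptions made for illustration, not part of the proposal or of OpenClaw.

```python
from enum import IntEnum


class Trust(IntEnum):
    """Trust levels from the table above, ordered most to least trusted."""
    SYSTEM = 0
    OWNER = 1
    LOCAL = 2
    SHARED = 3
    EXTERNAL = 4
    UNTRUSTED = 5


def combine(*levels: Trust) -> Trust:
    """Assumed rule: merged content is only as trusted as its least trusted input."""
    if not levels:
        raise ValueError("combine() needs at least one trust level")
    return max(levels)


# Hypothetical mapping from content source to trust level (illustration only).
SOURCE_TRUST = {
    "system_prompt": Trust.SYSTEM,
    "owner_message": Trust.OWNER,
    "workspace_file": Trust.LOCAL,
    "shared_skill": Trust.SHARED,
    "web_fetch": Trust.EXTERNAL,
    "webhook": Trust.UNTRUSTED,
}


def classify(source: str) -> Trust:
    """Unknown sources fall back to the least trusted level."""
    return SOURCE_TRUST.get(source, Trust.UNTRUSTED)
```

Failing closed on unknown sources is a design choice of this sketch; a classifier-based scheme would replace the static lookup in `classify`.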
Per-Turn Provenance DAG
Each agent turn builds a directed graph tracking how content flows:
user_message (owner) → llm_call → tool_call(web_fetch) → tool_result (external) → llm_call → tool_call(exec)
↑
taint: external → POLICY CHECK
When the agent tries to call exec after processing web_fetch results, the graph shows the turn is tainted by external content. A policy engine can then block, restrict, or require confirmation.
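The flow above can be sketched as a small graph where each node records its own trust level and its parents, and the taint of a node is the least trusted level among its ancestors. The class and method names here (`TurnGraph`, `add`, `taint`) are hypothetical, and treating `llm_call` and `tool_call` nodes as intrinsically `system`-level is an assumption of this sketch.

```python
from dataclasses import dataclass, field

# Trust levels ordered most to least trusted (assumed ordering).
LEVELS = ["system", "owner", "local", "shared", "external", "untrusted"]


@dataclass
class Node:
    node_id: str
    kind: str    # e.g. "user_message", "llm_call", "tool_result"
    trust: str   # intrinsic trust of this node's own content
    parents: list = field(default_factory=list)


class TurnGraph:
    """Hypothetical per-turn provenance graph; nodes are added in flow order."""

    def __init__(self):
        self.nodes = {}

    def add(self, node_id, kind, trust="system", parents=()):
        # Parents must already exist, which keeps the graph acyclic.
        for p in parents:
            if p not in self.nodes:
                raise KeyError(f"unknown parent: {p}")
        self.nodes[node_id] = Node(node_id, kind, trust, list(parents))
        return node_id

    def taint(self, node_id):
        """Least trusted level among the node and all of its ancestors."""
        worst, stack, seen = 0, [node_id], set()
        while stack:
            current = self.nodes[stack.pop()]
            if current.node_id in seen:
                continue
            seen.add(current.node_id)
            worst = max(worst, LEVELS.index(current.trust))
            stack.extend(current.parents)
        return LEVELS[worst]


g = TurnGraph()
g.add("msg", "user_message", trust="owner")
g.add("llm1", "llm_call", parents=["msg"])
g.add("fetch", "tool_call", parents=["llm1"])
g.add("result", "tool_result", trust="external", parents=["fetch"])
g.add("llm2", "llm_call", parents=["result"])
g.add("exec", "tool_call", parents=["llm2"])

print(g.taint("fetch"))  # owner
print(g.taint("exec"))   # external
```

Because taint is derived from edges rather than from text inside the content, stripping or escaping a marker in the fetched page does not change what `taint("exec")` returns.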
Policy Engine
Declarative policies that reference taint levels:
policies:
  - name: no-exec-when-external
    when:
      tool: exec
      taintLevel: [external, untrusted]
    action: deny
    message: "Shell execution blocked: turn contains external content"
  - name: no-send-when-untrusted
    when:
      tool: [message, sessions_send]
      taintLevel: [untrusted]
    action: deny
  - name: confirm-exec-when-shared
    when:
      tool: exec
      taintLevel: [shared]
    action: confirm # user can !approve to override per-session

Four policy modes: allow, deny, restrict (limit tool arguments), confirm (block until user approves per-session).
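As a rough illustration of how such rules might be evaluated, the sketch below mirrors the three example policies in Python. The `Policy` class and `evaluate` function are hypothetical, and two behaviours are assumptions the proposal does not specify: the first matching policy wins, and a call that matches no policy is allowed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Policy:
    name: str
    tools: tuple         # tool names the rule applies to
    taint_levels: tuple  # taint levels that trigger the rule
    action: str          # "allow", "deny", "restrict" or "confirm"
    message: str = ""


# Python mirror of the three example policies above.
POLICIES = [
    Policy("no-exec-when-external", ("exec",), ("external", "untrusted"), "deny",
           "Shell execution blocked: turn contains external content"),
    Policy("no-send-when-untrusted", ("message", "sessions_send"),
           ("untrusted",), "deny"),
    Policy("confirm-exec-when-shared", ("exec",), ("shared",), "confirm"),
]


def evaluate(tool, turn_taint, policies=POLICIES):
    """Return (action, policy_name) for the first matching policy.

    Assumptions: first match wins, and unmatched calls are allowed.
    """
    for policy in policies:
        if tool in policy.tools and turn_taint in policy.taint_levels:
            return policy.action, policy.name
    return "allow", None


print(evaluate("exec", "external"))  # ('deny', 'no-exec-when-external')
print(evaluate("exec", "owner"))     # ('allow', None)
```

Note that the check runs on the tool name and the turn's taint level only, so nothing the model writes in its output can change the decision.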
Threats Addressed
This mitigation directly addresses or significantly reduces risk for 8 threats in the current model, and partially addresses 4 more:
Directly Addresses
| Threat | How |
|---|---|
| T-EXEC-001 (Direct Prompt Injection) | External/untrusted messages get lower trust; policy restricts tool access on tainted turns |
| T-EXEC-002 (Indirect Prompt Injection) | web_fetch, email content tagged external; taint propagates through the turn DAG |
| T-EXEC-004 (Exec Approval Bypass) | Policy enforcement is structural, not LLM-dependent — the model can't talk its way past a taint check |
| T-ACCESS-006 (Prompt Injection via Channel) | Non-owner channel messages get untrusted trust level automatically |
| T-EXFIL-001 (Data Theft via web_fetch) | Policy blocks outbound requests on untrusted-tainted turns |
| T-EXFIL-002 (Unauthorized Message Sending) | Policy blocks message sending on untrusted-tainted turns |
| T-IMPACT-001 (Unauthorized Command Execution) | Core use case — exec blocked when turn tainted by external/untrusted content |
| T-EVADE-002 (Content Wrapper Escape) | Taint tracking is structural (DAG edges), not text-based (XML markers). Escaping markers doesn't affect provenance |
Partially Addresses
| Threat | How |
|---|---|
| T-PERSIST-005 (Memory Poisoning) | Taint metadata could persist with memories, flagging externally-originated content on retrieval |
| T-EVADE-001 (Moderation Bypass) | Structural tracking doesn't depend on regex pattern matching |
| T-EVADE-004 (Staged Payload Delivery) | Turn-level recursion tracking + cross-turn taint propagation could detect multi-step attacks |
| T-DISC-003 (System Prompt Extraction) | before_response_emit hook could detect system prompt content in responses on tainted turns |
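For the memory-poisoning row, one possible shape for persisting taint with memories is sketched below. The record layout, the function names, and the default `local` trust for memory reads are all assumptions for illustration; the proposal only says that taint metadata could be stored alongside memories.

```python
import json

# Trust levels ordered most to least trusted (assumed ordering).
LEVELS = ["system", "owner", "local", "shared", "external", "untrusted"]


def store_memory(store, key, text, origin_trust):
    """Persist the memory together with the trust level of its origin."""
    if origin_trust not in LEVELS:
        raise ValueError(f"unknown trust level: {origin_trust}")
    store[key] = json.dumps({"text": text, "origin_trust": origin_trust})


def recall_memory(store, key, reader_trust="local"):
    """Return (text, trust); the result is never more trusted than its origin."""
    record = json.loads(store[key])
    worst = max(LEVELS.index(record["origin_trust"]), LEVELS.index(reader_trust))
    return record["text"], LEVELS[worst]


memories = {}  # stand-in for a real memory backend
store_memory(memories, "note-1", "Summary of a fetched page", "external")
store_memory(memories, "note-2", "Owner's preferred editor", "owner")

print(recall_memory(memories, "note-1")[1])  # external
print(recall_memory(memories, "note-2")[1])  # local
```

With this layout, a later turn that retrieves `note-1` would start out tainted as external even though the read itself is a local file access.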
Implementation Approach
This could be implemented as a plugin using OpenClaw's existing hook system, with a few additional hooks:
Existing hooks used: before_tool_call (policy enforcement), after_tool_call (taint propagation from tool results), message_received (initial trust classification), message_sending (outbound gate)
New hooks needed:
- before_llm_call: full context visible, can see all sources feeding into the call
- after_llm_call: filter/block tool calls before execution based on taint
- context_assembled: source census before LLM call
- before_response_emit: final gate on outbound responses
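A rough sketch of how the existing hooks could cooperate is shown below. It is not OpenClaw's plugin API: the `TaintPlugin` class, its method signatures, and the `TOOL_RESULT_TRUST` and `BLOCKED` tables are hypothetical, and only the hook names are taken from the list of existing hooks.

```python
# Trust levels ordered most to least trusted (assumed ordering).
LEVELS = ["system", "owner", "local", "shared", "external", "untrusted"]

# Hypothetical classification of tool results; not taken from OpenClaw.
TOOL_RESULT_TRUST = {"web_fetch": "external", "read_file": "local"}

# Hypothetical deny table: (tool, turn taint) pairs that must be blocked.
BLOCKED = {("exec", "external"), ("exec", "untrusted")}


class TaintPlugin:
    """Tracks the least trusted level seen in the current turn."""

    def __init__(self):
        self.turn_taint = "system"

    def _raise_taint(self, level):
        # Taint only ever moves towards less trusted levels within a turn.
        if LEVELS.index(level) > LEVELS.index(self.turn_taint):
            self.turn_taint = level

    def message_received(self, sender_is_owner):
        # Initial trust classification of the incoming message.
        self._raise_taint("owner" if sender_is_owner else "untrusted")

    def after_tool_call(self, tool):
        # Taint propagation: unknown tools are assumed to return external content.
        self._raise_taint(TOOL_RESULT_TRUST.get(tool, "external"))

    def before_tool_call(self, tool):
        # Policy enforcement: returns False when the call must be blocked.
        return (tool, self.turn_taint) not in BLOCKED


plugin = TaintPlugin()
plugin.message_received(sender_is_owner=True)
first = plugin.before_tool_call("exec")    # no external content yet
plugin.after_tool_call("web_fetch")
second = plugin.before_tool_call("exec")   # turn now tainted as external
print(first, second)
```

This sketch keeps a single taint level per turn for brevity; a full implementation would consult the provenance graph instead.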
We've built a proof-of-concept plugin implementing this approach. Happy to share the code or collaborate on upstreaming the hook extensions needed.
Why Plugin-Based
A plugin is a sensible first implementation because it lets us experiment with ideas and test them empirically before they become part of the main code. Moving the best-performing security ideas into the main code remains the ultimate goal.
For early investigation:
- No core data model changes required
- Users can opt in/out and tune policies
- Different security postures for different deployments (strict for production, audit-only for development)
- Community can iterate on policies without core releases
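The difference between a strict and an audit-only posture could look roughly like this. The `enforce` function and the mode names `strict` and `audit` are assumptions for illustration; the proposal does not define how postures are configured.

```python
import logging

logger = logging.getLogger("taint-policy")


def enforce(decision, mode="strict"):
    """Map a policy decision to what actually happens in a given posture.

    decision: "allow", "deny" or "confirm", as produced by a policy engine.
    mode: "strict" enforces decisions; "audit" only records them.
    Returns True if the tool call may proceed.
    """
    if mode not in ("strict", "audit"):
        raise ValueError(f"unknown mode: {mode}")
    if decision == "allow":
        return True
    if mode == "audit":
        # Audit-only: record what would have happened, then let the call through.
        logger.warning("audit-only: would have applied %r", decision)
        return True
    # Strict: deny blocks outright; confirm blocks until approved elsewhere.
    return False


print(enforce("deny", mode="strict"))  # False
print(enforce("deny", mode="audit"))   # True
```

Running in audit mode first gives a log of the decisions a policy set would have made, which can be reviewed before switching a deployment to strict.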
Related
- Anthropic's Claude Code sandboxing takes a container-based approach (complementary)
- PR #9271 (zero-trust secure gateway) handles credential isolation (complementary — taint graph handles content flow, secrets proxy handles credential access)
- OWASP Agentic AI Top 10 identifies prompt injection and tool misuse as top risks