[Mitigation] Content provenance taint graph for prompt injection defense

## Proposal

A content provenance tracking system ("taint graph") that traces the origin and trust level of every piece of content flowing through an agent turn. Instead of relying solely on XML markers (`<<<EXTERNAL_UNTRUSTED_CONTENT>>>`) that can be escaped or stripped, this approach tracks data flow structurally through a directed acyclic graph (DAG) and enforces policies based on content lineage.

As a side benefit, being able to track and visualize the interactions in the graph increases auditability and visibility.

## The Problem It Solves

OpenClaw currently wraps external content with XML markers and a security notice. This is a good first step, but has structural limitations:

1. **Markers can be escaped** (T-EVADE-002) — if an attacker crafts content that breaks out of the XML wrapper, the content becomes indistinguishable from trusted content
2. **All content has equal standing once in the context window** — there's no way to ask "did this turn's input originate from an external source?" at tool-call time
3. **No policy enforcement** — even when content IS marked as external, the agent can still execute shell commands, send messages, or fetch URLs based on it
4. **Multi-hop taint is invisible** — if external content gets stored in memory (T-PERSIST-005) and retrieved later, the external origin is lost

## How It Works

### Trust Taxonomy (6 levels)

| Level | Source | Example |
|-------|--------|---------|
| `system` | Platform itself | System prompt, runtime config |
| `owner` | Verified owner messages | Direct messages from owner IDs |
| `local` | Local workspace files | TOOLS.md, MEMORY.md, local scripts |
| `shared` | Shared but semi-trusted | Vestige memories, shared skills |
| `external` | Outside content, potentially adversarial | web_fetch results, emails, calendar |
| `untrusted` | Likely adversarial | Non-owner channel messages, webhook payloads |

Creating a scheme for classifying these is in-scope (e.g. current exploration: use a trained classifier to give a score and confidence interval).

### Per-Turn Provenance DAG

Each agent turn builds a directed graph tracking how content flows:

```
user_message (owner) → llm_call → tool_call(web_fetch) → tool_result (external) → llm_call → tool_call(exec)
                                                                                              ↑
                                                                              taint: external → POLICY CHECK
```

When the agent tries to call `exec` after processing `web_fetch` results, the graph shows the turn is tainted by `external` content. A policy engine can then block, restrict, or require confirmation.

### Policy Engine

Declarative policies that reference taint levels:

```yaml
policies:
  - name: no-exec-when-external
    when:
      tool: exec
      taintLevel: [external, untrusted]
    action: deny
    message: "Shell execution blocked: turn contains external content"

  - name: no-send-when-untrusted  
    when:
      tool: [message, sessions_send]
      taintLevel: [untrusted]
    action: deny

  - name: confirm-exec-when-shared
    when:
      tool: exec
      taintLevel: [shared]
    action: confirm  # user can !approve to override per-session
```

Four policy modes: `allow`, `deny`, `restrict` (limit tool arguments), `confirm` (block until user approves per-session).

## Threats Addressed

This mitigation directly addresses or significantly reduces risk for 8 threats in the current model, and partially addresses 4 more:

### Directly Addresses

| Threat | How |
|--------|-----|
| **T-EXEC-001** (Direct Prompt Injection) | External/untrusted messages get lower trust; policy restricts tool access on tainted turns |
| **T-EXEC-002** (Indirect Prompt Injection) | web_fetch, email content tagged `external`; taint propagates through the turn DAG |
| **T-EXEC-004** (Exec Approval Bypass) | Policy enforcement is structural, not LLM-dependent — the model can't talk its way past a taint check |
| **T-ACCESS-006** (Prompt Injection via Channel) | Non-owner channel messages get `untrusted` trust level automatically |
| **T-EXFIL-001** (Data Theft via web_fetch) | Policy blocks outbound requests on untrusted-tainted turns |
| **T-EXFIL-002** (Unauthorized Message Sending) | Policy blocks message sending on untrusted-tainted turns |
| **T-IMPACT-001** (Unauthorized Command Execution) | Core use case — exec blocked when turn tainted by external/untrusted content |
| **T-EVADE-002** (Content Wrapper Escape) | Taint tracking is structural (DAG edges), not text-based (XML markers). Escaping markers doesn't affect provenance |

### Partially Addresses

| Threat | How |
|--------|-----|
| **T-PERSIST-005** (Memory Poisoning) | Taint metadata could persist with memories, flagging externally-originated content on retrieval |
| **T-EVADE-001** (Moderation Bypass) | Structural tracking doesn't depend on regex pattern matching |
| **T-EVADE-004** (Staged Payload Delivery) | Turn-level recursion tracking + cross-turn taint propagation could detect multi-step attacks |
| **T-DISC-003** (System Prompt Extraction) | `before_response_emit` hook could detect system prompt content in responses on tainted turns |

## Implementation Approach

This could be implemented as a **plugin** using OpenClaw's existing hook system, with a few additional hooks:

**Existing hooks used:** `before_tool_call` (policy enforcement), `after_tool_call` (taint propagation from tool results), `message_received` (initial trust classification), `message_sending` (outbound gate)

**New hooks needed:**
- `before_llm_call` — full context visible, can see all sources feeding into the call
- `after_llm_call` — filter/block tool calls before execution based on taint
- `context_assembled` — source census before LLM call
- `before_response_emit` — final gate on outbound responses

We've built a proof-of-concept plugin implementing this approach. Happy to share the code or collaborate on upstreaming the hook extensions needed.

## Why Plugin-Based

A plugin approach as a first implementation makes sense because it allows us to experiment with ideas and test them empirically before we make them part of the main code -- and making best-of-breed security ideas part of the main code is the ultimate goal.

For early investigation:
- No core data model changes required
- Users can opt in/out and tune policies
- Different security postures for different deployments (strict for production, audit-only for development)
- Community can iterate on policies without core releases

## Related

- Anthropic's [Claude Code sandboxing](https://www.anthropic.com/engineering/claude-code-sandboxing) takes a container-based approach (complementary)
- PR #9271 (zero-trust secure gateway) handles credential isolation (complementary — taint graph handles content flow, secrets proxy handles credential access)
- [OWASP Agentic AI Top 10](https://owasp.org/www-project-agentic-ai-top-10/) identifies prompt injection and tool misuse as top risks

Threat	How
T-EXEC-001 (Direct Prompt Injection)	External/untrusted messages get lower trust; policy restricts tool access on tainted turns
T-EXEC-002 (Indirect Prompt Injection)	web_fetch, email content tagged `external`; taint propagates through the turn DAG
T-EXEC-004 (Exec Approval Bypass)	Policy enforcement is structural, not LLM-dependent — the model can't talk its way past a taint check
T-ACCESS-006 (Prompt Injection via Channel)	Non-owner channel messages get `untrusted` trust level automatically
T-EXFIL-001 (Data Theft via web_fetch)	Policy blocks outbound requests on untrusted-tainted turns
T-EXFIL-002 (Unauthorized Message Sending)	Policy blocks message sending on untrusted-tainted turns
T-IMPACT-001 (Unauthorized Command Execution)	Core use case — exec blocked when turn tainted by external/untrusted content
T-EVADE-002 (Content Wrapper Escape)	Taint tracking is structural (DAG edges), not text-based (XML markers). Escaping markers doesn't affect provenance

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[Mitigation] Content provenance taint graph for prompt injection defense #2

Proposal

The Problem It Solves

How It Works

Trust Taxonomy (6 levels)

Per-Turn Provenance DAG

Policy Engine

Threats Addressed

Directly Addresses

Partially Addresses

Implementation Approach

Why Plugin-Based

Related

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Level	Source	Example
`system`	Platform itself	System prompt, runtime config
`owner`	Verified owner messages	Direct messages from owner IDs
`local`	Local workspace files	TOOLS.md, MEMORY.md, local scripts
`shared`	Shared but semi-trusted	Vestige memories, shared skills
`external`	Outside content, potentially adversarial	web_fetch results, emails, calendar
`untrusted`	Likely adversarial	Non-owner channel messages, webhook payloads

Threat	How
T-PERSIST-005 (Memory Poisoning)	Taint metadata could persist with memories, flagging externally-originated content on retrieval
T-EVADE-001 (Moderation Bypass)	Structural tracking doesn't depend on regex pattern matching
T-EVADE-004 (Staged Payload Delivery)	Turn-level recursion tracking + cross-turn taint propagation could detect multi-step attacks
T-DISC-003 (System Prompt Extraction)	`before_response_emit` hook could detect system prompt content in responses on tainted turns

Uh oh!

[Mitigation] Content provenance taint graph for prompt injection defense #2

Description

Proposal

The Problem It Solves

How It Works

Trust Taxonomy (6 levels)

Per-Turn Provenance DAG

Policy Engine

Threats Addressed

Directly Addresses

Partially Addresses

Implementation Approach

Why Plugin-Based

Related

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions