18 changes: 18 additions & 0 deletions .claude/plugins/.claude-plugin/marketplace.json
@@ -0,0 +1,18 @@
{
"name": "entire-dev-tools",
"owner": {
"name": "Entire Team"
},
"plugins": [
{
"name": "e2e",
"source": "./e2e",
"description": "E2E test triage, debugging, and fix implementation toolkit"
},
{
"name": "agent-integration",
"source": "./agent-integration",
"description": "Multi-phase toolkit for integrating a new AI coding agent with the Entire CLI"
}
]
}
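The settings change later in this PR enables plugins with keys such as `e2e@entire-dev-tools`. A minimal sketch of how those keys relate to this manifest follows. The `plugin_ids` helper is hypothetical, and the `plugin@marketplace` form is inferred from the keys in this PR's `.claude/settings.json`, not from a documented API.

```python
import json

# Trimmed copy of the manifest above (descriptions omitted for brevity).
MANIFEST = """
{
  "name": "entire-dev-tools",
  "owner": {"name": "Entire Team"},
  "plugins": [
    {"name": "e2e", "source": "./e2e"},
    {"name": "agent-integration", "source": "./agent-integration"}
  ]
}
"""


def plugin_ids(manifest_text: str) -> list[str]:
    """Return "<plugin>@<marketplace>" for every plugin in the manifest."""
    manifest = json.loads(manifest_text)
    marketplace = manifest["name"]
    return [f"{plugin['name']}@{marketplace}" for plugin in manifest["plugins"]]


if __name__ == "__main__":
    for plugin_id in plugin_ids(MANIFEST):
        print(plugin_id)
```

A check like this could catch a renamed plugin whose `enabledPlugins` key was not updated to match.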
5 changes: 5 additions & 0 deletions .claude/plugins/e2e/.claude-plugin/plugin.json
@@ -0,0 +1,5 @@
{
"name": "e2e",
"description": "E2E test triage, debugging, and fix implementation toolkit",
"version": "1.0.0"
}
15 changes: 15 additions & 0 deletions .claude/plugins/e2e/README.md
@@ -0,0 +1,15 @@
# E2E Plugin

Local plugin providing individual commands for E2E test triage and debugging.

## Commands

| Command | Description |
|---------|-------------|
| `/e2e:triage-ci` | Triage E2E failures via local reruns or CI artifacts, classify each as flaky, real-bug, or test-bug, and present a findings report |
| `/e2e:debug` | Deep-dive artifact analysis for root cause diagnosis |
| `/e2e:implement` | Apply fixes from triage/debug findings, verify with E2E tests |

## Related

- Orchestrator skill: `.claude/skills/e2e/SKILL.md` (`/e2e` — runs triage-ci then implement)
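
The command names in the table follow the layout added in this PR: `/e2e:<command>` is backed by `.claude/plugins/e2e/commands/<command>.md`, which in turn delegates to `.claude/skills/e2e/<command>.md`. A minimal sketch of that mapping, assuming the pattern holds for every command (the `resolve_command` helper is hypothetical and exists only for illustration):

```python
from pathlib import PurePosixPath

PLUGINS_ROOT = PurePosixPath(".claude/plugins")
SKILLS_ROOT = PurePosixPath(".claude/skills")


def resolve_command(slash_command: str) -> tuple[PurePosixPath, PurePosixPath]:
    """Map "/plugin:command" to (command file, skill procedure file)."""
    if not slash_command.startswith("/") or ":" not in slash_command:
        raise ValueError(f"expected /plugin:command, got {slash_command!r}")
    plugin, command = slash_command[1:].split(":", 1)
    command_file = PLUGINS_ROOT / plugin / "commands" / f"{command}.md"
    skill_file = SKILLS_ROOT / plugin / f"{command}.md"
    return command_file, skill_file


if __name__ == "__main__":
    for name in ("/e2e:triage-ci", "/e2e:debug", "/e2e:implement"):
        command_file, skill_file = resolve_command(name)
        print(f"{name}: {command_file} -> {skill_file}")
```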
7 changes: 7 additions & 0 deletions .claude/plugins/e2e/commands/debug.md
@@ -0,0 +1,7 @@
---
description: "Deep-dive artifact analysis for diagnosing E2E test failures"
---

# Debug Command

Read and follow the full procedure from `.claude/skills/e2e/debug.md`.
7 changes: 7 additions & 0 deletions .claude/plugins/e2e/commands/implement.md
@@ -0,0 +1,7 @@
---
description: "Apply fixes from triage/debug findings, verify with scoped E2E tests"
---

# Implement Command

Read and follow the full procedure from `.claude/skills/e2e/implement.md`.
7 changes: 7 additions & 0 deletions .claude/plugins/e2e/commands/triage-ci.md
@@ -0,0 +1,7 @@
---
description: "Triage E2E failures via local reruns or CI artifacts, classify flaky vs real-bug, present findings report"
---

# Triage CI Command

Read and follow the full procedure from `.claude/skills/e2e/triage-ci.md`.
12 changes: 12 additions & 0 deletions .claude/settings.json
@@ -1,4 +1,16 @@
{
"extraKnownMarketplaces": {
"entire-dev-tools": {
"source": {
"source": "directory",
"path": "./.claude/plugins"
}
}
},
"enabledPlugins": {
"e2e@entire-dev-tools": true,
"agent-integration@entire-dev-tools": true
},
"hooks": {
"SessionStart": [
{
2 changes: 1 addition & 1 deletion .claude/skills/agent-integration/SKILL.md
@@ -52,7 +52,7 @@ This skill enforces strict E2E-first test-driven development. The rules:
3. **Unit tests are written last.** After all E2E tiers pass (Step 14), you write unit tests using real data collected from E2E runs as golden fixtures.
4. **If you didn't watch it fail, you don't know if it tests the right thing.** Never write a test you haven't seen fail first.
5. **Minimum viable fix.** At each E2E failure, implement only the code needed to fix that failure. Don't anticipate future tiers.
6. **`/debug-e2e` is your debugger.** When an E2E test fails, use the artifact directory with `/debug-e2e` before guessing at fixes.
6. **`/e2e:debug` is your debugger.** When an E2E test fails, use the artifact directory with `/e2e:debug` before guessing at fixes.

## Pipeline

20 changes: 10 additions & 10 deletions .claude/skills/agent-integration/implementer.md
@@ -13,7 +13,7 @@ Build the agent Go package using strict E2E-first TDD. Unit tests are written ON
1. **E2E tests are the spec.** The existing `ForEachAgent` test scenarios define "working". You implement until they pass.
2. **Watch it fail first.** Every E2E tier starts by running the test and observing the failure. If you haven't seen the failure, you don't understand what needs fixing.
3. **Minimum viable fix.** At each failure, implement only the code needed to make that specific assertion pass. Don't anticipate future tiers.
4. **`/debug-e2e` is your debugger.** When an E2E test fails, use the artifact directory with `/debug-e2e` before guessing at fixes.
4. **`/e2e:debug` is your debugger.** When an E2E test fails, use the artifact directory with `/e2e:debug` before guessing at fixes.
5. **No unit tests during Steps 4-13.** Unit tests are written in Step 14 after all E2E tiers pass, using real data from E2E runs as golden fixtures.
6. **Format and lint, don't unit test.** Between E2E tiers, run `mise run fmt && mise run lint` to keep code clean. Any earlier `mise run test` invocations (e.g., in Step 3) are strictly compile-only sanity checks — no `mise run test` between E2E tiers (Steps 4-13).
7. **If you didn't watch it fail, you don't know if it tests the right thing.**
@@ -83,7 +83,7 @@ This test requires no agent prompts — it only exercises hooks, so it's the fas

1. Run: `mise run test:e2e --agent $AGENT_SLUG TestHumanOnlyChangesAndCommits`
2. **Watch it fail** — read the failure output carefully
3. If there are artifact dirs, use `/debug-e2e {artifact-dir}` to understand what happened
3. If there are artifact dirs, use `/e2e:debug {artifact-dir}` to understand what happened
4. Implement the minimum code to fix the first failure
5. Repeat until the test passes

@@ -105,7 +105,7 @@ The foundational test. This exercises the full agent lifecycle: start session

1. Run: `mise run test:e2e --agent $AGENT_SLUG TestSingleSessionManualCommit`
2. **Watch it fail** — read the failure output carefully
3. Use `/debug-e2e {artifact-dir}` to understand what happened
3. Use `/e2e:debug {artifact-dir}` to understand what happened
4. Implement the minimum code to fix the first failure
5. Repeat until the test passes

@@ -127,7 +127,7 @@ Validates transcript quality: JSONL validity, content hash correctness, prompt e

1. Run: `mise run test:e2e --agent $AGENT_SLUG TestCheckpointMetadataDeepValidation`
2. **Watch it fail** — this test often exposes subtle transcript formatting bugs
3. Use `/debug-e2e {artifact-dir}` on any failures
3. Use `/e2e:debug {artifact-dir}` on any failures
4. Fix and repeat

Run: `mise run fmt && mise run lint`
@@ -146,7 +146,7 @@ Agent creates files and commits them within a single prompt turn. Tests the in-t
**Cycle:**

1. Run: `mise run test:e2e --agent $AGENT_SLUG TestSingleSessionAgentCommitInTurn`
2. **Watch it fail** — use `/debug-e2e {artifact-dir}` on failures
2. **Watch it fail** — use `/e2e:debug {artifact-dir}` on failures
3. Fix and repeat — if the agent doesn't support committing, skip this test

Run: `mise run fmt && mise run lint`
@@ -164,7 +164,7 @@ Run these tests to validate multi-session behavior:
**Cycle (for each test):**

1. Run: `mise run test:e2e --agent $AGENT_SLUG TestMultiSessionManualCommit`
2. **Watch it fail** — use `/debug-e2e {artifact-dir}` on failures
2. **Watch it fail** — use `/e2e:debug {artifact-dir}` on failures
3. Fix and repeat
4. Move to next test

@@ -183,7 +183,7 @@ Run these tests for file operation correctness:
- `TestDeletedFilesCommitDeletion` — Agent deletes a file, user commits the deletion
- `TestMixedNewAndModifiedFiles` — Agent both creates and modifies files

**Cycle:** Same as above — run each test, **watch it fail**, use `/debug-e2e` on failures, fix, repeat.
**Cycle:** Same as above — run each test, **watch it fail**, use `/e2e:debug` on failures, fix, repeat.

Run: `mise run fmt && mise run lint`

@@ -215,7 +215,7 @@ Run these if the agent supports interactive multi-step sessions:
- `TestRewindAfterCommit` — Rewind to a checkpoint after committing
- `TestRewindMultipleFiles` — Rewind with multiple files changed

**Cycle:** Same pattern — run, **watch it fail**, `/debug-e2e` on failures, fix, repeat.
**Cycle:** Same pattern — run, **watch it fail**, `/e2e:debug` on failures, fix, repeat.

Run: `mise run fmt && mise run lint`

@@ -256,7 +256,7 @@ mise run test:e2e --agent $AGENT_SLUG TestFailingTestName

If a test passes when run individually but fails in the full suite, it's a flaky failure — not a real error. Only investigate failures that reproduce consistently when run in isolation.
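
The rule above can be written as a small decision function. This is a sketch only: the `classify` helper and its labels are hypothetical, the `intermittent` bucket for mixed isolated results is an added assumption, and it presumes you have already recorded whether the test failed in the full suite and how each isolated rerun went.

```python
def classify(failed_in_suite: bool, isolated_passes: list[bool]) -> str:
    """Classify a test from its suite result and its isolated reruns.

    isolated_passes holds one entry per isolated rerun: True means that
    rerun passed, False means it failed.
    """
    if not failed_in_suite:
        return "passing"
    if not isolated_passes:
        return "unknown"  # no isolated rerun yet, so nothing to conclude
    if all(isolated_passes):
        return "flaky"  # fails only inside the full suite
    if not any(isolated_passes):
        return "real"  # reproduces consistently in isolation
    return "intermittent"  # mixed results: rerun more before deciding


if __name__ == "__main__":
    print(classify(True, [True, True]))
    print(classify(True, [False, False]))
```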

Fix any real failures before proceeding — the same cycle applies: read the failure, use `/debug-e2e {artifact-dir}`, implement the minimum fix, re-run.
Fix any real failures before proceeding — the same cycle applies: read the failure, use `/e2e:debug {artifact-dir}`, implement the minimum fix, re-run.

All E2E tests must pass before writing unit tests.

@@ -321,7 +321,7 @@ At every E2E failure, follow this protocol:

1. **Read the test output** — the assertion message often tells you exactly what's wrong
2. **Find the artifact directory** — E2E tests save artifacts (logs, transcripts, git state) to a temp dir printed in the output
3. **Run `/debug-e2e {artifact-dir}`** — this skill analyzes artifacts and diagnoses the root cause
3. **Run `/e2e:debug {artifact-dir}`** — this skill analyzes artifacts and diagnoses the root cause
4. **Implement the minimum fix** — don't over-engineer; fix only what the test demands
5. **Re-run the failing test** — not the whole suite, just the one test
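
Step 5 reruns one test rather than the whole suite. A minimal sketch of building that command from the `mise run test:e2e --agent $AGENT_SLUG TestName` form used throughout this file; the `rerun_command` helper and the sample agent slug `my-agent` are hypothetical.

```python
import shlex


def rerun_command(agent_slug: str, test_name: str) -> list[str]:
    """Build the argv for rerunning a single E2E test for one agent."""
    if not test_name.startswith("Test"):
        raise ValueError(f"not a Go test name: {test_name!r}")
    return ["mise", "run", "test:e2e", "--agent", agent_slug, test_name]


if __name__ == "__main__":
    argv = rerun_command("my-agent", "TestSingleSessionManualCommit")
    # Print a shell-quoted form that can be pasted into a terminal.
    print(shlex.join(argv))
```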

2 changes: 1 addition & 1 deletion .claude/skills/agent-integration/test-writer.md
@@ -199,7 +199,7 @@ Use `/commit` to commit all files.
- **Interactive tests**: Use `s.StartSession`, `s.Send`, `s.WaitFor` — tmux pane is auto-captured in artifacts
- **Run commands**: `mise run test:e2e --agent ${slug} TestName` — see `e2e/README.md` for all options
- **E2E tests are run during the implement phase**: This phase only creates the runner. The implement phase runs E2E tests at each tier to drive development.
- **Debugging failures**: If tests fail during the implement phase, use `/debug-e2e` with the artifact directory to diagnose CLI-level issues (hooks, checkpoints, session phases, attribution)
- **Debugging failures**: If tests fail during the implement phase, use `/e2e:debug` with the artifact directory to diagnose CLI-level issues (hooks, checkpoints, session phases, attribution)

## Output

32 changes: 32 additions & 0 deletions .claude/skills/e2e/SKILL.md
@@ -0,0 +1,32 @@
---
name: e2e
description: >
Orchestrate E2E test triage and fix implementation: runs triage-ci then implement sequentially.
Accepts test names, --agent, artifact path, or CI run reference.
For individual phases, use /e2e:triage-ci, /e2e:debug, or /e2e:implement.
Use when the user says "triage e2e", "fix e2e failures", or wants the full triage-to-fix pipeline.
---

# E2E Triage & Fix — Full Pipeline

Run triage-ci then implement sequentially. Parameters are collected once and reused across both phases.

## Parameters

The user provides one or more of:
- **Test name(s)** -- e.g., `TestInteractiveMultiStep`
- **`--agent <agent>`** -- optional, defaults to all agents that previously failed
- **A local artifact path** -- skip straight to analysis of existing artifacts
- **CI run reference** -- `latest`, a run ID, or a run URL
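
To illustrate how these four kinds of input might be told apart, here is a rough sketch. The `parse_parameters` helper is hypothetical, and its heuristics are assumptions rather than the skill's actual behavior: run IDs are taken to be numeric, run URLs to start with `http://` or `https://`, test names to match `Test...`, and anything else containing a slash to be a local artifact path.

```python
import re


def parse_parameters(args: list[str]) -> dict:
    """Sort raw arguments into the four parameter kinds listed above."""
    params = {"tests": [], "agent": None, "artifact_path": None, "ci_run": None}
    remaining = iter(args)
    for arg in remaining:
        if arg == "--agent":
            # The flag consumes the next argument as its value.
            params["agent"] = next(remaining, None)
        elif arg == "latest" or arg.isdigit() or re.match(r"https?://", arg):
            # Assumption: run IDs are numeric and run URLs start with http(s).
            params["ci_run"] = arg
        elif re.fullmatch(r"Test\w+", arg):
            params["tests"].append(arg)
        elif "/" in arg:
            # Assumption: anything else containing a slash is a local path.
            params["artifact_path"] = arg
        else:
            raise ValueError(f"unrecognized parameter: {arg!r}")
    return params


if __name__ == "__main__":
    print(parse_parameters(["TestInteractiveMultiStep", "--agent", "claude-code"]))
```

The URL check runs before the slash check so that a run URL is not mistaken for a path.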

## Phase 1: Triage CI

Read and follow the full procedure from `.claude/skills/e2e/triage-ci.md`.

This produces a findings report with classifications (flaky/real-bug/test-bug) for each test+agent pair.

## Phase 2: Implement Fixes

Read and follow the full procedure from `.claude/skills/e2e/implement.md`.

Uses the findings from Phase 1 (already in conversation context) to propose, apply, and verify fixes.
27 changes: 11 additions & 16 deletions .claude/skills/debug-e2e/SKILL.md → .claude/skills/e2e/debug.md
@@ -1,17 +1,12 @@
---
name: debug-e2e
description: Use when investigating E2E test failures from artifacts to diagnose bugs in the Entire CLI, or when pointed at an artifact path for root cause analysis
---

# Debug Entire CLI via E2E Artifacts

Diagnose Entire CLI bugs using captured artifacts from the E2E test suite. Artifacts are written to `e2e/artifacts/` locally or downloaded from CI via GitHub Actions.

## Inputs

The user provides either:
- **A test run directory:** `e2e/artifacts/{timestamp}/` triage all failures
- **A specific test directory:** `e2e/artifacts/{timestamp}/{TestName}-{agent}/` debug one test
- **A test run directory:** `e2e/artifacts/{timestamp}/` -- triage all failures
- **A specific test directory:** `e2e/artifacts/{timestamp}/{TestName}-{agent}/` -- debug one test
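
The two input shapes differ only in depth below `e2e/artifacts/`. A minimal sketch of telling them apart follows; the `describe_artifact_path` helper and the sample names are hypothetical. It assumes the test name contains no hyphen (Go identifiers cannot), so the first hyphen separates `{TestName}` from `{agent}`.

```python
from pathlib import PurePosixPath


def describe_artifact_path(path: str) -> dict:
    """Tell a test run directory apart from a single test directory."""
    parts = PurePosixPath(path).parts
    if "artifacts" not in parts:
        raise ValueError(f"not under an artifacts directory: {path!r}")
    below = parts[parts.index("artifacts") + 1:]
    if len(below) == 1:
        return {"kind": "run", "timestamp": below[0]}
    if len(below) == 2:
        # Split on the first hyphen only, so agent slugs that themselves
        # contain hyphens (for example "claude-code") stay intact.
        test_name, _, agent = below[1].partition("-")
        return {"kind": "test", "timestamp": below[0], "test": test_name, "agent": agent}
    raise ValueError(f"unexpected artifact path depth: {path!r}")


if __name__ == "__main__":
    print(describe_artifact_path("e2e/artifacts/run-1/"))
    print(describe_artifact_path("e2e/artifacts/run-1/TestRewindAfterCommit-claude-code/"))
```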

## Artifact Layout

@@ -32,7 +27,7 @@ e2e/artifacts/{timestamp}/

## Preserved Repo

When the test run was executed with `E2E_KEEP_REPOS=1`, each test's artifact directory contains a `repo` symlink pointing to the preserved temporary git repository. This is the actual repo the test operated on you can inspect it directly.
When the test run was executed with `E2E_KEEP_REPOS=1`, each test's artifact directory contains a `repo` symlink pointing to the preserved temporary git repository. This is the actual repo the test operated on -- you can inspect it directly.

**Navigate via the symlink** (e.g., `{artifact-dir}/repo/`) rather than resolving the `/tmp/...` path. The symlink lives inside the artifact directory so permissions and paths stay consistent.

@@ -42,7 +37,7 @@ The preserved repo contains:
- The `.claude/` directory (if Claude Code was the agent)
- All files the agent created or modified, in their final state

This is the most powerful debugging tool you can run `git log`, `git diff`, `git show`, inspect `.entire/` internals, and see exactly what the CLI left behind.
This is the most powerful debugging tool -- you can run `git log`, `git diff`, `git show`, inspect `.entire/` internals, and see exactly what the CLI left behind.

## Debugging Workflow

@@ -53,9 +48,9 @@ Read `report.nocolor.txt` to identify failures and their error messages. Each en
### 2. Read console.log (most important)

Full transcript of every operation:
- `> claude -p "..." ...` agent prompts with stdout/stderr
- `> git add/commit/...` git commands
- `> send: ...` interactive session inputs
- `> claude -p "..." ...` -- agent prompts with stdout/stderr
- `> git add/commit/...` -- git commands
- `> send: ...` -- interactive session inputs

This tells you what happened chronologically.
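
As an illustration of reading `console.log` chronologically, the sketch below pulls out the `> ` command lines and labels them by the three prefixes listed above. The helper names and the sample text are hypothetical. It assumes every command line starts with `> ` and that any command other than `git` or `send:` is an agent invocation, so agent output that happens to begin with `> ` would be mislabeled.

```python
from typing import Optional


def classify_console_line(line: str) -> Optional[str]:
    """Return the operation kind a line starts, or None for output lines."""
    if not line.startswith("> "):
        return None
    command = line[2:]
    if command.startswith("send:"):
        return "send"
    if command.startswith("git "):
        return "git"
    return "agent"  # assumption: every other command is an agent prompt


def timeline(console_text: str) -> list[tuple[int, str, str]]:
    """List (line number, kind, command) for every command line, in order."""
    events = []
    for number, line in enumerate(console_text.splitlines(), start=1):
        kind = classify_console_line(line)
        if kind is not None:
            events.append((number, kind, line[2:]))
    return events


# Hypothetical excerpt, not taken from a real run.
SAMPLE = '> claude -p "create hello.txt"\ncreated hello.txt\n> git add hello.txt\n> send: yes\n'

if __name__ == "__main__":
    for number, kind, command in timeline(SAMPLE):
        print(f"{number:>4} {kind:<5} {command}")
```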

@@ -71,14 +66,14 @@ Cross-reference console.log (what happened) with the test (what should have happ
|---------|-------------------|
| Checkpoint not created / timeout | Check `entire-logs/entire.log` for hook invocations, phase transitions, errors |
| Wrong checkpoint content | Check `git-tree.txt` for checkpoint branch files, `checkpoint-metadata/` for session info |
| Hooks didn't fire | Check `entire.log` for missing hook entries (session-start, user-prompt-submit, stop, post-commit) |
| Stash/unstash problems | Check `entire.log` for stash-related log lines, `git-log.txt` for commit ordering |
| Hooks didn't fire | Check `entire-logs/entire.log` for missing hook entries (session-start, user-prompt-submit, stop, post-commit) |
| Stash/unstash problems | Check `entire-logs/entire.log` for stash-related log lines, `git-log.txt` for commit ordering |
| Attribution issues | Check `checkpoint-metadata/` for `files_touched`, session metadata for attribution data |
| Strategy mismatch | Check `entire.log` for `strategy` field, verify auto-commit vs manual-commit behavior |
| Strategy mismatch | Check `entire-logs/entire.log` for `strategy` field, verify auto-commit vs manual-commit behavior |

### 5. Deep dive files

- **entire-logs/entire.log**: Structured JSON logs hook lifecycle, session phases (`active` `idle` `ended`), warnings, errors. Key fields: `component`, `hook`, `strategy`, `session_id`.
- **entire-logs/entire.log**: Structured JSON logs -- hook lifecycle, session phases (`active` -> `idle` -> `ended`), warnings, errors. Key fields: `component`, `hook`, `strategy`, `session_id`.
- **git-log.txt**: Commit graph showing main branch, `entire/checkpoints/v1`, checkpoint initialization.
- **git-tree.txt**: Files at HEAD vs checkpoint branch (separated by `--- entire/checkpoints/v1 ---`).
- **checkpoint-metadata/**: `metadata.json` has `checkpoint_id`, `strategy`, `files_touched`, `token_usage`, and `sessions` array. Session subdirs have per-session details.
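
For the "hooks didn't fire" symptom, a quick scan of `entire-logs/entire.log` can show which hook entries are absent. The sketch below is illustrative only: the `missing_hooks` helper and the sample log are hypothetical. It assumes one JSON object per line and that the `hook` field carries the hook names listed in the symptom table, neither of which this document confirms.

```python
import json

EXPECTED_HOOKS = ("session-start", "user-prompt-submit", "stop", "post-commit")


def missing_hooks(log_text: str) -> list[str]:
    """Return the expected hooks that never appear in the log."""
    seen = set()
    for line in log_text.splitlines():
        line = line.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # tolerate lines that are not JSON
        if isinstance(entry, dict) and entry.get("hook"):
            seen.add(entry["hook"])
    return [hook for hook in EXPECTED_HOOKS if hook not in seen]


# Hypothetical log excerpt; the real field values may differ.
SAMPLE_LOG = "\n".join([
    json.dumps({"component": "hooks", "hook": "session-start", "session_id": "s1"}),
    json.dumps({"component": "hooks", "hook": "user-prompt-submit", "session_id": "s1"}),
    "not a json line",
    json.dumps({"component": "hooks", "hook": "stop", "session_id": "s1"}),
])

if __name__ == "__main__":
    print(missing_hooks(SAMPLE_LOG))
```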