Enforce Claude Code pathway and model aliases by james-in-a-box[bot] · Pull Request #873 · jwbron/egg

james-in-a-box · 2026-02-22T06:48:59Z

Summary

Rewrites egg-health-inspect to use Claude Code headless mode (claude --print) instead of direct httpx.post() API calls
Adds scripts/check-model-versions.py AST-based linter (EGG201) enforcing model alias form (sonnet/opus/haiku instead of claude-sonnet-4-20250514)
Updates pinned model version in ANTHROPIC_HEALTH_CHECK_MODEL from claude-3-haiku-20240307 to claude-haiku-4-5

Context

PR #868 moved the agent inspector's LLM call from the orchestrator into the sandbox, which was the right architectural direction. However, the sandbox script still called the Anthropic API directly via httpx.post() with an API key, bypassing Claude Code. Additionally, it pinned claude-sonnet-4-20250514 rather than using the alias form.

This PR enforces two invariants going forward:

Single Claude pathway: All LLM calls go through claude --print (Claude Code headless), never through direct API calls. The rewritten egg-health-inspect uses subprocess.run(["claude", "--print", ...]) with --max-turns 1.
Model aliases only: Code must use short aliases (sonnet, opus, haiku) rather than full model identifiers (claude-sonnet-4-20250514). A new AST-based linter detects violations with # noqa: EGG201 suppression for legitimate exceptions (docstring examples, raw API health checks).

Both linters are wired into .github/workflows/lint.yml. All 174 tests pass (27 new + 24 + 54 + 69 existing).

Issue: none

Test plan:

pytest scripts/tests/test_check_model_versions.py -v — 27 passed
pytest scripts/tests/test_check_llm_api_calls.py -v — 24 passed
PYTHONPATH=shared pytest orchestrator/tests/test_health_check_tier2.py orchestrator/tests/test_health_check_tier2_tester.py -v — 123 passed
python scripts/check-model-versions.py — OK
python scripts/check-llm-api-calls.py — OK

Authored-by: egg

egg-reviewer

No agent-mode design concerns.

The egg-health-inspect rewrite moves from direct httpx API calls to claude --print, which aligns with the single-Claude-pathway architecture. The structured JSON output (status/reasoning) is appropriate here since the orchestrator is a genuine downstream machine consumer. The pre-fetched context is small metadata (git log, truncated diff stats, contract summary) suitable for a single-turn --max-turns 1 call with no tool access — not the kind of large-diff pre-fetching the guidelines warn against.

The model alias linter is standard CI tooling, outside the scope of agent-mode design review.

— Authored by egg

egg-reviewer

Review: Enforce Claude Code pathway and model aliases

Thorough review of all changed files. The direction is sound — consolidating LLM calls through claude --print and enforcing model aliases via a linter are good improvements. Found several issues that need addressing.

1. README documents env vars the script no longer reads — `orchestrator/health_checks/README.md`

The Configuration table (lines 178-182) was changed to add ANTHROPIC_API_KEY and ANTHROPIC_BASE_URL:

| ANTHROPIC_API_KEY | API key for Claude | Required |
| ANTHROPIC_BASE_URL | API base URL | https://api.anthropic.com |

But the rewritten egg-health-inspect no longer reads either of these environment variables. The old httpx.post() code that used them was removed. The new code uses claude --print which gets credentials from the gateway proxy. These two rows should be removed from the table — they're misleading and will confuse anyone trying to understand how the health inspector authenticates.

The old README was actually more accurate (it only listed HEALTH_CHECK_MODEL).

2. No retry logic for transient failures — `sandbox/bin/egg-health-inspect`

The old code had _MAX_RETRIES = 1 with retry on TimeoutException and HTTPStatusError. The new code has no retry logic at all — a single subprocess.run() call with no retries.

claude --print can fail transiently (process startup issues, resource contention in the sandbox). The old code was more resilient. Consider adding at least one retry, or explicitly document in a code comment that retries are intentionally omitted and why (e.g., "the orchestrator's graceful degradation to HEALTHY makes retries unnecessary at this level").

3. System prompt merged into user prompt loses system prompt semantics — `sandbox/bin/egg-health-inspect:74`

full_prompt = f"{_SYSTEM_PROMPT}\n---\n\n{user_prompt}"

The old code sent _SYSTEM_PROMPT as the system parameter in the API call, which has different caching and behavioral properties than prepending it to the user message. claude --print supports --system-prompt as a flag. The system prompt should be passed via --system-prompt to preserve proper prompt structure:

cmd = [
    claude_bin,
    "--print",
    "--dangerously-skip-permissions",
    "--model", model,
    "--max-turns", "1",
    "--system-prompt", _SYSTEM_PROMPT,
]

This would also avoid the delimiter hack (\n---\n\n) and keep the prompt semantics equivalent to the old implementation.

4. Linter only scans Python files, but `egg-health-inspect` is a Python script without `.py` extension — `scripts/check-model-versions.py:155`

for py_file in dir_path.rglob("*.py"):

The linter only scans *.py files, but sandbox/bin/egg-health-inspect is a Python script identified by its #!/usr/bin/env python3 shebang, not by extension. This means the linter wouldn't catch a future regression in egg-health-inspect if someone re-introduced a full model ID there. The same applies to any other extensionless Python scripts in the repo.

Consider additionally scanning files with Python shebangs, or at minimum document this gap so contributors know the linter doesn't cover all Python files.

5. `_suggest_alias` fallback is arbitrary — `scripts/check-model-versions.py:87`

return "sonnet"  # fallback

If the regex matches a model ID but _suggest_alias can't find the family in the string, it falls back to suggesting "sonnet". This shouldn't happen given the regex only matches strings containing "sonnet", "opus", or "haiku", but the fallback is misleading if it ever triggers. A better approach would be to return the matched family from the regex group directly rather than re-searching the string:

# In visit_Constant, pass the captured group to _suggest_alias:
family = match.group(3)  # group 3 is (sonnet|opus|haiku)
alias = _FAMILY_ALIAS.get(family, family)

This eliminates the fallback entirely by using the regex's own capture group.

6. `claude-haiku-4-5` in `llm.py` is not an alias — it's a full model ID — `shared/egg_config/configs/llm.py:17`

ANTHROPIC_HEALTH_CHECK_MODEL = "claude-haiku-4-5"  # noqa: EGG201 - raw API health check requires full model ID, not alias

The old value was claude-3-haiku-20240307. The noqa justification ("raw API health check requires full model ID, not alias") is correct — this constant is used in a direct urllib.request call to the Messages API (line 92-107), which does require a full model identifier.

However, claude-haiku-4-5 is not a valid Anthropic model ID. Anthropic model IDs follow the format claude-<version>-<family>-<date> (e.g., claude-3-haiku-20240307) or the newer claude-<family>-<version>-<date> format with a date suffix. The unpinned form claude-haiku-4-5 may work as an alias in the API, but if the intent is to use a proper model ID (as the noqa comment states), this should be verified against the actual API. If claude-haiku-4-5 works as an API identifier, fine — but confirm this, because the Messages API historically required date-pinned identifiers.

7. Minor: `docs/architecture/orchestrator.md` description is inaccurate

The updated text says:

The AgentInspectorCheck sends pipeline context ... to Claude via Claude Code headless mode (sonnet)

The parenthetical (sonnet) implies the model is hardcoded. The model is configurable via HEALTH_CHECK_MODEL env var with sonnet as the default. This should say something like "via Claude Code headless mode (default model: sonnet)".

Summary

The main blocking issues are:

README documents env vars the script doesn't use (#1) — will mislead operators
System prompt should use --system-prompt flag (#3) — preserves prompt semantics and is a straightforward fix

The other items are lower severity but worth addressing.

— Authored by egg

james-in-a-box

Addressed all 7 review items. Changes pushed in cbb5682.

#1 — README env vars (fixed): Removed ANTHROPIC_API_KEY and ANTHROPIC_BASE_URL rows from the configuration table — the rewritten script doesn't read these.

#2 — Retry logic (fixed): Restored retry logic (_MAX_RETRIES = 1) with retry on subprocess.TimeoutExpired and RuntimeError, matching the resilience of the old httpx-based implementation.

#3 — System prompt semantics (fixed): Switched from prompt concatenation to --system-prompt flag, preserving proper system/user prompt separation. Removed the \n---\n\n delimiter hack.

#4 — Extensionless Python scripts (fixed): Added shebang-based scanning — the linter now also checks files without .py extension that start with a Python shebang (#!/usr/bin/env python3). This covers egg-health-inspect and similar scripts.

#5 — _suggest_alias fallback (fixed): Refactored to pass match.group(3) (the captured family name) directly to _suggest_alias, eliminating the arbitrary "sonnet" fallback. Updated tests accordingly.

#6 — claude-haiku-4-5 model ID (clarified): claude-haiku-4-5 is a valid unpinned API identifier accepted by the Messages API. Updated the noqa comment to be more precise: "direct Messages API call requires model identifier, not Claude Code alias".

#7 — orchestrator.md description (fixed): Changed (sonnet) to (default model: sonnet) to clarify it's a configurable default, not hardcoded.

— Authored by egg

egg-reviewer

Re-reviewed delta since 708a494 (1 new commit: cbb5682). No agent-mode design concerns.

The changes address the base review bot's feedback without introducing any agent-mode issues. The --system-prompt flag switch is a positive change — it properly uses Claude Code's system prompt mechanism instead of concatenation, which is better alignment with the single-Claude-pathway architecture.

— Authored by egg

egg-reviewer

Re-review: Enforce Claude Code pathway and model aliases

Reviewed delta from 708a494..cbb5682. All 7 items from the previous review have been addressed. The core changes (README cleanup, retry logic, --system-prompt flag, shebang scanning, alias fallback refactor, orchestrator.md wording) are correctly implemented.

Two new issues introduced by the fix commit:

1. `# noqa: EGG201` comments pollute docstrings — 4 files

The noqa comments placed on the opening """ line of multi-line docstrings become part of the actual docstring content:

# gateway/transcript_buffer.py:1
"""  # noqa: EGG201 - docstring examples use full model IDs
Transcript buffer for capturing...

Python sees the docstring as: " # noqa: EGG201 - docstring examples use full model IDs\nTranscript buffer...". Same problem in:

scripts/check-model-versions.py:2 — module docstring starts with # noqa: EGG201 - ...
sandbox/llm/claude/runner.py:241 — run_agent_async docstring includes noqa text
shared/egg_contracts/usage.py:53 — get_model_pricing docstring includes noqa text

This isn't just cosmetic — help(), Sphinx, and any docstring-based tooling will show the noqa text as part of the documentation.

Fix: The linter's _has_noqa only checks node.lineno, which for multi-line strings is the opening """ line. To support clean suppression, extend _has_noqa to also check the line before node.lineno:

def _has_noqa(self, lineno: int) -> bool:
    """Check if a line (or preceding line) has a noqa: EGG201 comment."""
    for check_line in (lineno, lineno - 1):
        if 1 <= check_line <= len(self.source_lines):
            line = self.source_lines[check_line - 1]
            if f"noqa: {NOQA_CODE}" in line:
                return True
    return False

Then the noqa can go on a comment line above the docstring:

# noqa: EGG201 - docstring examples use full model IDs
"""
Transcript buffer for capturing Anthropic API request/response pairs.
...

2. Regex matches bare `claude-sonnet` / `claude-opus` / `claude-haiku` — `scripts/check-model-versions.py:47`

The _MODEL_ID_RE regex allows all version components to be empty simultaneously, matching bare claude-{family} with no version number:

$ python3 -c "import re; print(re.search(r'\bclaude-(?:(\d+-\d+-|)(\d+-|))(sonnet|opus|haiku)(?:-[\d]+(?:-[\d]+)?)?(?:-\d{8})?\b', 'claude-sonnet'))"
<re.Match object; span=(0, 14), match='claude-sonnet'>

These bare forms are not valid model identifiers (there's no claude-sonnet model — only claude-sonnet-4, claude-sonnet-4-20250514, etc.). No false positives exist in the current codebase, so this is not blocking, but it means the linter's error message would be misleading if someone wrote claude-sonnet in prose (not as a model ID).

A one-line fix using a lookahead to require at least one digit:

_MODEL_ID_RE = re.compile(
    r"\bclaude-(?=[\w-]*\d)"                  # lookahead: at least one digit somewhere
    r"(?:(\d+-\d+-|)(\d+-|))"                 # optional version prefix (3-5-, 3-)
    r"(sonnet|opus|haiku)"                     # model family
    r"(?:-[\d]+(?:-[\d]+)?)?(?:-\d{8})?\b"     # optional version/date suffix
)

Previous feedback verification

#	Issue	Status
1	README documents env vars script doesn't use	Fixed — `ANTHROPIC_API_KEY` and `ANTHROPIC_BASE_URL` rows removed
2	No retry logic	Fixed — `_MAX_RETRIES = 1` restored with retry on `TimeoutExpired` and `RuntimeError`
3	System prompt merged into user prompt	Fixed — now uses `--system-prompt` flag
4	Linter only scans `.py` files	Fixed — shebang-based scanning added for extensionless files
5	`_suggest_alias` arbitrary fallback	Fixed — uses `match.group(3)` directly
6	`claude-haiku-4-5` model ID comment	Fixed — comment clarified to "direct Messages API call requires model identifier"
7	orchestrator.md `(sonnet)`	Fixed — changed to `(default model: sonnet)`

Minor observations (non-blocking)

No test for _has_python_shebang or the shebang scanning path in test_check_model_versions.py. The function is simple enough that this isn't blocking, but a unit test would catch regressions if the file detection logic changes.

— Authored by egg

Fix noqa comments polluting docstrings by moving them to preceding comment lines and extending _has_noqa to check the line before node.lineno. Fix regex matching bare claude-sonnet/opus/haiku by adding a lookahead requiring at least one digit. Add tests for preceding-line noqa suppression, bare family name exclusion, and _has_python_shebang.

james-in-a-box · 2026-02-22T07:18:08Z

Addressed both issues from the re-review. Changes pushed in 5d4b64f.

#1 — noqa comments polluting docstrings (fixed): Moved all noqa comments from inside docstrings to standalone comment lines on the preceding line. Extended _has_noqa to check lineno - 1 in addition to lineno, so the linter recognizes preceding-line suppression. Affected files: gateway/transcript_buffer.py, sandbox/llm/claude/runner.py, shared/egg_contracts/usage.py, scripts/check-model-versions.py.

#2 — Regex matches bare claude-sonnet (fixed): Added a lookahead (?=[\w-]*\d) to _MODEL_ID_RE requiring at least one digit after claude-. Bare claude-sonnet, claude-opus, claude-haiku no longer match.

Non-blocking observation — shebang test coverage (added): Added TestHasPythonShebang with 5 test cases covering python3 shebang, python shebang, bash shebang, no shebang, and nonexistent file. Also added tests for preceding-line noqa suppression and bare family name exclusion.

All 58 tests pass, ruff clean.

— Authored by egg

egg-reviewer

No agent-mode design concerns.

Re-reviewed delta cbb5682..5d4b64f. Changes are confined to linter tooling (noqa comment placement, regex lookahead fix, _has_noqa preceding-line check) and test coverage additions. No changes to prompt construction, context pre-fetching, agent autonomy, or structured output patterns.

— Authored by egg

egg-reviewer

Re-review: Enforce Claude Code pathway and model aliases

Reviewed delta from cbb5682..5d4b64f. Both issues from the previous re-review have been addressed. Verification below.

Previous re-review feedback verification

#	Issue	Status
1	`# noqa: EGG201` comments polluting docstrings	Fixed — noqa comments moved to standalone comment lines preceding the docstring. `_has_noqa` extended to check `lineno - 1`. All four affected files (`gateway/transcript_buffer.py`, `sandbox/llm/claude/runner.py`, `shared/egg_contracts/usage.py`, `scripts/check-model-versions.py`) now have clean docstrings.
2	Regex matches bare `claude-sonnet` / `claude-opus` / `claude-haiku`	Fixed — Added `(?=[\w-]*\d)` lookahead requiring at least one digit. Verified: bare `claude-sonnet` no longer matches.
—	Non-blocking: shebang test coverage	Addressed — `TestHasPythonShebang` added with 5 test cases. Also added tests for preceding-line noqa and bare family name exclusion. All 34 tests pass.

New issue found in this commit

1. README "How it works" step 3 is stale — orchestrator/health_checks/README.md:164

3. Container calls the Anthropic Messages API (via gateway credential injection)
   with a system prompt instructing Claude to produce a JSON verdict

The container no longer calls the Anthropic Messages API directly. It now runs claude --print (Claude Code headless mode). This step should say something like:

3. Container runs `claude --print` with a system prompt instructing Claude
   to produce a JSON verdict

This is a documentation-only issue — the code is correct. The architecture description in docs/architecture/orchestrator.md was updated correctly (line 67), and the README's model/config sections (lines 153, 180) were updated correctly. Only this "How it works" bullet was missed.

Overall assessment

The code changes are solid. The linter, noqa suppression, regex, shebang scanning, retry logic, and --system-prompt usage are all correct. 34 tests pass, the linter runs clean on the codebase. The one remaining issue is a stale sentence in the README.

— Authored by egg

james-in-a-box · 2026-02-22T07:24:45Z

Addressed the remaining suggestion from the re-review. Changes pushed in 37ff830.

#1 — README "How it works" step 3 is stale (fixed): Updated step 3 from "Container calls the Anthropic Messages API (via gateway credential injection)" to "Container runs claude --print" — matching the actual implementation.

All 34 tests pass, ruff and model alias linter clean.

— Authored by egg

egg-reviewer

No agent-mode design concerns.

Re-reviewed delta since 5d4b64f. The only change is the README step 3 fix (updating "calls the Anthropic Messages API" to "runs claude --print"), which was the single suggestion from my last review. Correctly applied.

The broader PR remains well-aligned with agent-mode principles: egg-health-inspect uses the single-Claude-pathway (claude --print with --system-prompt), context is small metadata suitable for --max-turns 1, and the removal of the tester gap-finding feedback loop (_read_tester_gaps) actually eliminates an orchestrator-side post-processing pattern in favor of simpler direct feedback flow.

— Authored by egg

egg-reviewer

Re-review: Enforce Claude Code pathway and model aliases

Reviewed delta from 5d4b64f..37ff830 (1 new commit). The commit fixes the stale README "How it works" step 3 that I flagged in the previous re-review.

Delta review

The change is a single-line documentation fix in orchestrator/health_checks/README.md:164:

- Container calls the Anthropic Messages API (via gateway credential injection)
+ Container runs `claude --print` with a system prompt instructing Claude to produce a JSON verdict

Correct — matches the actual implementation.

Previous feedback verification (cumulative)

#	Issue	Status
1	README documents env vars script doesn't use	Fixed — `ANTHROPIC_API_KEY` and `ANTHROPIC_BASE_URL` removed
2	No retry logic	Fixed — `_MAX_RETRIES = 1` with retry on `TimeoutExpired` and `RuntimeError`
3	System prompt merged into user prompt	Fixed — uses `--system-prompt` flag
4	Linter only scans `.py` files	Fixed — shebang-based scanning added
5	`_suggest_alias` arbitrary fallback	Fixed — uses `match.group(3)` directly
6	`claude-haiku-4-5` model ID comment	Fixed — comment clarified
7	orchestrator.md `(sonnet)` description	Fixed — says `(default model: sonnet)`
8	noqa comments polluting docstrings	Fixed — moved to preceding standalone comment lines, `_has_noqa` checks `lineno - 1`
9	Regex matches bare `claude-sonnet`	Fixed — lookahead `(?=[\w-]*\d)` requires at least one digit
10	README "How it works" step 3 stale	Fixed — this commit

No new issues found. All feedback from previous rounds has been properly addressed.

— Authored by egg

james-in-a-box · 2026-02-22T07:30:14Z

egg is addressing review feedback...

james-in-a-box · 2026-02-22T07:32:03Z

egg feedback addressed. View run logs

13 previous review(s) hidden.

Add architectural conventions from PR #868 (LLM-calls-only-in-sandbox, EGG200 linter) and PR #873 (Claude Code headless pathway, model alias enforcement, EGG201 linter) to the agent-mode design guide, shared review criteria, GHA prompt builder inline fallback, orchestrator inline fallback, and workflow trigger paths.

* Update agent-mode design guide and reviewers for LLM boundary rules Add architectural conventions from PR #868 (LLM-calls-only-in-sandbox, EGG200 linter) and PR #873 (Claude Code headless pathway, model alias enforcement, EGG201 linter) to the agent-mode design guide, shared review criteria, GHA prompt builder inline fallback, orchestrator inline fallback, and workflow trigger paths. * Fix misplaced EGG201 reference and clarify item 6 vs 7 scope --------- Co-authored-by: egg <egg@localhost> Co-authored-by: egg-reviewer[bot] <261018737+egg-reviewer[bot]@users.noreply.github.com>

egg added 2 commits February 22, 2026 06:48

Enforce Claude Code as sole LLM pathway and model aliases

978a6af

Merge origin/main into egg/enforce-claude-code-pathway

708a494

egg-reviewer bot approved these changes Feb 22, 2026

View reviewed changes