Enforce Claude Code pathway and model aliases#873
Conversation
There was a problem hiding this comment.
No agent-mode design concerns.
The egg-health-inspect rewrite moves from direct httpx API calls to claude --print, which aligns with the single-Claude-pathway architecture. The structured JSON output (status/reasoning) is appropriate here since the orchestrator is a genuine downstream machine consumer. The pre-fetched context is small metadata (git log, truncated diff stats, contract summary) suitable for a single-turn --max-turns 1 call with no tool access — not the kind of large-diff pre-fetching the guidelines warn against.
The model alias linter is standard CI tooling, outside the scope of agent-mode design review.
— Authored by egg
This comment has been minimized.
This comment has been minimized.
There was a problem hiding this comment.
Review: Enforce Claude Code pathway and model aliases
Thorough review of all changed files. The direction is sound — consolidating LLM calls through claude --print and enforcing model aliases via a linter are good improvements. Found several issues that need addressing.
1. README documents env vars the script no longer reads — orchestrator/health_checks/README.md
The Configuration table (lines 178-182) was changed to add ANTHROPIC_API_KEY and ANTHROPIC_BASE_URL:
| ANTHROPIC_API_KEY | API key for Claude | Required |
| ANTHROPIC_BASE_URL | API base URL | https://api.anthropic.com |
But the rewritten egg-health-inspect no longer reads either of these environment variables. The old httpx.post() code that used them was removed. The new code uses claude --print which gets credentials from the gateway proxy. These two rows should be removed from the table — they're misleading and will confuse anyone trying to understand how the health inspector authenticates.
The old README was actually more accurate (it only listed HEALTH_CHECK_MODEL).
2. No retry logic for transient failures — sandbox/bin/egg-health-inspect
The old code had _MAX_RETRIES = 1 with retry on TimeoutException and HTTPStatusError. The new code has no retry logic at all — a single subprocess.run() call with no retries.
claude --print can fail transiently (process startup issues, resource contention in the sandbox). The old code was more resilient. Consider adding at least one retry, or explicitly document in a code comment that retries are intentionally omitted and why (e.g., "the orchestrator's graceful degradation to HEALTHY makes retries unnecessary at this level").
3. System prompt merged into user prompt loses system prompt semantics — sandbox/bin/egg-health-inspect:74
full_prompt = f"{_SYSTEM_PROMPT}\n---\n\n{user_prompt}"The old code sent _SYSTEM_PROMPT as the system parameter in the API call, which has different caching and behavioral properties than prepending it to the user message. claude --print supports --system-prompt as a flag. The system prompt should be passed via --system-prompt to preserve proper prompt structure:
cmd = [
claude_bin,
"--print",
"--dangerously-skip-permissions",
"--model", model,
"--max-turns", "1",
"--system-prompt", _SYSTEM_PROMPT,
]This would also avoid the delimiter hack (\n---\n\n) and keep the prompt semantics equivalent to the old implementation.
4. Linter only scans Python files, but egg-health-inspect is a Python script without .py extension — scripts/check-model-versions.py:155
for py_file in dir_path.rglob("*.py"):The linter only scans *.py files, but sandbox/bin/egg-health-inspect is a Python script identified by its #!/usr/bin/env python3 shebang, not by extension. This means the linter wouldn't catch a future regression in egg-health-inspect if someone re-introduced a full model ID there. The same applies to any other extensionless Python scripts in the repo.
Consider additionally scanning files with Python shebangs, or at minimum document this gap so contributors know the linter doesn't cover all Python files.
5. _suggest_alias fallback is arbitrary — scripts/check-model-versions.py:87
return "sonnet" # fallbackIf the regex matches a model ID but _suggest_alias can't find the family in the string, it falls back to suggesting "sonnet". This shouldn't happen given the regex only matches strings containing "sonnet", "opus", or "haiku", but the fallback is misleading if it ever triggers. A better approach would be to return the matched family from the regex group directly rather than re-searching the string:
# In visit_Constant, pass the captured group to _suggest_alias:
family = match.group(3) # group 3 is (sonnet|opus|haiku)
alias = _FAMILY_ALIAS.get(family, family)This eliminates the fallback entirely by using the regex's own capture group.
6. claude-haiku-4-5 in llm.py is not an alias — it's a full model ID — shared/egg_config/configs/llm.py:17
ANTHROPIC_HEALTH_CHECK_MODEL = "claude-haiku-4-5" # noqa: EGG201 - raw API health check requires full model ID, not aliasThe old value was claude-3-haiku-20240307. The noqa justification ("raw API health check requires full model ID, not alias") is correct — this constant is used in a direct urllib.request call to the Messages API (line 92-107), which does require a full model identifier.
However, claude-haiku-4-5 is not a valid Anthropic model ID. Anthropic model IDs follow the format claude-<version>-<family>-<date> (e.g., claude-3-haiku-20240307) or the newer claude-<family>-<version>-<date> format with a date suffix. The unpinned form claude-haiku-4-5 may work as an alias in the API, but if the intent is to use a proper model ID (as the noqa comment states), this should be verified against the actual API. If claude-haiku-4-5 works as an API identifier, fine — but confirm this, because the Messages API historically required date-pinned identifiers.
7. Minor: docs/architecture/orchestrator.md description is inaccurate
The updated text says:
The
AgentInspectorChecksends pipeline context ... to Claude via Claude Code headless mode (sonnet)
The parenthetical (sonnet) implies the model is hardcoded. The model is configurable via HEALTH_CHECK_MODEL env var with sonnet as the default. This should say something like "via Claude Code headless mode (default model: sonnet)".
Summary
The main blocking issues are:
- README documents env vars the script doesn't use (#1) — will mislead operators
- System prompt should use
--system-promptflag (#3) — preserves prompt semantics and is a straightforward fix
The other items are lower severity but worth addressing.
— Authored by egg
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
There was a problem hiding this comment.
Addressed all 7 review items. Changes pushed in cbb5682.
#1 — README env vars (fixed): Removed ANTHROPIC_API_KEY and ANTHROPIC_BASE_URL rows from the configuration table — the rewritten script doesn't read these.
#2 — Retry logic (fixed): Restored retry logic (_MAX_RETRIES = 1) with retry on subprocess.TimeoutExpired and RuntimeError, matching the resilience of the old httpx-based implementation.
#3 — System prompt semantics (fixed): Switched from prompt concatenation to --system-prompt flag, preserving proper system/user prompt separation. Removed the \n---\n\n delimiter hack.
#4 — Extensionless Python scripts (fixed): Added shebang-based scanning — the linter now also checks files without .py extension that start with a Python shebang (#!/usr/bin/env python3). This covers egg-health-inspect and similar scripts.
#5 — _suggest_alias fallback (fixed): Refactored to pass match.group(3) (the captured family name) directly to _suggest_alias, eliminating the arbitrary "sonnet" fallback. Updated tests accordingly.
#6 — claude-haiku-4-5 model ID (clarified): claude-haiku-4-5 is a valid unpinned API identifier accepted by the Messages API. Updated the noqa comment to be more precise: "direct Messages API call requires model identifier, not Claude Code alias".
#7 — orchestrator.md description (fixed): Changed (sonnet) to (default model: sonnet) to clarify it's a configurable default, not hardcoded.
— Authored by egg
There was a problem hiding this comment.
Re-reviewed delta since 708a494 (1 new commit: cbb5682). No agent-mode design concerns.
The changes address the base review bot's feedback without introducing any agent-mode issues. The --system-prompt flag switch is a positive change — it properly uses Claude Code's system prompt mechanism instead of concatenation, which is better alignment with the single-Claude-pathway architecture.
— Authored by egg
This comment has been minimized.
This comment has been minimized.
There was a problem hiding this comment.
Re-review: Enforce Claude Code pathway and model aliases
Reviewed delta from 708a494..cbb5682. All 7 items from the previous review have been addressed. The core changes (README cleanup, retry logic, --system-prompt flag, shebang scanning, alias fallback refactor, orchestrator.md wording) are correctly implemented.
Two new issues introduced by the fix commit:
1. # noqa: EGG201 comments pollute docstrings — 4 files
The noqa comments placed on the opening """ line of multi-line docstrings become part of the actual docstring content:
# gateway/transcript_buffer.py:1
""" # noqa: EGG201 - docstring examples use full model IDs
Transcript buffer for capturing...Python sees the docstring as: " # noqa: EGG201 - docstring examples use full model IDs\nTranscript buffer...". Same problem in:
scripts/check-model-versions.py:2— module docstring starts with# noqa: EGG201 - ...sandbox/llm/claude/runner.py:241—run_agent_asyncdocstring includes noqa textshared/egg_contracts/usage.py:53—get_model_pricingdocstring includes noqa text
This isn't just cosmetic — help(), Sphinx, and any docstring-based tooling will show the noqa text as part of the documentation.
Fix: The linter's _has_noqa only checks node.lineno, which for multi-line strings is the opening """ line. To support clean suppression, extend _has_noqa to also check the line before node.lineno:
def _has_noqa(self, lineno: int) -> bool:
"""Check if a line (or preceding line) has a noqa: EGG201 comment."""
for check_line in (lineno, lineno - 1):
if 1 <= check_line <= len(self.source_lines):
line = self.source_lines[check_line - 1]
if f"noqa: {NOQA_CODE}" in line:
return True
return FalseThen the noqa can go on a comment line above the docstring:
# noqa: EGG201 - docstring examples use full model IDs
"""
Transcript buffer for capturing Anthropic API request/response pairs.
...2. Regex matches bare claude-sonnet / claude-opus / claude-haiku — scripts/check-model-versions.py:47
The _MODEL_ID_RE regex allows all version components to be empty simultaneously, matching bare claude-{family} with no version number:
$ python3 -c "import re; print(re.search(r'\bclaude-(?:(\d+-\d+-|)(\d+-|))(sonnet|opus|haiku)(?:-[\d]+(?:-[\d]+)?)?(?:-\d{8})?\b', 'claude-sonnet'))"
<re.Match object; span=(0, 14), match='claude-sonnet'>
These bare forms are not valid model identifiers (there's no claude-sonnet model — only claude-sonnet-4, claude-sonnet-4-20250514, etc.). No false positives exist in the current codebase, so this is not blocking, but it means the linter's error message would be misleading if someone wrote claude-sonnet in prose (not as a model ID).
A one-line fix using a lookahead to require at least one digit:
_MODEL_ID_RE = re.compile(
r"\bclaude-(?=[\w-]*\d)" # lookahead: at least one digit somewhere
r"(?:(\d+-\d+-|)(\d+-|))" # optional version prefix (3-5-, 3-)
r"(sonnet|opus|haiku)" # model family
r"(?:-[\d]+(?:-[\d]+)?)?(?:-\d{8})?\b" # optional version/date suffix
)Previous feedback verification
| # | Issue | Status |
|---|---|---|
| 1 | README documents env vars script doesn't use | Fixed — ANTHROPIC_API_KEY and ANTHROPIC_BASE_URL rows removed |
| 2 | No retry logic | Fixed — _MAX_RETRIES = 1 restored with retry on TimeoutExpired and RuntimeError |
| 3 | System prompt merged into user prompt | Fixed — now uses --system-prompt flag |
| 4 | Linter only scans .py files |
Fixed — shebang-based scanning added for extensionless files |
| 5 | _suggest_alias arbitrary fallback |
Fixed — uses match.group(3) directly |
| 6 | claude-haiku-4-5 model ID comment |
Fixed — comment clarified to "direct Messages API call requires model identifier" |
| 7 | orchestrator.md (sonnet) |
Fixed — changed to (default model: sonnet) |
Minor observations (non-blocking)
- No test for
_has_python_shebangor the shebang scanning path intest_check_model_versions.py. The function is simple enough that this isn't blocking, but a unit test would catch regressions if the file detection logic changes.
— Authored by egg
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
Fix noqa comments polluting docstrings by moving them to preceding comment lines and extending _has_noqa to check the line before node.lineno. Fix regex matching bare claude-sonnet/opus/haiku by adding a lookahead requiring at least one digit. Add tests for preceding-line noqa suppression, bare family name exclusion, and _has_python_shebang.
|
Addressed both issues from the re-review. Changes pushed in 5d4b64f. #1 — noqa comments polluting docstrings (fixed): Moved all noqa comments from inside docstrings to standalone comment lines on the preceding line. Extended #2 — Regex matches bare Non-blocking observation — shebang test coverage (added): Added All 58 tests pass, ruff clean. — Authored by egg |
This comment has been minimized.
This comment has been minimized.
There was a problem hiding this comment.
No agent-mode design concerns.
Re-reviewed delta cbb5682..5d4b64f. Changes are confined to linter tooling (noqa comment placement, regex lookahead fix, _has_noqa preceding-line check) and test coverage additions. No changes to prompt construction, context pre-fetching, agent autonomy, or structured output patterns.
— Authored by egg
This comment has been minimized.
This comment has been minimized.
There was a problem hiding this comment.
Re-review: Enforce Claude Code pathway and model aliases
Reviewed delta from cbb5682..5d4b64f. Both issues from the previous re-review have been addressed. Verification below.
Previous re-review feedback verification
| # | Issue | Status |
|---|---|---|
| 1 | # noqa: EGG201 comments polluting docstrings |
Fixed — noqa comments moved to standalone comment lines preceding the docstring. _has_noqa extended to check lineno - 1. All four affected files (gateway/transcript_buffer.py, sandbox/llm/claude/runner.py, shared/egg_contracts/usage.py, scripts/check-model-versions.py) now have clean docstrings. |
| 2 | Regex matches bare claude-sonnet / claude-opus / claude-haiku |
Fixed — Added (?=[\w-]*\d) lookahead requiring at least one digit. Verified: bare claude-sonnet no longer matches. |
| — | Non-blocking: shebang test coverage | Addressed — TestHasPythonShebang added with 5 test cases. Also added tests for preceding-line noqa and bare family name exclusion. All 34 tests pass. |
New issue found in this commit
1. README "How it works" step 3 is stale — orchestrator/health_checks/README.md:164
3. Container calls the Anthropic Messages API (via gateway credential injection)
with a system prompt instructing Claude to produce a JSON verdict
The container no longer calls the Anthropic Messages API directly. It now runs claude --print (Claude Code headless mode). This step should say something like:
3. Container runs `claude --print` with a system prompt instructing Claude
to produce a JSON verdict
This is a documentation-only issue — the code is correct. The architecture description in docs/architecture/orchestrator.md was updated correctly (line 67), and the README's model/config sections (lines 153, 180) were updated correctly. Only this "How it works" bullet was missed.
Overall assessment
The code changes are solid. The linter, noqa suppression, regex, shebang scanning, retry logic, and --system-prompt usage are all correct. 34 tests pass, the linter runs clean on the codebase. The one remaining issue is a stale sentence in the README.
— Authored by egg
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
|
Addressed the remaining suggestion from the re-review. Changes pushed in 37ff830. #1 — README "How it works" step 3 is stale (fixed): Updated step 3 from "Container calls the Anthropic Messages API (via gateway credential injection)" to "Container runs All 34 tests pass, ruff and model alias linter clean. — Authored by egg |
This comment has been minimized.
This comment has been minimized.
There was a problem hiding this comment.
No agent-mode design concerns.
Re-reviewed delta since 5d4b64f. The only change is the README step 3 fix (updating "calls the Anthropic Messages API" to "runs claude --print"), which was the single suggestion from my last review. Correctly applied.
The broader PR remains well-aligned with agent-mode principles: egg-health-inspect uses the single-Claude-pathway (claude --print with --system-prompt), context is small metadata suitable for --max-turns 1, and the removal of the tester gap-finding feedback loop (_read_tester_gaps) actually eliminates an orchestrator-side post-processing pattern in favor of simpler direct feedback flow.
— Authored by egg
This comment has been minimized.
This comment has been minimized.
There was a problem hiding this comment.
Re-review: Enforce Claude Code pathway and model aliases
Reviewed delta from 5d4b64f..37ff830 (1 new commit). The commit fixes the stale README "How it works" step 3 that I flagged in the previous re-review.
Delta review
The change is a single-line documentation fix in orchestrator/health_checks/README.md:164:
- Container calls the Anthropic Messages API (via gateway credential injection)
+ Container runs `claude --print` with a system prompt instructing Claude to produce a JSON verdict
Correct — matches the actual implementation.
Previous feedback verification (cumulative)
| # | Issue | Status |
|---|---|---|
| 1 | README documents env vars script doesn't use | Fixed — ANTHROPIC_API_KEY and ANTHROPIC_BASE_URL removed |
| 2 | No retry logic | Fixed — _MAX_RETRIES = 1 with retry on TimeoutExpired and RuntimeError |
| 3 | System prompt merged into user prompt | Fixed — uses --system-prompt flag |
| 4 | Linter only scans .py files |
Fixed — shebang-based scanning added |
| 5 | _suggest_alias arbitrary fallback |
Fixed — uses match.group(3) directly |
| 6 | claude-haiku-4-5 model ID comment |
Fixed — comment clarified |
| 7 | orchestrator.md (sonnet) description |
Fixed — says (default model: sonnet) |
| 8 | noqa comments polluting docstrings | Fixed — moved to preceding standalone comment lines, _has_noqa checks lineno - 1 |
| 9 | Regex matches bare claude-sonnet |
Fixed — lookahead (?=[\w-]*\d) requires at least one digit |
| 10 | README "How it works" step 3 stale | Fixed — this commit |
No new issues found. All feedback from previous rounds has been properly addressed.
— Authored by egg
This comment has been minimized.
This comment has been minimized.
| egg is addressing review feedback... |
|
egg feedback addressed. View run logs 13 previous review(s) hidden. |
Add architectural conventions from PR #868 (LLM-calls-only-in-sandbox, EGG200 linter) and PR #873 (Claude Code headless pathway, model alias enforcement, EGG201 linter) to the agent-mode design guide, shared review criteria, GHA prompt builder inline fallback, orchestrator inline fallback, and workflow trigger paths.
* Update agent-mode design guide and reviewers for LLM boundary rules Add architectural conventions from PR #868 (LLM-calls-only-in-sandbox, EGG200 linter) and PR #873 (Claude Code headless pathway, model alias enforcement, EGG201 linter) to the agent-mode design guide, shared review criteria, GHA prompt builder inline fallback, orchestrator inline fallback, and workflow trigger paths. * Fix misplaced EGG201 reference and clarify item 6 vs 7 scope --------- Co-authored-by: egg <egg@localhost> Co-authored-by: egg-reviewer[bot] <261018737+egg-reviewer[bot]@users.noreply.github.com>
Summary
egg-health-inspectto use Claude Code headless mode (claude --print) instead of directhttpx.post()API callsscripts/check-model-versions.pyAST-based linter (EGG201) enforcing model alias form (sonnet/opus/haikuinstead ofclaude-sonnet-4-20250514)ANTHROPIC_HEALTH_CHECK_MODELfromclaude-3-haiku-20240307toclaude-haiku-4-5Context
PR #868 moved the agent inspector's LLM call from the orchestrator into the sandbox, which was the right architectural direction. However, the sandbox script still called the Anthropic API directly via
httpx.post()with an API key, bypassing Claude Code. Additionally, it pinnedclaude-sonnet-4-20250514rather than using the alias form.This PR enforces two invariants going forward:
claude --print(Claude Code headless), never through direct API calls. The rewrittenegg-health-inspectusessubprocess.run(["claude", "--print", ...])with--max-turns 1.sonnet,opus,haiku) rather than full model identifiers (claude-sonnet-4-20250514). A new AST-based linter detects violations with# noqa: EGG201suppression for legitimate exceptions (docstring examples, raw API health checks).Both linters are wired into
.github/workflows/lint.yml. All 174 tests pass (27 new + 24 + 54 + 69 existing).Issue: none
Test plan:
pytest scripts/tests/test_check_model_versions.py -v— 27 passedpytest scripts/tests/test_check_llm_api_calls.py -v— 24 passedPYTHONPATH=shared pytest orchestrator/tests/test_health_check_tier2.py orchestrator/tests/test_health_check_tier2_tester.py -v— 123 passedpython scripts/check-model-versions.py— OKpython scripts/check-llm-api-calls.py— OKAuthored-by: egg