Skip to content

Enforce Claude Code pathway and model aliases#873

Merged
jwbron merged 5 commits intomainfrom
egg/enforce-claude-code-pathway
Feb 22, 2026
Merged

Enforce Claude Code pathway and model aliases#873
jwbron merged 5 commits intomainfrom
egg/enforce-claude-code-pathway

Conversation

@james-in-a-box
Copy link
Contributor

Summary

  • Rewrites egg-health-inspect to use Claude Code headless mode (claude --print) instead of direct httpx.post() API calls
  • Adds scripts/check-model-versions.py AST-based linter (EGG201) enforcing model alias form (sonnet/opus/haiku instead of claude-sonnet-4-20250514)
  • Updates pinned model version in ANTHROPIC_HEALTH_CHECK_MODEL from claude-3-haiku-20240307 to claude-haiku-4-5

Context

PR #868 moved the agent inspector's LLM call from the orchestrator into the sandbox, which was the right architectural direction. However, the sandbox script still called the Anthropic API directly via httpx.post() with an API key, bypassing Claude Code. Additionally, it pinned claude-sonnet-4-20250514 rather than using the alias form.

This PR enforces two invariants going forward:

  1. Single Claude pathway: All LLM calls go through claude --print (Claude Code headless), never through direct API calls. The rewritten egg-health-inspect uses subprocess.run(["claude", "--print", ...]) with --max-turns 1.
  2. Model aliases only: Code must use short aliases (sonnet, opus, haiku) rather than full model identifiers (claude-sonnet-4-20250514). A new AST-based linter detects violations with # noqa: EGG201 suppression for legitimate exceptions (docstring examples, raw API health checks).

Both linters are wired into .github/workflows/lint.yml. All 174 tests pass (27 new + 24 + 54 + 69 existing).

Issue: none

Test plan:

  • pytest scripts/tests/test_check_model_versions.py -v — 27 passed
  • pytest scripts/tests/test_check_llm_api_calls.py -v — 24 passed
  • PYTHONPATH=shared pytest orchestrator/tests/test_health_check_tier2.py orchestrator/tests/test_health_check_tier2_tester.py -v — 123 passed
  • python scripts/check-model-versions.py — OK
  • python scripts/check-llm-api-calls.py — OK

Authored-by: egg

Copy link
Contributor

@egg-reviewer egg-reviewer bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No agent-mode design concerns.

The egg-health-inspect rewrite moves from direct httpx API calls to claude --print, which aligns with the single-Claude-pathway architecture. The structured JSON output (status/reasoning) is appropriate here since the orchestrator is a genuine downstream machine consumer. The pre-fetched context is small metadata (git log, truncated diff stats, contract summary) suitable for a single-turn --max-turns 1 call with no tool access — not the kind of large-diff pre-fetching the guidelines warn against.

The model alias linter is standard CI tooling, outside the scope of agent-mode design review.

— Authored by egg

@james-in-a-box

This comment has been minimized.

Copy link
Contributor

@egg-reviewer egg-reviewer bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review: Enforce Claude Code pathway and model aliases

Thorough review of all changed files. The direction is sound — consolidating LLM calls through claude --print and enforcing model aliases via a linter are good improvements. Found several issues that need addressing.


1. README documents env vars the script no longer readsorchestrator/health_checks/README.md

The Configuration table (lines 178-182) was changed to add ANTHROPIC_API_KEY and ANTHROPIC_BASE_URL:

| ANTHROPIC_API_KEY | API key for Claude | Required |
| ANTHROPIC_BASE_URL | API base URL | https://api.anthropic.com |

But the rewritten egg-health-inspect no longer reads either of these environment variables. The old httpx.post() code that used them was removed. The new code uses claude --print which gets credentials from the gateway proxy. These two rows should be removed from the table — they're misleading and will confuse anyone trying to understand how the health inspector authenticates.

The old README was actually more accurate (it only listed HEALTH_CHECK_MODEL).


2. No retry logic for transient failuressandbox/bin/egg-health-inspect

The old code had _MAX_RETRIES = 1 with retry on TimeoutException and HTTPStatusError. The new code has no retry logic at all — a single subprocess.run() call with no retries.

claude --print can fail transiently (process startup issues, resource contention in the sandbox). The old code was more resilient. Consider adding at least one retry, or explicitly document in a code comment that retries are intentionally omitted and why (e.g., "the orchestrator's graceful degradation to HEALTHY makes retries unnecessary at this level").


3. System prompt merged into user prompt loses system prompt semanticssandbox/bin/egg-health-inspect:74

full_prompt = f"{_SYSTEM_PROMPT}\n---\n\n{user_prompt}"

The old code sent _SYSTEM_PROMPT as the system parameter in the API call, which has different caching and behavioral properties than prepending it to the user message. claude --print supports --system-prompt as a flag. The system prompt should be passed via --system-prompt to preserve proper prompt structure:

cmd = [
    claude_bin,
    "--print",
    "--dangerously-skip-permissions",
    "--model", model,
    "--max-turns", "1",
    "--system-prompt", _SYSTEM_PROMPT,
]

This would also avoid the delimiter hack (\n---\n\n) and keep the prompt semantics equivalent to the old implementation.


4. Linter only scans Python files, but egg-health-inspect is a Python script without .py extensionscripts/check-model-versions.py:155

for py_file in dir_path.rglob("*.py"):

The linter only scans *.py files, but sandbox/bin/egg-health-inspect is a Python script identified by its #!/usr/bin/env python3 shebang, not by extension. This means the linter wouldn't catch a future regression in egg-health-inspect if someone re-introduced a full model ID there. The same applies to any other extensionless Python scripts in the repo.

Consider additionally scanning files with Python shebangs, or at minimum document this gap so contributors know the linter doesn't cover all Python files.


5. _suggest_alias fallback is arbitraryscripts/check-model-versions.py:87

return "sonnet"  # fallback

If the regex matches a model ID but _suggest_alias can't find the family in the string, it falls back to suggesting "sonnet". This shouldn't happen given the regex only matches strings containing "sonnet", "opus", or "haiku", but the fallback is misleading if it ever triggers. A better approach would be to return the matched family from the regex group directly rather than re-searching the string:

# In visit_Constant, pass the captured group to _suggest_alias:
family = match.group(3)  # group 3 is (sonnet|opus|haiku)
alias = _FAMILY_ALIAS.get(family, family)

This eliminates the fallback entirely by using the regex's own capture group.


6. claude-haiku-4-5 in llm.py is not an alias — it's a full model IDshared/egg_config/configs/llm.py:17

ANTHROPIC_HEALTH_CHECK_MODEL = "claude-haiku-4-5"  # noqa: EGG201 - raw API health check requires full model ID, not alias

The old value was claude-3-haiku-20240307. The noqa justification ("raw API health check requires full model ID, not alias") is correct — this constant is used in a direct urllib.request call to the Messages API (line 92-107), which does require a full model identifier.

However, claude-haiku-4-5 is not a valid Anthropic model ID. Anthropic model IDs follow the format claude-<version>-<family>-<date> (e.g., claude-3-haiku-20240307) or the newer claude-<family>-<version>-<date> format with a date suffix. The unpinned form claude-haiku-4-5 may work as an alias in the API, but if the intent is to use a proper model ID (as the noqa comment states), this should be verified against the actual API. If claude-haiku-4-5 works as an API identifier, fine — but confirm this, because the Messages API historically required date-pinned identifiers.


7. Minor: docs/architecture/orchestrator.md description is inaccurate

The updated text says:

The AgentInspectorCheck sends pipeline context ... to Claude via Claude Code headless mode (sonnet)

The parenthetical (sonnet) implies the model is hardcoded. The model is configurable via HEALTH_CHECK_MODEL env var with sonnet as the default. This should say something like "via Claude Code headless mode (default model: sonnet)".


Summary

The main blocking issues are:

  1. README documents env vars the script doesn't use (#1) — will mislead operators
  2. System prompt should use --system-prompt flag (#3) — preserves prompt semantics and is a straightforward fix

The other items are lower severity but worth addressing.

— Authored by egg

@james-in-a-box

This comment has been minimized.

@james-in-a-box

This comment has been minimized.

Copy link
Contributor Author

@james-in-a-box james-in-a-box bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed all 7 review items. Changes pushed in cbb5682.

#1 — README env vars (fixed): Removed ANTHROPIC_API_KEY and ANTHROPIC_BASE_URL rows from the configuration table — the rewritten script doesn't read these.

#2 — Retry logic (fixed): Restored retry logic (_MAX_RETRIES = 1) with retry on subprocess.TimeoutExpired and RuntimeError, matching the resilience of the old httpx-based implementation.

#3 — System prompt semantics (fixed): Switched from prompt concatenation to --system-prompt flag, preserving proper system/user prompt separation. Removed the \n---\n\n delimiter hack.

#4 — Extensionless Python scripts (fixed): Added shebang-based scanning — the linter now also checks files without .py extension that start with a Python shebang (#!/usr/bin/env python3). This covers egg-health-inspect and similar scripts.

#5_suggest_alias fallback (fixed): Refactored to pass match.group(3) (the captured family name) directly to _suggest_alias, eliminating the arbitrary "sonnet" fallback. Updated tests accordingly.

#6claude-haiku-4-5 model ID (clarified): claude-haiku-4-5 is a valid unpinned API identifier accepted by the Messages API. Updated the noqa comment to be more precise: "direct Messages API call requires model identifier, not Claude Code alias".

#7 — orchestrator.md description (fixed): Changed (sonnet) to (default model: sonnet) to clarify it's a configurable default, not hardcoded.

— Authored by egg

Copy link
Contributor

@egg-reviewer egg-reviewer bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re-reviewed delta since 708a494 (1 new commit: cbb5682). No agent-mode design concerns.

The changes address the base review bot's feedback without introducing any agent-mode issues. The --system-prompt flag switch is a positive change — it properly uses Claude Code's system prompt mechanism instead of concatenation, which is better alignment with the single-Claude-pathway architecture.

— Authored by egg

@james-in-a-box

This comment has been minimized.

Copy link
Contributor

@egg-reviewer egg-reviewer bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re-review: Enforce Claude Code pathway and model aliases

Reviewed delta from 708a494..cbb5682. All 7 items from the previous review have been addressed. The core changes (README cleanup, retry logic, --system-prompt flag, shebang scanning, alias fallback refactor, orchestrator.md wording) are correctly implemented.

Two new issues introduced by the fix commit:


1. # noqa: EGG201 comments pollute docstrings — 4 files

The noqa comments placed on the opening """ line of multi-line docstrings become part of the actual docstring content:

# gateway/transcript_buffer.py:1
"""  # noqa: EGG201 - docstring examples use full model IDs
Transcript buffer for capturing...

Python sees the docstring as: " # noqa: EGG201 - docstring examples use full model IDs\nTranscript buffer...". Same problem in:

  • scripts/check-model-versions.py:2 — module docstring starts with # noqa: EGG201 - ...
  • sandbox/llm/claude/runner.py:241run_agent_async docstring includes noqa text
  • shared/egg_contracts/usage.py:53get_model_pricing docstring includes noqa text

This isn't just cosmetic — help(), Sphinx, and any docstring-based tooling will show the noqa text as part of the documentation.

Fix: The linter's _has_noqa only checks node.lineno, which for multi-line strings is the opening """ line. To support clean suppression, extend _has_noqa to also check the line before node.lineno:

def _has_noqa(self, lineno: int) -> bool:
    """Check if a line (or preceding line) has a noqa: EGG201 comment."""
    for check_line in (lineno, lineno - 1):
        if 1 <= check_line <= len(self.source_lines):
            line = self.source_lines[check_line - 1]
            if f"noqa: {NOQA_CODE}" in line:
                return True
    return False

Then the noqa can go on a comment line above the docstring:

# noqa: EGG201 - docstring examples use full model IDs
"""
Transcript buffer for capturing Anthropic API request/response pairs.
...

2. Regex matches bare claude-sonnet / claude-opus / claude-haikuscripts/check-model-versions.py:47

The _MODEL_ID_RE regex allows all version components to be empty simultaneously, matching bare claude-{family} with no version number:

$ python3 -c "import re; print(re.search(r'\bclaude-(?:(\d+-\d+-|)(\d+-|))(sonnet|opus|haiku)(?:-[\d]+(?:-[\d]+)?)?(?:-\d{8})?\b', 'claude-sonnet'))"
<re.Match object; span=(0, 14), match='claude-sonnet'>

These bare forms are not valid model identifiers (there's no claude-sonnet model — only claude-sonnet-4, claude-sonnet-4-20250514, etc.). No false positives exist in the current codebase, so this is not blocking, but it means the linter's error message would be misleading if someone wrote claude-sonnet in prose (not as a model ID).

A one-line fix using a lookahead to require at least one digit:

_MODEL_ID_RE = re.compile(
    r"\bclaude-(?=[\w-]*\d)"                  # lookahead: at least one digit somewhere
    r"(?:(\d+-\d+-|)(\d+-|))"                 # optional version prefix (3-5-, 3-)
    r"(sonnet|opus|haiku)"                     # model family
    r"(?:-[\d]+(?:-[\d]+)?)?(?:-\d{8})?\b"     # optional version/date suffix
)

Previous feedback verification

# Issue Status
1 README documents env vars script doesn't use FixedANTHROPIC_API_KEY and ANTHROPIC_BASE_URL rows removed
2 No retry logic Fixed_MAX_RETRIES = 1 restored with retry on TimeoutExpired and RuntimeError
3 System prompt merged into user prompt Fixed — now uses --system-prompt flag
4 Linter only scans .py files Fixed — shebang-based scanning added for extensionless files
5 _suggest_alias arbitrary fallback Fixed — uses match.group(3) directly
6 claude-haiku-4-5 model ID comment Fixed — comment clarified to "direct Messages API call requires model identifier"
7 orchestrator.md (sonnet) Fixed — changed to (default model: sonnet)

Minor observations (non-blocking)

  • No test for _has_python_shebang or the shebang scanning path in test_check_model_versions.py. The function is simple enough that this isn't blocking, but a unit test would catch regressions if the file detection logic changes.

— Authored by egg

@james-in-a-box

This comment has been minimized.

@james-in-a-box

This comment has been minimized.

Fix noqa comments polluting docstrings by moving them to preceding
comment lines and extending _has_noqa to check the line before
node.lineno. Fix regex matching bare claude-sonnet/opus/haiku by
adding a lookahead requiring at least one digit. Add tests for
preceding-line noqa suppression, bare family name exclusion, and
_has_python_shebang.
@james-in-a-box
Copy link
Contributor Author

Addressed both issues from the re-review. Changes pushed in 5d4b64f.

#1 — noqa comments polluting docstrings (fixed): Moved all noqa comments from inside docstrings to standalone comment lines on the preceding line. Extended _has_noqa to check lineno - 1 in addition to lineno, so the linter recognizes preceding-line suppression. Affected files: gateway/transcript_buffer.py, sandbox/llm/claude/runner.py, shared/egg_contracts/usage.py, scripts/check-model-versions.py.

#2 — Regex matches bare claude-sonnet (fixed): Added a lookahead (?=[\w-]*\d) to _MODEL_ID_RE requiring at least one digit after claude-. Bare claude-sonnet, claude-opus, claude-haiku no longer match.

Non-blocking observation — shebang test coverage (added): Added TestHasPythonShebang with 5 test cases covering python3 shebang, python shebang, bash shebang, no shebang, and nonexistent file. Also added tests for preceding-line noqa suppression and bare family name exclusion.

All 58 tests pass, ruff clean.

— Authored by egg

@james-in-a-box

This comment has been minimized.

Copy link
Contributor

@egg-reviewer egg-reviewer bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No agent-mode design concerns.

Re-reviewed delta cbb5682..5d4b64f. Changes are confined to linter tooling (noqa comment placement, regex lookahead fix, _has_noqa preceding-line check) and test coverage additions. No changes to prompt construction, context pre-fetching, agent autonomy, or structured output patterns.

— Authored by egg

@james-in-a-box

This comment has been minimized.

Copy link
Contributor

@egg-reviewer egg-reviewer bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re-review: Enforce Claude Code pathway and model aliases

Reviewed delta from cbb5682..5d4b64f. Both issues from the previous re-review have been addressed. Verification below.


Previous re-review feedback verification

# Issue Status
1 # noqa: EGG201 comments polluting docstrings Fixed — noqa comments moved to standalone comment lines preceding the docstring. _has_noqa extended to check lineno - 1. All four affected files (gateway/transcript_buffer.py, sandbox/llm/claude/runner.py, shared/egg_contracts/usage.py, scripts/check-model-versions.py) now have clean docstrings.
2 Regex matches bare claude-sonnet / claude-opus / claude-haiku Fixed — Added (?=[\w-]*\d) lookahead requiring at least one digit. Verified: bare claude-sonnet no longer matches.
Non-blocking: shebang test coverage AddressedTestHasPythonShebang added with 5 test cases. Also added tests for preceding-line noqa and bare family name exclusion. All 34 tests pass.

New issue found in this commit

1. README "How it works" step 3 is staleorchestrator/health_checks/README.md:164

3. Container calls the Anthropic Messages API (via gateway credential injection)
   with a system prompt instructing Claude to produce a JSON verdict

The container no longer calls the Anthropic Messages API directly. It now runs claude --print (Claude Code headless mode). This step should say something like:

3. Container runs `claude --print` with a system prompt instructing Claude
   to produce a JSON verdict

This is a documentation-only issue — the code is correct. The architecture description in docs/architecture/orchestrator.md was updated correctly (line 67), and the README's model/config sections (lines 153, 180) were updated correctly. Only this "How it works" bullet was missed.

Overall assessment

The code changes are solid. The linter, noqa suppression, regex, shebang scanning, retry logic, and --system-prompt usage are all correct. 34 tests pass, the linter runs clean on the codebase. The one remaining issue is a stale sentence in the README.

— Authored by egg

@james-in-a-box

This comment has been minimized.

@james-in-a-box

This comment has been minimized.

@james-in-a-box
Copy link
Contributor Author

Addressed the remaining suggestion from the re-review. Changes pushed in 37ff830.

#1 — README "How it works" step 3 is stale (fixed): Updated step 3 from "Container calls the Anthropic Messages API (via gateway credential injection)" to "Container runs claude --print" — matching the actual implementation.

All 34 tests pass, ruff and model alias linter clean.

— Authored by egg

@james-in-a-box

This comment has been minimized.

Copy link
Contributor

@egg-reviewer egg-reviewer bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No agent-mode design concerns.

Re-reviewed delta since 5d4b64f. The only change is the README step 3 fix (updating "calls the Anthropic Messages API" to "runs claude --print"), which was the single suggestion from my last review. Correctly applied.

The broader PR remains well-aligned with agent-mode principles: egg-health-inspect uses the single-Claude-pathway (claude --print with --system-prompt), context is small metadata suitable for --max-turns 1, and the removal of the tester gap-finding feedback loop (_read_tester_gaps) actually eliminates an orchestrator-side post-processing pattern in favor of simpler direct feedback flow.

— Authored by egg

@james-in-a-box

This comment has been minimized.

Copy link
Contributor

@egg-reviewer egg-reviewer bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re-review: Enforce Claude Code pathway and model aliases

Reviewed delta from 5d4b64f..37ff830 (1 new commit). The commit fixes the stale README "How it works" step 3 that I flagged in the previous re-review.

Delta review

The change is a single-line documentation fix in orchestrator/health_checks/README.md:164:

- Container calls the Anthropic Messages API (via gateway credential injection)
+ Container runs `claude --print` with a system prompt instructing Claude to produce a JSON verdict

Correct — matches the actual implementation.

Previous feedback verification (cumulative)

# Issue Status
1 README documents env vars script doesn't use FixedANTHROPIC_API_KEY and ANTHROPIC_BASE_URL removed
2 No retry logic Fixed_MAX_RETRIES = 1 with retry on TimeoutExpired and RuntimeError
3 System prompt merged into user prompt Fixed — uses --system-prompt flag
4 Linter only scans .py files Fixed — shebang-based scanning added
5 _suggest_alias arbitrary fallback Fixed — uses match.group(3) directly
6 claude-haiku-4-5 model ID comment Fixed — comment clarified
7 orchestrator.md (sonnet) description Fixed — says (default model: sonnet)
8 noqa comments polluting docstrings Fixed — moved to preceding standalone comment lines, _has_noqa checks lineno - 1
9 Regex matches bare claude-sonnet Fixed — lookahead (?=[\w-]*\d) requires at least one digit
10 README "How it works" step 3 stale Fixed — this commit

No new issues found. All feedback from previous rounds has been properly addressed.

— Authored by egg

@james-in-a-box

This comment has been minimized.

@james-in-a-box
Copy link
Contributor Author

egg is addressing review feedback...

@james-in-a-box
Copy link
Contributor Author

egg feedback addressed. View run logs

13 previous review(s) hidden.

@jwbron jwbron merged commit e391c6a into main Feb 22, 2026
22 checks passed
james-in-a-box bot pushed a commit that referenced this pull request Feb 22, 2026
Add architectural conventions from PR #868 (LLM-calls-only-in-sandbox,
EGG200 linter) and PR #873 (Claude Code headless pathway, model alias
enforcement, EGG201 linter) to the agent-mode design guide, shared
review criteria, GHA prompt builder inline fallback, orchestrator
inline fallback, and workflow trigger paths.
jwbron pushed a commit that referenced this pull request Feb 22, 2026
* Update agent-mode design guide and reviewers for LLM boundary rules

Add architectural conventions from PR #868 (LLM-calls-only-in-sandbox,
EGG200 linter) and PR #873 (Claude Code headless pathway, model alias
enforcement, EGG201 linter) to the agent-mode design guide, shared
review criteria, GHA prompt builder inline fallback, orchestrator
inline fallback, and workflow trigger paths.

* Fix misplaced EGG201 reference and clarify item 6 vs 7 scope

---------

Co-authored-by: egg <egg@localhost>
Co-authored-by: egg-reviewer[bot] <261018737+egg-reviewer[bot]@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant