Unify orchestrator and CLI container launch via shared config builder#594
Unify orchestrator and CLI container launch via shared config builder#594
Conversation
… init Three root causes fixed for pipeline container failures: 1. **Network mode**: Containers always launched on egg-isolated network regardless of mode. Public-mode (issue) pipelines now use egg-external network with correct gateway/orchestrator IPs. 2. **Missing CONTAINER_ID**: Orchestrator never set CONTAINER_ID env var in spawned containers, breaking gateway worktree path mapping (git wrapper sent "unknown" as container_id). 3. **Contract init**: Contract creation failures were silently swallowed, leaving pipelines without contracts. Now fails fast with cleanup. Additional improvements: - Subprocess stderr captured to log file for error signal context - Container log tail increased from 50 to 200 lines - Entrypoint reads EGG_PRIVATE_MODE env var instead of inferring from HTTPS_PROXY presence Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The orchestrator's StateStore called `git worktree prune` when initializing its state worktree. Since the orchestrator container doesn't have /home/egg/.egg-worktrees mounted, prune saw all gateway-managed worktrees as stale and deleted their admin dirs. This broke all container git operations with: fatal: not a git repository: .git/worktrees/egg Replace `git worktree prune` with _remove_stale_admin_dir() that only removes the admin dir belonging to the state worktree, leaving gateway-managed worktree admin dirs intact. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The orchestrator was building container configs independently from the CLI, causing missing .git shadowing, no --add-host for gateway hostname, GATEWAY_URL overwritten by gateway_client, no DNS lockdown for private mode, and missing proxy env vars. This commit introduces a shared config builder (build_sandbox_config + to_dockerpy_kwargs) so both CLI and orchestrator produce identical container configurations. Key changes: - Add MountSpec, SandboxContainerConfig dataclasses and build_sandbox_config() builder to shared/egg_container - Add git_shadow_mounts() for .git shadow mount generation (uses /dev/null bind mount for worktrees, tmpfs for regular repos) - Add to_dockerpy_kwargs() adapter using Docker API mounts list format to support multiple repos with same source (no dict key collision) - Refactor build_sandbox_docker_cmd() to use build_sandbox_config() internally - Refactor orchestrator container_spawner to use shared config builder - Remove get_container_env() and get_proxy_config() from gateway_client (subsumed by shared config builder) - Refactor runtime.py to use shared git_shadow_mounts() - Add egg_container to orchestrator Dockerfile COPY - Add comprehensive tests (42 new tests for config builder) - Update orchestrator tests for new spawn_agent_container() signature Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
There was a problem hiding this comment.
Review: Unify orchestrator and CLI container launch via shared config builder
Good architectural direction. Eliminating the divergence between CLI and orchestrator container configurations by extracting a shared config builder addresses a real class of bugs. The git_shadow_mounts, build_sandbox_config, and to_dockerpy_kwargs abstractions are clean and well-tested. The worktree prune fix in state_store.py is an important correctness fix.
Several issues to address:
1. Bug: EGG_PRIVATE_MODE vs PRIVATE_MODE env var mismatch
sandbox/entrypoint.py:902 — The entrypoint reads EGG_PRIVATE_MODE to detect private mode:
private_mode_env = os.environ.get("EGG_PRIVATE_MODE", "").lower()But the shared config builder (shared/egg_container/__init__.py:203) sets PRIVATE_MODE (without the EGG_ prefix):
env["PRIVATE_MODE"] = "true"So containers launched via the new shared builder will never have EGG_PRIVATE_MODE set. The entrypoint's new detection code will always fall through to the legacy fallback (proxy presence inference), which defeats the purpose of making detection explicit.
The sandbox_template.py:137 (not changed in this PR) also sets EGG_PRIVATE_MODE, confirming that's the expected env var name inside containers. The PRIVATE_MODE (no prefix) is used by the gateway process itself, not by sandbox containers.
Fix: Either change the config builder to set EGG_PRIVATE_MODE in addition to (or instead of) PRIVATE_MODE, or change the entrypoint to read PRIVATE_MODE. The former is safer since sandbox_template.py already uses EGG_PRIVATE_MODE.
2. Bug: Duplicate host_uid/host_gid parsing in container_spawner.py
container_spawner.py:273-274 parses HOST_UID/HOST_GID inside the gateway session registration block:
host_uid = int(os.environ.get("HOST_UID", 1000))
host_gid = int(os.environ.get("HOST_GID", 1000))Then container_spawner.py:305-306 parses the same values again outside that block:
host_uid = int(os.environ.get("HOST_UID", 1000))
host_gid = int(os.environ.get("HOST_GID", 1000))The first parse at line 273 is inside a try block that can be skipped on GatewayError. If the gateway registration fails, the first parse is wasted. The second parse at line 305 always runs. This is harmless but is clearly a copy-paste artifact — the first occurrence should be removed, or the parsing should be hoisted above both blocks.
3. mount_spec_to_cli_args silently drops volume-type mounts
shared/egg_container/__init__.py:127-140 — The function handles bind and tmpfs mount types, but returns [] for volume type mounts. The build_sandbox_docker_cmd CLI path iterates through config.mounts and calls mount_spec_to_cli_args for each — any volume mounts would be silently dropped.
Currently the CLI path doesn't pass mounts to build_sandbox_config, so this isn't triggered yet. But it's a latent bug since MountSpec supports mount_type="volume" and the orchestrator path handles it. If a caller ever passes a volume mount through the CLI path, it would silently vanish.
Suggestion: Either handle the volume case (--mount type=volume,source=...,destination=...) or raise an error for unsupported types.
4. _tee_stderr_to_file unbounded file growth
sandbox/entrypoint.py:1290-1309 — The tee function writes every stderr line to /tmp/egg-subprocess-stderr.log without any size limit. Long-running containers producing substantial stderr output will fill /tmp. Only the tail is ever read back (_read_subprocess_stderr_tail), so the intermediate data is waste.
Consider either:
- Using a ring buffer (e.g.,
collections.deque(maxlen=N)) and writing only at the end - Truncating/rotating the file periodically
- Since you only need the last N lines, keeping a bounded in-memory buffer and writing to disk at process exit would be simpler and safer
5. Minor: _SUBPROCESS_STDERR_LOG placement breaks import readability
sandbox/entrypoint.py:34-35:
_SUBPROCESS_STDERR_LOG = Path("/tmp/egg-subprocess-stderr.log")
from typing import Any, ClassVarA module-level assignment is wedged between import blocks. Move it after all imports for clarity.
6. Signal handling interaction with stderr PIPE
sandbox/entrypoint.py:1322-1337 — With stderr=subprocess.PIPE, if the parent process receives a signal (SIGTERM during container shutdown), process.wait() may return while the tee thread is still reading from the pipe. The tee_thread.join(timeout=5) handles this, but if the child writes a burst of stderr just before termination, up to 5 seconds of shutdown delay is added. This may be acceptable, but worth noting that the previous subprocess.run(stderr=sys.stderr) approach had zero shutdown overhead for this case.
7. Contract creation failure handling — error message leaks internal details
orchestrator/routes/pipelines.py:370-373 and pipelines.py:441-444 — On contract creation failure, the error response includes the raw exception:
return make_error_response(
f"Failed to create contract: {contract_err}",
status_code=500,
)This is an improvement over silently continuing with a broken pipeline, which is good. But the raw exception string may contain internal paths or stack details. If this endpoint is exposed beyond internal orchestrator use, consider sanitizing the error message.
8. assume_worktree docstring is misleading
shared/egg_container/__init__.py:84-85 — The docstring says:
assume_worktree: If True, always use tmpfs
But the actual behavior when assume_worktree=True is to use /dev/null bind mounts (file-over-file), not tmpfs. The code comment at line 95-99 correctly explains the reasoning. The docstring should be updated to match.
Overall the design is sound — the shared config builder is a meaningful consolidation that eliminates a real category of configuration drift bugs. The state_store.py worktree prune fix correctly addresses a subtle cross-container filesystem visibility issue. Items 1 and 8 need fixing before merge; the rest are improvements.
— Authored by egg
This comment has been minimized.
This comment has been minimized.
|
@james-in-a-box address the above feedback |
|
Addressed the review feedback. Here's what changed: 1. Bug fix:
|
There was a problem hiding this comment.
Re-review: Unify orchestrator and CLI container launch via shared config builder
All eight items from the previous review have been addressed. The fixes are substantive, not superficial. One new issue found in the cleanup path.
Previous feedback — verification
#1 (Bug: EGG_PRIVATE_MODE mismatch): Fixed. build_sandbox_config() now sets both PRIVATE_MODE and EGG_PRIVATE_MODE for both private and public modes (__init__.py:215-216 and __init__.py:224-225). The entrypoint at entrypoint.py:903 reads EGG_PRIVATE_MODE and will now get the correct value. Good.
#2 (Bug: Duplicate host_uid/host_gid parsing): Fixed. Parsing is hoisted to container_spawner.py:265-266, above both the gateway try block and the config builder call. The redundant parse is removed.
#3 (mount_spec_to_cli_args volume support): Fixed. Volume type is now handled at __init__.py:143-148, and unrecognized types raise ValueError at line 152. Three new tests cover this (test_volume_mount, test_volume_mount_readwrite, test_unsupported_mount_type_raises).
#4 (_tee_stderr_to_file bounded): Fixed. Uses collections.deque(maxlen=500) ring buffer at entrypoint.py:1307. Only the tail is written to disk when the stream ends. The unbounded growth issue is resolved.
#5 (_SUBPROCESS_STDERR_LOG placement): Fixed. Moved to entrypoint.py:37-38, after all imports and before the first section header. Clean.
#6 (Signal handling / shutdown delay): Acknowledged, no change needed. Acceptable.
#7 (Error message sanitization): Acknowledged, no change needed. Acceptable for internal endpoint.
#8 (assume_worktree docstring): Fixed. The git_shadow_mounts docstring at __init__.py:84-87 now correctly describes /dev/null bind mounts (file-over-file) behavior.
New issues
1. Bug: Pipeline cleanup on contract failure does not commit deletion
pipelines.py:365 and pipelines.py:436 — When contract creation fails, the cleanup calls store.delete_pipeline(pipeline.id, commit=False). But create_pipeline() (called just above) commits the pipeline file to the git worktree by default. So the deletion only removes the file from the filesystem — it is not committed to git.
On the next orchestrator restart (or any call that re-reads the worktree from git), the zombie pipeline reappears without a contract. The commit=False argument was presumably intended to avoid noisy commits during an error path, but it defeats the cleanup.
Fix: Pass commit=True (or omit commit to use the default), so the deletion is actually persisted. For issue pipelines in particular, delete_pipeline already commits when commit=True and the pipeline is not local-prefixed.
2. Stale comment: assume_worktree=True (tmpfs) in container_spawner.py
container_spawner.py:248:
# Orchestrator can't stat host paths, so assume_worktree=True (tmpfs).This is left over from before the docstring fix. assume_worktree=True produces /dev/null bind mounts (file-over-file), not tmpfs. The parenthetical should say (/dev/null bind) or simply be removed.
Item 1 is a correctness bug that should be fixed before merge. Item 2 is minor.
— Authored by egg
This comment has been minimized.
This comment has been minimized.
Three bugs caused failures when the orchestrator spawns multiple containers (coder + reviewer) in the refine phase: 1. Session-container IP binding: register_session() hardcodes EGG_CONTAINER_IP but Docker assigns actual IPs dynamically. After container start, update the gateway session with the actual IP so the gateway accepts git operations from reviewer containers. 2. AgentRole mapping crash: REVIEWER and CHECKER roles have no ContractAgentRole equivalent. Changed map_agent_role_to_contract_role to return Optional and skip contract updates for unmappable roles, preventing KeyError on completion signals. 3. Decision queue race condition: get_decision() and get_pending_decisions() loaded pipeline state without holding the lock, allowing another thread to overwrite state between queue_decision() and wait_for_decision(). Added lock acquisition to both methods. Authored-by: egg
There was a problem hiding this comment.
Re-review: Unify orchestrator and CLI container launch via shared config builder
The branch has been rebased into a single commit (99ad2f2 Fix multi-agent pipeline bugs for refine phase). Reviewed the full PR diff against main.
Previous feedback — verification
Original review items #1–#8 were addressed in the prior round. The second review at ce25386 found two remaining issues. Neither has been fixed.
1. Bug (unfixed from prior review): delete_pipeline(commit=False) leaves zombie pipelines in git
pipelines.py:365 and pipelines.py:436 — When contract creation fails, cleanup calls:
store.delete_pipeline(pipeline.id, commit=False)create_pipeline() at state_store.py:484 calls save_pipeline(pipeline, message=commit_msg) with default commit=True, which commits the pipeline file to the git worktree. The subsequent delete_pipeline(..., commit=False) only calls path.unlink() — it removes the file from disk but does NOT commit the deletion.
On orchestrator restart (which re-reads state from git), the pipeline file reappears as a zombie without a contract. For issue pipelines specifically, is_local is False and should_commit = commit and (not is_local or force_commit) = False and True = False, confirming the deletion is not committed.
Fix: Pass commit=True (or omit the commit kwarg) so the deletion is persisted to git. The save_pipeline that created the pipeline already committed, so the deletion must also commit to undo it.
2. Stale comment (unfixed from prior review): assume_worktree=True (tmpfs)
container_spawner.py:248:
# Orchestrator can't stat host paths, so assume_worktree=True (tmpfs).assume_worktree=True produces /dev/null bind mounts (file-over-file), not tmpfs. The git_shadow_mounts docstring at __init__.py:84-87 was corrected in the prior round, but this caller-side comment was not updated.
3. Stale docstring: _spawn_and_wait says "tmpfs" for .git shadow
pipelines.py:1304-1305:
with .git shadowed by tmpfs to force gateway git operations.Same issue as #2. The orchestrator path uses assume_worktree=True, which produces /dev/null bind mounts, not tmpfs.
4. complete_agent silently drops outputs for non-contract roles
dispatch.py:189-199 — When contract_role is None (REVIEWER, CHECKER), save_agent_output is skipped:
if contract_role is not None:
self.contract_orchestrator.complete_agent(...)
if outputs:
save_agent_output(self.repo_path, contract_role, outputs)The save_agent_output call is nested inside the if contract_role is not None block, so any outputs passed for a REVIEWER or CHECKER completion are silently discarded. If this is intentional (reviewers/checkers store verdicts elsewhere), add a comment explaining why. If not, the save_agent_output should be moved outside the if block, though it would need a non-None key to save under.
Item 1 is a correctness bug that should be fixed before merge. Items 2–3 are documentation bugs. Item 4 needs clarification.
— Authored by egg
|
@james-in-a-box address this feedback: #594 (review) |
This comment has been minimized.
This comment has been minimized.
|
Addressed the review feedback. Here's what changed: 1. Bug fix:
|
There was a problem hiding this comment.
Re-review: Unify orchestrator and CLI container launch via shared config builder
All four items from the previous review have been properly addressed.
Previous feedback — verification
#1 (Bug: delete_pipeline(commit=False) zombie pipelines): Fixed. Both cleanup paths (pipelines.py:365 and pipelines.py:436) now call store.delete_pipeline(pipeline.id) with the default commit=True. For issue pipelines (is_local=False), delete_pipeline evaluates should_commit = True and (not False or False) = True, so the deletion is committed to git, matching the save_pipeline commit at creation. For local pipelines, should_commit evaluates to False, which is consistent with save_pipeline also skipping git commits for local pipelines (both disk-only). No zombie possible in either path.
#2 (Stale comment: assume_worktree=True (tmpfs)): Fixed. container_spawner.py:248 now reads (/dev/null bind).
#3 (Stale docstring: _spawn_and_wait says "tmpfs"): Fixed. pipelines.py:1305 now reads /dev/null bind mounts.
#4 (complete_agent silently drops outputs for non-contract roles): Addressed with explanatory comment at dispatch.py:197-200 explaining that REVIEWER/CHECKER roles don't have contract counterparts and their verdicts are stored in the AgentExecution record. Clear and correct.
New code — no blocking issues found
Examined the full delta. The changes are clean and well-scoped. A few minor advisory notes:
1. _tee_stderr_to_file bare except Exception: pass (entrypoint.py:1319)
The entire tee thread body is wrapped in except Exception: pass. If the tee thread fails silently (e.g., sys.stderr.buffer write fails), stderr output is lost and _SUBPROCESS_STDERR_LOG is never written. The process continues fine, but error signals to the orchestrator will have no stderr context. Acceptable for now given this is best-effort diagnostic data, but a logger.debug in the except clause would aid debugging.
2. to_dockerpy_kwargs does not forward container_ip (__init__.py:274-292)
SandboxContainerConfig carries container_ip but to_dockerpy_kwargs doesn't include it in the returned kwargs. The orchestrator path doesn't currently need static IPs (it uses _get_container_ip after start + update_session), so this is not a bug today. But it's a subtle gap between the CLI path (which passes --ip via build_sandbox_docker_cmd) and the docker-py path. If a future caller passes container_ip through the orchestrator path expecting it to be applied, it would be silently ignored.
— Authored by egg
This comment has been minimized.
This comment has been minimized.
|
@james-in-a-box address the non-blocking issues here #594 (review) |
|
Addressed the non-blocking advisory notes. Here's what changed: 1.
|
There was a problem hiding this comment.
Re-review: Unify orchestrator and CLI container launch via shared config builder
The branch has been rebased into a single squashed commit (7ef3fa5). Reviewed the full PR diff. Both non-blocking suggestions from the previous approval have been addressed.
Previous feedback — verification
#1 (_tee_stderr_to_file bare except Exception: pass): Fixed. The except clause at entrypoint.py:1319-1328 now prints a [DEBUG] message to stderr before swallowing the exception. Best-effort diagnostic failures are no longer completely silent.
#2 (to_dockerpy_kwargs does not forward container_ip): Fixed. When config.container_ip is set, to_dockerpy_kwargs at __init__.py:291-298 now produces a networking_config dict with EndpointsConfig containing the IPv4Address under the config's network name. Two new tests (test_container_ip_produces_networking_config, test_no_container_ip_omits_networking_config) cover both the positive and negative cases. The networking_config passes through docker_client.create_container()'s **kwargs and coexists correctly with the named network parameter in Docker SDK 7.x.
New code — no blocking issues found
Examined the full delta systematically. The implementation is clean.
Advisory notes (non-blocking):
1. _read_subprocess_stderr_tail race if tee thread join times out
In _run_with_stderr_capture (entrypoint.py:1355-1356), process.wait() followed by tee_thread.join(timeout=5) is correct — once the child exits, the write end of the stderr pipe closes, so readline() returns b"" and the thread exits the loop and flushes the ring buffer to disk. However, if the join times out (e.g., the thread is stuck in the disk write on a degraded filesystem), _read_subprocess_stderr_tail(30) at line 1362 may read a partially-written or absent log file. The impact is minor (truncated stderr context in the error signal, not data corruption) and the 5-second timeout is generous. Acceptable as-is.
2. volumes=volumes or {} in docker_client.create_container — vestigial parameter
docker_client.py:188 always passes volumes=volumes or {} to the Docker SDK. The orchestrator path now uses mounts (list format via to_dockerpy_kwargs), not volumes (dict format). Since to_dockerpy_kwargs doesn't include volumes in its kwargs, the SDK receives volumes={} alongside mounts=[...]. This works correctly — volumes={} is a no-op and mounts takes precedence. But the volumes parameter on create_container is now unused by the orchestrator path and could be cleaned up in a follow-up to reduce confusion about which mount API is canonical.
— Authored by egg
|
egg review completed. View run logs 4 previous review(s) hidden. |
Update documentation to reflect changes from PR #594 which unified orchestrator and CLI container launch via a shared config builder. Changes: - Add egg_container to STRUCTURE.md shared libraries section - Update shared/README.md with new egg_container functionality: - build_sandbox_config() framework-agnostic config builder - to_dockerpy_kwargs() adapter for docker-py SDK - git_shadow_mounts() for .git mount generation - SandboxContainerConfig and MountSpec dataclasses Triggered by: https://github.com/anthropics/egg/pull/594 Authored-by: egg
* docs: Update shared library docs for egg_container Update documentation to reflect changes from PR #594 which unified orchestrator and CLI container launch via a shared config builder. Changes: - Add egg_container to STRUCTURE.md shared libraries section - Update shared/README.md with new egg_container functionality: - build_sandbox_config() framework-agnostic config builder - to_dockerpy_kwargs() adapter for docker-py SDK - git_shadow_mounts() for .git mount generation - SandboxContainerConfig and MountSpec dataclasses Triggered by: https://github.com/anthropics/egg/pull/594 Authored-by: egg * Fix incorrect API usage examples in shared/README.md - build_sandbox_config(): Remove non-existent params repo_volumes and container_id, add required image param - build_sandbox_docker_cmd(): Use keyword-only args matching actual function signature instead of positional arg --------- Co-authored-by: jwbron <8340608+jwbron@users.noreply.github.com> Co-authored-by: egg-reviewer[bot] <261018737+egg-reviewer[bot]@users.noreply.github.com>
Summary
Fixes orchestrator pipeline container launch failures by unifying the container configuration path with the CLI. The orchestrator was building container configs independently, causing multiple networking, mount, and environment bugs that prevented containers from functioning correctly.
Commit 1: Fix pipeline container networking, logging, and contract init
egg-isolatedregardless of mode. Public-mode (issue) pipelines now useegg-externalwith correct gateway/orchestrator IPsCONTAINER_IDenv var, breaking gateway worktree path mapping (git wrapper sent "unknown" as container_id)EGG_PRIVATE_MODEenv var instead of inferring fromHTTPS_PROXYCommit 2: Fix worktree prune destroying gateway worktrees
StateStorecalledgit worktree pruneat init, which saw all gateway-managed worktrees as stale (orchestrator doesn't mount/home/egg/.egg-worktrees) and deleted their admin dirs_remove_stale_admin_dir()that only removes the state worktree's admin dirCommit 3: Unify orchestrator and CLI container launch via shared config builder
build_sandbox_config()+to_dockerpy_kwargs()inshared/egg_container/so both CLI and orchestrator produce identical container configurations.gitshadow mounts,--add-hostfor gateway hostname, DNS lockdown for private mode, proxy env vars, and correctGATEWAY_URLGATEWAY_URLbeing overwritten by the now-removedget_container_env()ingateway_client.pymountslist format (instead of legacyvolumesdict) to support multiple repos without key collision.gitshadow uses/dev/nullbind mount (file-over-file) for worktrees, fixing the "not a directory" OCI runtime error when tmpfs was mounted over a worktree.gitfilesandbox/egg_lib/runtime.pyto use sharedgit_shadow_mounts()Test plan
tests/shared/egg_container/test_config_builder.py)get_proxy_config/get_container_envtests)egg --local— verify CLI behavior unchanged.gitshadowing, gateway hostname resolution, proxy/DNS for private mode