Unify orchestrator and CLI container launch via shared config builder by jwbron · Pull Request #594 · jwbron/egg

jwbron · 2026-02-13T03:36:22Z

Summary

Fixes orchestrator pipeline container launch failures by unifying the container configuration path with the CLI. The orchestrator was building container configs independently, causing multiple networking, mount, and environment bugs that prevented containers from functioning correctly.

Commit 1: Fix pipeline container networking, logging, and contract init

Network mode: Containers always launched on egg-isolated regardless of mode. Public-mode (issue) pipelines now use egg-external with correct gateway/orchestrator IPs
Missing CONTAINER_ID: Orchestrator never set CONTAINER_ID env var, breaking gateway worktree path mapping (git wrapper sent "unknown" as container_id)
Contract init: Failures were silently swallowed, leaving pipelines without contracts. Now fails fast with cleanup
Subprocess stderr captured to log file; container log tail increased to 200 lines
Entrypoint reads EGG_PRIVATE_MODE env var instead of inferring from HTTPS_PROXY

Commit 2: Fix worktree prune destroying gateway worktrees

StateStore called git worktree prune at init, which saw all gateway-managed worktrees as stale (orchestrator doesn't mount /home/egg/.egg-worktrees) and deleted their admin dirs
Replaced with targeted _remove_stale_admin_dir() that only removes the state worktree's admin dir

Commit 3: Unify orchestrator and CLI container launch via shared config builder

Introduces build_sandbox_config() + to_dockerpy_kwargs() in shared/egg_container/ so both CLI and orchestrator produce identical container configurations
Fixes containers missing: .git shadow mounts, --add-host for gateway hostname, DNS lockdown for private mode, proxy env vars, and correct GATEWAY_URL
Fixes GATEWAY_URL being overwritten by the now-removed get_container_env() in gateway_client.py
Uses Docker API mounts list format (instead of legacy volumes dict) to support multiple repos without key collision
.git shadow uses /dev/null bind mount (file-over-file) for worktrees, fixing the "not a directory" OCI runtime error when tmpfs was mounted over a worktree .git file
Refactors sandbox/egg_lib/runtime.py to use shared git_shadow_mounts()

Test plan

42 new config builder tests pass (tests/shared/egg_container/test_config_builder.py)
27 existing CLI command builder tests pass (no regressions)
31 gateway client tests pass (removed obsolete get_proxy_config/get_container_env tests)
210 non-docker-dependent orchestrator tests pass (updated spawner tests for new signature)
Deploy and run egg --local — verify CLI behavior unchanged
Deploy and run pipeline via orchestrator — verify .git shadowing, gateway hostname resolution, proxy/DNS for private mode

… init Three root causes fixed for pipeline container failures: 1. **Network mode**: Containers always launched on egg-isolated network regardless of mode. Public-mode (issue) pipelines now use egg-external network with correct gateway/orchestrator IPs. 2. **Missing CONTAINER_ID**: Orchestrator never set CONTAINER_ID env var in spawned containers, breaking gateway worktree path mapping (git wrapper sent "unknown" as container_id). 3. **Contract init**: Contract creation failures were silently swallowed, leaving pipelines without contracts. Now fails fast with cleanup. Additional improvements: - Subprocess stderr captured to log file for error signal context - Container log tail increased from 50 to 200 lines - Entrypoint reads EGG_PRIVATE_MODE env var instead of inferring from HTTPS_PROXY presence Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

The orchestrator's StateStore called `git worktree prune` when initializing its state worktree. Since the orchestrator container doesn't have /home/egg/.egg-worktrees mounted, prune saw all gateway-managed worktrees as stale and deleted their admin dirs. This broke all container git operations with: fatal: not a git repository: .git/worktrees/egg Replace `git worktree prune` with _remove_stale_admin_dir() that only removes the admin dir belonging to the state worktree, leaving gateway-managed worktree admin dirs intact. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

The orchestrator was building container configs independently from the CLI, causing missing .git shadowing, no --add-host for gateway hostname, GATEWAY_URL overwritten by gateway_client, no DNS lockdown for private mode, and missing proxy env vars. This commit introduces a shared config builder (build_sandbox_config + to_dockerpy_kwargs) so both CLI and orchestrator produce identical container configurations. Key changes: - Add MountSpec, SandboxContainerConfig dataclasses and build_sandbox_config() builder to shared/egg_container - Add git_shadow_mounts() for .git shadow mount generation (uses /dev/null bind mount for worktrees, tmpfs for regular repos) - Add to_dockerpy_kwargs() adapter using Docker API mounts list format to support multiple repos with same source (no dict key collision) - Refactor build_sandbox_docker_cmd() to use build_sandbox_config() internally - Refactor orchestrator container_spawner to use shared config builder - Remove get_container_env() and get_proxy_config() from gateway_client (subsumed by shared config builder) - Refactor runtime.py to use shared git_shadow_mounts() - Add egg_container to orchestrator Dockerfile COPY - Add comprehensive tests (42 new tests for config builder) - Update orchestrator tests for new spawn_agent_container() signature Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

egg-reviewer

Review: Unify orchestrator and CLI container launch via shared config builder

Good architectural direction. Eliminating the divergence between CLI and orchestrator container configurations by extracting a shared config builder addresses a real class of bugs. The git_shadow_mounts, build_sandbox_config, and to_dockerpy_kwargs abstractions are clean and well-tested. The worktree prune fix in state_store.py is an important correctness fix.

Several issues to address:

1. Bug: `EGG_PRIVATE_MODE` vs `PRIVATE_MODE` env var mismatch

sandbox/entrypoint.py:902 — The entrypoint reads EGG_PRIVATE_MODE to detect private mode:

private_mode_env = os.environ.get("EGG_PRIVATE_MODE", "").lower()

But the shared config builder (shared/egg_container/__init__.py:203) sets PRIVATE_MODE (without the EGG_ prefix):

env["PRIVATE_MODE"] = "true"

So containers launched via the new shared builder will never have EGG_PRIVATE_MODE set. The entrypoint's new detection code will always fall through to the legacy fallback (proxy presence inference), which defeats the purpose of making detection explicit.

The sandbox_template.py:137 (not changed in this PR) also sets EGG_PRIVATE_MODE, confirming that's the expected env var name inside containers. The PRIVATE_MODE (no prefix) is used by the gateway process itself, not by sandbox containers.

Fix: Either change the config builder to set EGG_PRIVATE_MODE in addition to (or instead of) PRIVATE_MODE, or change the entrypoint to read PRIVATE_MODE. The former is safer since sandbox_template.py already uses EGG_PRIVATE_MODE.

2. Bug: Duplicate `host_uid`/`host_gid` parsing in `container_spawner.py`

container_spawner.py:273-274 parses HOST_UID/HOST_GID inside the gateway session registration block:

host_uid = int(os.environ.get("HOST_UID", 1000))
host_gid = int(os.environ.get("HOST_GID", 1000))

Then container_spawner.py:305-306 parses the same values again outside that block:

host_uid = int(os.environ.get("HOST_UID", 1000))
host_gid = int(os.environ.get("HOST_GID", 1000))

The first parse at line 273 is inside a try block that can be skipped on GatewayError. If the gateway registration fails, the first parse is wasted. The second parse at line 305 always runs. This is harmless but is clearly a copy-paste artifact — the first occurrence should be removed, or the parsing should be hoisted above both blocks.

3. `mount_spec_to_cli_args` silently drops volume-type mounts

shared/egg_container/__init__.py:127-140 — The function handles bind and tmpfs mount types, but returns [] for volume type mounts. The build_sandbox_docker_cmd CLI path iterates through config.mounts and calls mount_spec_to_cli_args for each — any volume mounts would be silently dropped.

Currently the CLI path doesn't pass mounts to build_sandbox_config, so this isn't triggered yet. But it's a latent bug since MountSpec supports mount_type="volume" and the orchestrator path handles it. If a caller ever passes a volume mount through the CLI path, it would silently vanish.

Suggestion: Either handle the volume case (--mount type=volume,source=...,destination=...) or raise an error for unsupported types.

4. `_tee_stderr_to_file` unbounded file growth

sandbox/entrypoint.py:1290-1309 — The tee function writes every stderr line to /tmp/egg-subprocess-stderr.log without any size limit. Long-running containers producing substantial stderr output will fill /tmp. Only the tail is ever read back (_read_subprocess_stderr_tail), so the intermediate data is waste.

Consider either:

Using a ring buffer (e.g., collections.deque(maxlen=N)) and writing only at the end
Truncating/rotating the file periodically
Since you only need the last N lines, keeping a bounded in-memory buffer and writing to disk at process exit would be simpler and safer

5. Minor: `_SUBPROCESS_STDERR_LOG` placement breaks import readability

sandbox/entrypoint.py:34-35:

_SUBPROCESS_STDERR_LOG = Path("/tmp/egg-subprocess-stderr.log")
from typing import Any, ClassVar

A module-level assignment is wedged between import blocks. Move it after all imports for clarity.

6. Signal handling interaction with stderr PIPE

sandbox/entrypoint.py:1322-1337 — With stderr=subprocess.PIPE, if the parent process receives a signal (SIGTERM during container shutdown), process.wait() may return while the tee thread is still reading from the pipe. The tee_thread.join(timeout=5) handles this, but if the child writes a burst of stderr just before termination, up to 5 seconds of shutdown delay is added. This may be acceptable, but worth noting that the previous subprocess.run(stderr=sys.stderr) approach had zero shutdown overhead for this case.

7. Contract creation failure handling — error message leaks internal details

orchestrator/routes/pipelines.py:370-373 and pipelines.py:441-444 — On contract creation failure, the error response includes the raw exception:

return make_error_response(
    f"Failed to create contract: {contract_err}",
    status_code=500,
)

This is an improvement over silently continuing with a broken pipeline, which is good. But the raw exception string may contain internal paths or stack details. If this endpoint is exposed beyond internal orchestrator use, consider sanitizing the error message.

8. `assume_worktree` docstring is misleading

shared/egg_container/__init__.py:84-85 — The docstring says:

assume_worktree: If True, always use tmpfs

But the actual behavior when assume_worktree=True is to use /dev/null bind mounts (file-over-file), not tmpfs. The code comment at line 95-99 correctly explains the reasoning. The docstring should be updated to match.

Overall the design is sound — the shared config builder is a meaningful consolidation that eliminates a real category of configuration drift bugs. The state_store.py worktree prune fix correctly addresses a subtle cross-container filesystem visibility issue. Items 1 and 8 need fixing before merge; the rest are improvements.

— Authored by egg

jwbron · 2026-02-13T03:50:47Z

@james-in-a-box address the above feedback

james-in-a-box · 2026-02-13T03:56:54Z

Addressed the review feedback. Here's what changed:

1. Bug fix: `EGG_PRIVATE_MODE` env var mismatch

The config builder now sets both PRIVATE_MODE (for the gateway process) and EGG_PRIVATE_MODE (for sandbox containers) in both private and public modes. This ensures the entrypoint's explicit detection at sandbox/entrypoint.py:902 works correctly instead of falling through to the legacy proxy-presence inference.

2. Bug fix: Duplicate `host_uid`/`host_gid` parsing

Hoisted host_uid/host_gid parsing above both the gateway registration try block and the config builder call in container_spawner.py, eliminating the redundant parse at the old line 305.

3. `mount_spec_to_cli_args` now handles volume mounts

Added volume type support (--mount type=volume,source=...,destination=...) and raises ValueError for genuinely unsupported mount types instead of silently returning []. Added 3 new tests covering volume mount (readonly/readwrite) and the error case.

4. `_tee_stderr_to_file` bounded

Replaced the unbounded file write with a collections.deque(maxlen=500) ring buffer. Stderr is still tee'd to sys.stderr in real time, but only the last 500 lines are written to disk when the stream ends. This prevents /tmp exhaustion on long-running containers.

5. `_SUBPROCESS_STDERR_LOG` moved after imports

Moved the module-level constant from between import blocks to after all imports for readability.

6–7. Acknowledged

Items 6 (signal handling / 5s shutdown delay) and 7 (error message sanitization on internal endpoint) are noted. No code changes — the current behavior is acceptable for both.

8. `assume_worktree` docstring corrected

Updated from "always use tmpfs" to accurately describe the /dev/null bind mount (file-over-file) behavior.

Test results

45 config builder tests pass (3 new)
27 CLI command builder tests pass
31 gateway client tests pass
60 entrypoint tests pass

— Authored by egg

egg-reviewer

Re-review: Unify orchestrator and CLI container launch via shared config builder

All eight items from the previous review have been addressed. The fixes are substantive, not superficial. One new issue found in the cleanup path.

Previous feedback — verification

#1 (Bug: EGG_PRIVATE_MODE mismatch): Fixed. build_sandbox_config() now sets both PRIVATE_MODE and EGG_PRIVATE_MODE for both private and public modes (__init__.py:215-216 and __init__.py:224-225). The entrypoint at entrypoint.py:903 reads EGG_PRIVATE_MODE and will now get the correct value. Good.

#2 (Bug: Duplicate host_uid/host_gid parsing): Fixed. Parsing is hoisted to container_spawner.py:265-266, above both the gateway try block and the config builder call. The redundant parse is removed.

#3 (mount_spec_to_cli_args volume support): Fixed. Volume type is now handled at __init__.py:143-148, and unrecognized types raise ValueError at line 152. Three new tests cover this (test_volume_mount, test_volume_mount_readwrite, test_unsupported_mount_type_raises).

#4 (_tee_stderr_to_file bounded): Fixed. Uses collections.deque(maxlen=500) ring buffer at entrypoint.py:1307. Only the tail is written to disk when the stream ends. The unbounded growth issue is resolved.

#5 (_SUBPROCESS_STDERR_LOG placement): Fixed. Moved to entrypoint.py:37-38, after all imports and before the first section header. Clean.

#6 (Signal handling / shutdown delay): Acknowledged, no change needed. Acceptable.

#7 (Error message sanitization): Acknowledged, no change needed. Acceptable for internal endpoint.

#8 (assume_worktree docstring): Fixed. The git_shadow_mounts docstring at __init__.py:84-87 now correctly describes /dev/null bind mounts (file-over-file) behavior.

New issues

1. Bug: Pipeline cleanup on contract failure does not commit deletion

pipelines.py:365 and pipelines.py:436 — When contract creation fails, the cleanup calls store.delete_pipeline(pipeline.id, commit=False). But create_pipeline() (called just above) commits the pipeline file to the git worktree by default. So the deletion only removes the file from the filesystem — it is not committed to git.

On the next orchestrator restart (or any call that re-reads the worktree from git), the zombie pipeline reappears without a contract. The commit=False argument was presumably intended to avoid noisy commits during an error path, but it defeats the cleanup.

Fix: Pass commit=True (or omit commit to use the default), so the deletion is actually persisted. For issue pipelines in particular, delete_pipeline already commits when commit=True and the pipeline is not local-prefixed.

2. Stale comment: `assume_worktree=True (tmpfs)` in `container_spawner.py`

container_spawner.py:248:

# Orchestrator can't stat host paths, so assume_worktree=True (tmpfs).

This is left over from before the docstring fix. assume_worktree=True produces /dev/null bind mounts (file-over-file), not tmpfs. The parenthetical should say (/dev/null bind) or simply be removed.

Item 1 is a correctness bug that should be fixed before merge. Item 2 is minor.

— Authored by egg

Three bugs caused failures when the orchestrator spawns multiple containers (coder + reviewer) in the refine phase: 1. Session-container IP binding: register_session() hardcodes EGG_CONTAINER_IP but Docker assigns actual IPs dynamically. After container start, update the gateway session with the actual IP so the gateway accepts git operations from reviewer containers. 2. AgentRole mapping crash: REVIEWER and CHECKER roles have no ContractAgentRole equivalent. Changed map_agent_role_to_contract_role to return Optional and skip contract updates for unmappable roles, preventing KeyError on completion signals. 3. Decision queue race condition: get_decision() and get_pending_decisions() loaded pipeline state without holding the lock, allowing another thread to overwrite state between queue_decision() and wait_for_decision(). Added lock acquisition to both methods. Authored-by: egg

egg-reviewer

Re-review: Unify orchestrator and CLI container launch via shared config builder

The branch has been rebased into a single commit (99ad2f2 Fix multi-agent pipeline bugs for refine phase). Reviewed the full PR diff against main.

Previous feedback — verification

Original review items #1–#8 were addressed in the prior round. The second review at ce25386 found two remaining issues. Neither has been fixed.

1. Bug (unfixed from prior review): `delete_pipeline(commit=False)` leaves zombie pipelines in git

pipelines.py:365 and pipelines.py:436 — When contract creation fails, cleanup calls:

store.delete_pipeline(pipeline.id, commit=False)

create_pipeline() at state_store.py:484 calls save_pipeline(pipeline, message=commit_msg) with default commit=True, which commits the pipeline file to the git worktree. The subsequent delete_pipeline(..., commit=False) only calls path.unlink() — it removes the file from disk but does NOT commit the deletion.

On orchestrator restart (which re-reads state from git), the pipeline file reappears as a zombie without a contract. For issue pipelines specifically, is_local is False and should_commit = commit and (not is_local or force_commit) = False and True = False, confirming the deletion is not committed.

Fix: Pass commit=True (or omit the commit kwarg) so the deletion is persisted to git. The save_pipeline that created the pipeline already committed, so the deletion must also commit to undo it.

2. Stale comment (unfixed from prior review): `assume_worktree=True (tmpfs)`

container_spawner.py:248:

# Orchestrator can't stat host paths, so assume_worktree=True (tmpfs).

assume_worktree=True produces /dev/null bind mounts (file-over-file), not tmpfs. The git_shadow_mounts docstring at __init__.py:84-87 was corrected in the prior round, but this caller-side comment was not updated.

3. Stale docstring: `_spawn_and_wait` says "tmpfs" for .git shadow

pipelines.py:1304-1305:

    with .git shadowed by tmpfs to force gateway git operations.

Same issue as #2. The orchestrator path uses assume_worktree=True, which produces /dev/null bind mounts, not tmpfs.

4. `complete_agent` silently drops outputs for non-contract roles

dispatch.py:189-199 — When contract_role is None (REVIEWER, CHECKER), save_agent_output is skipped:

if contract_role is not None:
    self.contract_orchestrator.complete_agent(...)
    if outputs:
        save_agent_output(self.repo_path, contract_role, outputs)

The save_agent_output call is nested inside the if contract_role is not None block, so any outputs passed for a REVIEWER or CHECKER completion are silently discarded. If this is intentional (reviewers/checkers store verdicts elsewhere), add a comment explaining why. If not, the save_agent_output should be moved outside the if block, though it would need a non-None key to save under.

Item 1 is a correctness bug that should be fixed before merge. Items 2–3 are documentation bugs. Item 4 needs clarification.

— Authored by egg

jwbron · 2026-02-13T04:08:02Z

@james-in-a-box address this feedback: #594 (review)

james-in-a-box · 2026-02-13T04:13:55Z

Addressed the review feedback. Here's what changed:

1. Bug fix: `delete_pipeline(commit=False)` leaves zombie pipelines in git

Changed both contract failure cleanup paths (pipelines.py:365 and pipelines.py:436) from store.delete_pipeline(pipeline.id, commit=False) to store.delete_pipeline(pipeline.id) (default commit=True).

For issue pipelines, create_pipeline() commits the pipeline file to git via save_pipeline(commit=True). With commit=False on deletion, only the filesystem file was removed — on orchestrator restart the pipeline would reappear from git as a zombie without a contract. Now the deletion is committed, matching the creation.

For local pipelines, save_pipeline skips git commit unless force_commit=True, and delete_pipeline with commit=True also skips commit for local pipelines (via the is_local check), so the behavior is correct either way — but using the default is cleaner and consistent.

2. Stale comment: `assume_worktree=True (tmpfs)` in `container_spawner.py`

Updated container_spawner.py:248 from "(tmpfs)" to "(/dev/null bind)" to match the actual file-over-file bind mount behavior.

3. Stale docstring: `_spawn_and_wait` says "tmpfs" for .git shadow

Updated pipelines.py:1305 from "shadowed by tmpfs" to "shadowed by /dev/null bind mounts".

4. `complete_agent` silently drops outputs for non-contract roles

Intentional behavior. Added a comment at dispatch.py:197-200 explaining why: REVIEWER and CHECKER roles don't have contract counterparts (their map_agent_role_to_contract_role returns None), so there's no contract role key to save outputs under. Their verdicts are stored in the AgentExecution record returned by complete_agent, not as contract outputs. save_agent_output requires a ContractAgentRole which doesn't exist for these roles.

Test results

45 config builder tests pass
30 container spawner tests pass
31 gateway client tests pass
60 entrypoint tests pass (1 skipped)
1 pre-existing failure in test_state_store.py::TestStateValidation::test_load_invalid_schema_fails (unrelated)

— Authored by egg

egg-reviewer

Re-review: Unify orchestrator and CLI container launch via shared config builder

All four items from the previous review have been properly addressed.

Previous feedback — verification

#1 (Bug: delete_pipeline(commit=False) zombie pipelines): Fixed. Both cleanup paths (pipelines.py:365 and pipelines.py:436) now call store.delete_pipeline(pipeline.id) with the default commit=True. For issue pipelines (is_local=False), delete_pipeline evaluates should_commit = True and (not False or False) = True, so the deletion is committed to git, matching the save_pipeline commit at creation. For local pipelines, should_commit evaluates to False, which is consistent with save_pipeline also skipping git commits for local pipelines (both disk-only). No zombie possible in either path.

#2 (Stale comment: assume_worktree=True (tmpfs)): Fixed. container_spawner.py:248 now reads (/dev/null bind).

#3 (Stale docstring: _spawn_and_wait says "tmpfs"): Fixed. pipelines.py:1305 now reads /dev/null bind mounts.

#4 (complete_agent silently drops outputs for non-contract roles): Addressed with explanatory comment at dispatch.py:197-200 explaining that REVIEWER/CHECKER roles don't have contract counterparts and their verdicts are stored in the AgentExecution record. Clear and correct.

New code — no blocking issues found

Examined the full delta. The changes are clean and well-scoped. A few minor advisory notes:

1. _tee_stderr_to_file bare except Exception: pass (entrypoint.py:1319)

The entire tee thread body is wrapped in except Exception: pass. If the tee thread fails silently (e.g., sys.stderr.buffer write fails), stderr output is lost and _SUBPROCESS_STDERR_LOG is never written. The process continues fine, but error signals to the orchestrator will have no stderr context. Acceptable for now given this is best-effort diagnostic data, but a logger.debug in the except clause would aid debugging.

2. to_dockerpy_kwargs does not forward container_ip (__init__.py:274-292)

SandboxContainerConfig carries container_ip but to_dockerpy_kwargs doesn't include it in the returned kwargs. The orchestrator path doesn't currently need static IPs (it uses _get_container_ip after start + update_session), so this is not a bug today. But it's a subtle gap between the CLI path (which passes --ip via build_sandbox_docker_cmd) and the docker-py path. If a future caller passes container_ip through the orchestrator path expecting it to be applied, it would be silently ignored.

— Authored by egg

jwbron · 2026-02-13T04:19:35Z

@james-in-a-box address the non-blocking issues here #594 (review)

… forwarding

james-in-a-box · 2026-02-13T04:24:36Z

Addressed the non-blocking advisory notes. Here's what changed:

1. `_tee_stderr_to_file` bare `except Exception: pass` (`entrypoint.py:1319`)

Replaced with except Exception as exc: that prints a debug message to stderr before swallowing the error. The tee thread is still best-effort (it won't crash the container), but failures are no longer completely silent — the exception is logged to stderr for post-mortem diagnosis.

2. `to_dockerpy_kwargs` now forwards `container_ip` (`init.py:293-300`)

When config.container_ip is set, to_dockerpy_kwargs now includes a networking_config dict with the Docker Engine API's EndpointsConfig format, specifying the IPv4Address under the config's network. This closes the gap between the CLI path (which uses --ip) and the docker-py path. Two new tests cover the positive and negative cases.

Test results

47 config builder tests pass (2 new)
60 entrypoint tests pass (1 skipped, pre-existing)

— Authored by egg

egg-reviewer

Re-review: Unify orchestrator and CLI container launch via shared config builder

The branch has been rebased into a single squashed commit (7ef3fa5). Reviewed the full PR diff. Both non-blocking suggestions from the previous approval have been addressed.

Previous feedback — verification

#1 (_tee_stderr_to_file bare except Exception: pass): Fixed. The except clause at entrypoint.py:1319-1328 now prints a [DEBUG] message to stderr before swallowing the exception. Best-effort diagnostic failures are no longer completely silent.

#2 (to_dockerpy_kwargs does not forward container_ip): Fixed. When config.container_ip is set, to_dockerpy_kwargs at __init__.py:291-298 now produces a networking_config dict with EndpointsConfig containing the IPv4Address under the config's network name. Two new tests (test_container_ip_produces_networking_config, test_no_container_ip_omits_networking_config) cover both the positive and negative cases. The networking_config passes through docker_client.create_container()'s **kwargs and coexists correctly with the named network parameter in Docker SDK 7.x.

New code — no blocking issues found

Examined the full delta systematically. The implementation is clean.

Advisory notes (non-blocking):

1. _read_subprocess_stderr_tail race if tee thread join times out

In _run_with_stderr_capture (entrypoint.py:1355-1356), process.wait() followed by tee_thread.join(timeout=5) is correct — once the child exits, the write end of the stderr pipe closes, so readline() returns b"" and the thread exits the loop and flushes the ring buffer to disk. However, if the join times out (e.g., the thread is stuck in the disk write on a degraded filesystem), _read_subprocess_stderr_tail(30) at line 1362 may read a partially-written or absent log file. The impact is minor (truncated stderr context in the error signal, not data corruption) and the 5-second timeout is generous. Acceptable as-is.

2. volumes=volumes or {} in docker_client.create_container — vestigial parameter

docker_client.py:188 always passes volumes=volumes or {} to the Docker SDK. The orchestrator path now uses mounts (list format via to_dockerpy_kwargs), not volumes (dict format). Since to_dockerpy_kwargs doesn't include volumes in its kwargs, the SDK receives volumes={} alongside mounts=[...]. This works correctly — volumes={} is a no-op and mounts takes precedence. But the volumes parameter on create_container is now unused by the orchestrator path and could be cleaned up in a follow-up to reduce confusion about which mount API is canonical.

— Authored by egg

james-in-a-box · 2026-02-13T04:33:12Z

egg review completed. View run logs

4 previous review(s) hidden.

Update documentation to reflect changes from PR #594 which unified orchestrator and CLI container launch via a shared config builder. Changes: - Add egg_container to STRUCTURE.md shared libraries section - Update shared/README.md with new egg_container functionality: - build_sandbox_config() framework-agnostic config builder - to_dockerpy_kwargs() adapter for docker-py SDK - git_shadow_mounts() for .git mount generation - SandboxContainerConfig and MountSpec dataclasses Triggered by: https://github.com/anthropics/egg/pull/594 Authored-by: egg

* docs: Update shared library docs for egg_container Update documentation to reflect changes from PR #594 which unified orchestrator and CLI container launch via a shared config builder. Changes: - Add egg_container to STRUCTURE.md shared libraries section - Update shared/README.md with new egg_container functionality: - build_sandbox_config() framework-agnostic config builder - to_dockerpy_kwargs() adapter for docker-py SDK - git_shadow_mounts() for .git mount generation - SandboxContainerConfig and MountSpec dataclasses Triggered by: https://github.com/anthropics/egg/pull/594 Authored-by: egg * Fix incorrect API usage examples in shared/README.md - build_sandbox_config(): Remove non-existent params repo_volumes and container_id, add required image param - build_sandbox_docker_cmd(): Use keyword-only args matching actual function signature instead of positional arg --------- Co-authored-by: jwbron <8340608+jwbron@users.noreply.github.com> Co-authored-by: egg-reviewer[bot] <261018737+egg-reviewer[bot]@users.noreply.github.com>

jwbron and others added 3 commits February 12, 2026 18:25

egg-reviewer bot suggested changes Feb 13, 2026

View reviewed changes

This comment has been minimized.

Sign in to view

Address review feedback on PR #594

ce25386

egg-reviewer bot suggested changes Feb 13, 2026

View reviewed changes

This comment has been minimized.

Sign in to view

egg-reviewer bot suggested changes Feb 13, 2026

View reviewed changes

This comment has been minimized.

Sign in to view

Address review: fix zombie pipeline cleanup, stale tmpfs comments

db15451

egg-reviewer bot approved these changes Feb 13, 2026

View reviewed changes

This comment has been minimized.

Sign in to view

Address non-blocking review feedback: stderr logging and container_ip…

7ef3fa5

… forwarding

egg-reviewer bot approved these changes Feb 13, 2026

View reviewed changes

jwbron merged commit d9d3e23 into main Feb 13, 2026
15 checks passed

james-in-a-box bot mentioned this pull request Feb 13, 2026

docs: Update egg_container shared library docs [doc-updater] #597

Merged

Conversation

jwbron commented Feb 13, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Commit 1: Fix pipeline container networking, logging, and contract init

Commit 2: Fix worktree prune destroying gateway worktrees

Commit 3: Unify orchestrator and CLI container launch via shared config builder

Test plan

Uh oh!

egg-reviewer bot left a comment

Choose a reason for hiding this comment

Review: Unify orchestrator and CLI container launch via shared config builder

1. Bug: EGG_PRIVATE_MODE vs PRIVATE_MODE env var mismatch

2. Bug: Duplicate host_uid/host_gid parsing in container_spawner.py

3. mount_spec_to_cli_args silently drops volume-type mounts

4. _tee_stderr_to_file unbounded file growth

5. Minor: _SUBPROCESS_STDERR_LOG placement breaks import readability

6. Signal handling interaction with stderr PIPE

7. Contract creation failure handling — error message leaks internal details

8. assume_worktree docstring is misleading

Uh oh!

This comment has been minimized.

jwbron commented Feb 13, 2026

Uh oh!

james-in-a-box bot commented Feb 13, 2026

1. Bug fix: EGG_PRIVATE_MODE env var mismatch

2. Bug fix: Duplicate host_uid/host_gid parsing

3. mount_spec_to_cli_args now handles volume mounts

4. _tee_stderr_to_file bounded

5. _SUBPROCESS_STDERR_LOG moved after imports

6–7. Acknowledged

8. assume_worktree docstring corrected

Test results

Uh oh!

egg-reviewer bot left a comment

Choose a reason for hiding this comment

Re-review: Unify orchestrator and CLI container launch via shared config builder

Previous feedback — verification

New issues

1. Bug: Pipeline cleanup on contract failure does not commit deletion

2. Stale comment: assume_worktree=True (tmpfs) in container_spawner.py

Uh oh!

This comment has been minimized.

egg-reviewer bot left a comment

Choose a reason for hiding this comment

Re-review: Unify orchestrator and CLI container launch via shared config builder

Previous feedback — verification

1. Bug (unfixed from prior review): delete_pipeline(commit=False) leaves zombie pipelines in git

2. Stale comment (unfixed from prior review): assume_worktree=True (tmpfs)

3. Stale docstring: _spawn_and_wait says "tmpfs" for .git shadow

4. complete_agent silently drops outputs for non-contract roles

Uh oh!

jwbron commented Feb 13, 2026

Uh oh!

This comment has been minimized.

james-in-a-box bot commented Feb 13, 2026

1. Bug fix: delete_pipeline(commit=False) leaves zombie pipelines in git

2. Stale comment: assume_worktree=True (tmpfs) in container_spawner.py

3. Stale docstring: _spawn_and_wait says "tmpfs" for .git shadow

4. complete_agent silently drops outputs for non-contract roles

Test results

Uh oh!

egg-reviewer bot left a comment

Choose a reason for hiding this comment

Re-review: Unify orchestrator and CLI container launch via shared config builder

Previous feedback — verification

New code — no blocking issues found

Uh oh!

This comment has been minimized.

jwbron commented Feb 13, 2026

Uh oh!

james-in-a-box bot commented Feb 13, 2026

1. _tee_stderr_to_file bare except Exception: pass (entrypoint.py:1319)

2. to_dockerpy_kwargs now forwards container_ip (__init__.py:293-300)

Test results

Uh oh!

egg-reviewer bot left a comment

Choose a reason for hiding this comment

Re-review: Unify orchestrator and CLI container launch via shared config builder

Previous feedback — verification

jwbron commented Feb 13, 2026 •

edited

Loading

1. Bug: `EGG_PRIVATE_MODE` vs `PRIVATE_MODE` env var mismatch

2. Bug: Duplicate `host_uid`/`host_gid` parsing in `container_spawner.py`

3. `mount_spec_to_cli_args` silently drops volume-type mounts

4. `_tee_stderr_to_file` unbounded file growth

5. Minor: `_SUBPROCESS_STDERR_LOG` placement breaks import readability

8. `assume_worktree` docstring is misleading

1. Bug fix: `EGG_PRIVATE_MODE` env var mismatch

2. Bug fix: Duplicate `host_uid`/`host_gid` parsing

3. `mount_spec_to_cli_args` now handles volume mounts

4. `_tee_stderr_to_file` bounded

5. `_SUBPROCESS_STDERR_LOG` moved after imports

8. `assume_worktree` docstring corrected

2. Stale comment: `assume_worktree=True (tmpfs)` in `container_spawner.py`

1. Bug (unfixed from prior review): `delete_pipeline(commit=False)` leaves zombie pipelines in git

2. Stale comment (unfixed from prior review): `assume_worktree=True (tmpfs)`

3. Stale docstring: `_spawn_and_wait` says "tmpfs" for .git shadow

4. `complete_agent` silently drops outputs for non-contract roles

1. Bug fix: `delete_pipeline(commit=False)` leaves zombie pipelines in git

2. Stale comment: `assume_worktree=True (tmpfs)` in `container_spawner.py`

3. Stale docstring: `_spawn_and_wait` says "tmpfs" for .git shadow

4. `complete_agent` silently drops outputs for non-contract roles

1. `_tee_stderr_to_file` bare `except Exception: pass` (`entrypoint.py:1319`)

2. `to_dockerpy_kwargs` now forwards `container_ip` (`init.py:293-300`)