Skip to content

Unify orchestrator and CLI container launch via shared config builder#594

Merged
jwbron merged 7 commits intomainfrom
fix/orchestrator-pipeline-networking
Feb 13, 2026
Merged

Unify orchestrator and CLI container launch via shared config builder#594
jwbron merged 7 commits intomainfrom
fix/orchestrator-pipeline-networking

Conversation

@jwbron
Copy link
Owner

@jwbron jwbron commented Feb 13, 2026

Summary

Fixes orchestrator pipeline container launch failures by unifying the container configuration path with the CLI. The orchestrator was building container configs independently, causing multiple networking, mount, and environment bugs that prevented containers from functioning correctly.

Commit 1: Fix pipeline container networking, logging, and contract init

  • Network mode: Containers always launched on egg-isolated regardless of mode. Public-mode (issue) pipelines now use egg-external with correct gateway/orchestrator IPs
  • Missing CONTAINER_ID: Orchestrator never set CONTAINER_ID env var, breaking gateway worktree path mapping (git wrapper sent "unknown" as container_id)
  • Contract init: Failures were silently swallowed, leaving pipelines without contracts. Now fails fast with cleanup
  • Subprocess stderr captured to log file; container log tail increased to 200 lines
  • Entrypoint reads EGG_PRIVATE_MODE env var instead of inferring from HTTPS_PROXY

Commit 2: Fix worktree prune destroying gateway worktrees

  • StateStore called git worktree prune at init, which saw all gateway-managed worktrees as stale (orchestrator doesn't mount /home/egg/.egg-worktrees) and deleted their admin dirs
  • Replaced with targeted _remove_stale_admin_dir() that only removes the state worktree's admin dir

Commit 3: Unify orchestrator and CLI container launch via shared config builder

  • Introduces build_sandbox_config() + to_dockerpy_kwargs() in shared/egg_container/ so both CLI and orchestrator produce identical container configurations
  • Fixes containers missing: .git shadow mounts, --add-host for gateway hostname, DNS lockdown for private mode, proxy env vars, and correct GATEWAY_URL
  • Fixes GATEWAY_URL being overwritten by the now-removed get_container_env() in gateway_client.py
  • Uses Docker API mounts list format (instead of legacy volumes dict) to support multiple repos without key collision
  • .git shadow uses /dev/null bind mount (file-over-file) for worktrees, fixing the "not a directory" OCI runtime error when tmpfs was mounted over a worktree .git file
  • Refactors sandbox/egg_lib/runtime.py to use shared git_shadow_mounts()

Test plan

  • 42 new config builder tests pass (tests/shared/egg_container/test_config_builder.py)
  • 27 existing CLI command builder tests pass (no regressions)
  • 31 gateway client tests pass (removed obsolete get_proxy_config/get_container_env tests)
  • 210 non-docker-dependent orchestrator tests pass (updated spawner tests for new signature)
  • Deploy and run egg --local — verify CLI behavior unchanged
  • Deploy and run pipeline via orchestrator — verify .git shadowing, gateway hostname resolution, proxy/DNS for private mode

jwbron and others added 3 commits February 12, 2026 18:25
… init

Three root causes fixed for pipeline container failures:

1. **Network mode**: Containers always launched on egg-isolated network
   regardless of mode. Public-mode (issue) pipelines now use egg-external
   network with correct gateway/orchestrator IPs.

2. **Missing CONTAINER_ID**: Orchestrator never set CONTAINER_ID env var
   in spawned containers, breaking gateway worktree path mapping (git
   wrapper sent "unknown" as container_id).

3. **Contract init**: Contract creation failures were silently swallowed,
   leaving pipelines without contracts. Now fails fast with cleanup.

Additional improvements:
- Subprocess stderr captured to log file for error signal context
- Container log tail increased from 50 to 200 lines
- Entrypoint reads EGG_PRIVATE_MODE env var instead of inferring from
  HTTPS_PROXY presence

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The orchestrator's StateStore called `git worktree prune` when
initializing its state worktree. Since the orchestrator container
doesn't have /home/egg/.egg-worktrees mounted, prune saw all
gateway-managed worktrees as stale and deleted their admin dirs.
This broke all container git operations with:
  fatal: not a git repository: .git/worktrees/egg

Replace `git worktree prune` with _remove_stale_admin_dir() that
only removes the admin dir belonging to the state worktree, leaving
gateway-managed worktree admin dirs intact.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The orchestrator was building container configs independently from the CLI,
causing missing .git shadowing, no --add-host for gateway hostname, GATEWAY_URL
overwritten by gateway_client, no DNS lockdown for private mode, and missing
proxy env vars. This commit introduces a shared config builder
(build_sandbox_config + to_dockerpy_kwargs) so both CLI and orchestrator
produce identical container configurations.

Key changes:
- Add MountSpec, SandboxContainerConfig dataclasses and build_sandbox_config()
  builder to shared/egg_container
- Add git_shadow_mounts() for .git shadow mount generation (uses /dev/null
  bind mount for worktrees, tmpfs for regular repos)
- Add to_dockerpy_kwargs() adapter using Docker API mounts list format to
  support multiple repos with same source (no dict key collision)
- Refactor build_sandbox_docker_cmd() to use build_sandbox_config() internally
- Refactor orchestrator container_spawner to use shared config builder
- Remove get_container_env() and get_proxy_config() from gateway_client
  (subsumed by shared config builder)
- Refactor runtime.py to use shared git_shadow_mounts()
- Add egg_container to orchestrator Dockerfile COPY
- Add comprehensive tests (42 new tests for config builder)
- Update orchestrator tests for new spawn_agent_container() signature

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Copy link
Contributor

@egg-reviewer egg-reviewer bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review: Unify orchestrator and CLI container launch via shared config builder

Good architectural direction. Eliminating the divergence between CLI and orchestrator container configurations by extracting a shared config builder addresses a real class of bugs. The git_shadow_mounts, build_sandbox_config, and to_dockerpy_kwargs abstractions are clean and well-tested. The worktree prune fix in state_store.py is an important correctness fix.

Several issues to address:


1. Bug: EGG_PRIVATE_MODE vs PRIVATE_MODE env var mismatch

sandbox/entrypoint.py:902 — The entrypoint reads EGG_PRIVATE_MODE to detect private mode:

private_mode_env = os.environ.get("EGG_PRIVATE_MODE", "").lower()

But the shared config builder (shared/egg_container/__init__.py:203) sets PRIVATE_MODE (without the EGG_ prefix):

env["PRIVATE_MODE"] = "true"

So containers launched via the new shared builder will never have EGG_PRIVATE_MODE set. The entrypoint's new detection code will always fall through to the legacy fallback (proxy presence inference), which defeats the purpose of making detection explicit.

The sandbox_template.py:137 (not changed in this PR) also sets EGG_PRIVATE_MODE, confirming that's the expected env var name inside containers. The PRIVATE_MODE (no prefix) is used by the gateway process itself, not by sandbox containers.

Fix: Either change the config builder to set EGG_PRIVATE_MODE in addition to (or instead of) PRIVATE_MODE, or change the entrypoint to read PRIVATE_MODE. The former is safer since sandbox_template.py already uses EGG_PRIVATE_MODE.


2. Bug: Duplicate host_uid/host_gid parsing in container_spawner.py

container_spawner.py:273-274 parses HOST_UID/HOST_GID inside the gateway session registration block:

host_uid = int(os.environ.get("HOST_UID", 1000))
host_gid = int(os.environ.get("HOST_GID", 1000))

Then container_spawner.py:305-306 parses the same values again outside that block:

host_uid = int(os.environ.get("HOST_UID", 1000))
host_gid = int(os.environ.get("HOST_GID", 1000))

The first parse at line 273 is inside a try block that can be skipped on GatewayError. If the gateway registration fails, the first parse is wasted. The second parse at line 305 always runs. This is harmless but is clearly a copy-paste artifact — the first occurrence should be removed, or the parsing should be hoisted above both blocks.


3. mount_spec_to_cli_args silently drops volume-type mounts

shared/egg_container/__init__.py:127-140 — The function handles bind and tmpfs mount types, but returns [] for volume type mounts. The build_sandbox_docker_cmd CLI path iterates through config.mounts and calls mount_spec_to_cli_args for each — any volume mounts would be silently dropped.

Currently the CLI path doesn't pass mounts to build_sandbox_config, so this isn't triggered yet. But it's a latent bug since MountSpec supports mount_type="volume" and the orchestrator path handles it. If a caller ever passes a volume mount through the CLI path, it would silently vanish.

Suggestion: Either handle the volume case (--mount type=volume,source=...,destination=...) or raise an error for unsupported types.


4. _tee_stderr_to_file unbounded file growth

sandbox/entrypoint.py:1290-1309 — The tee function writes every stderr line to /tmp/egg-subprocess-stderr.log without any size limit. Long-running containers producing substantial stderr output will fill /tmp. Only the tail is ever read back (_read_subprocess_stderr_tail), so the intermediate data is waste.

Consider either:

  • Using a ring buffer (e.g., collections.deque(maxlen=N)) and writing only at the end
  • Truncating/rotating the file periodically
  • Since you only need the last N lines, keeping a bounded in-memory buffer and writing to disk at process exit would be simpler and safer

5. Minor: _SUBPROCESS_STDERR_LOG placement breaks import readability

sandbox/entrypoint.py:34-35:

_SUBPROCESS_STDERR_LOG = Path("/tmp/egg-subprocess-stderr.log")
from typing import Any, ClassVar

A module-level assignment is wedged between import blocks. Move it after all imports for clarity.


6. Signal handling interaction with stderr PIPE

sandbox/entrypoint.py:1322-1337 — With stderr=subprocess.PIPE, if the parent process receives a signal (SIGTERM during container shutdown), process.wait() may return while the tee thread is still reading from the pipe. The tee_thread.join(timeout=5) handles this, but if the child writes a burst of stderr just before termination, up to 5 seconds of shutdown delay is added. This may be acceptable, but worth noting that the previous subprocess.run(stderr=sys.stderr) approach had zero shutdown overhead for this case.


7. Contract creation failure handling — error message leaks internal details

orchestrator/routes/pipelines.py:370-373 and pipelines.py:441-444 — On contract creation failure, the error response includes the raw exception:

return make_error_response(
    f"Failed to create contract: {contract_err}",
    status_code=500,
)

This is an improvement over silently continuing with a broken pipeline, which is good. But the raw exception string may contain internal paths or stack details. If this endpoint is exposed beyond internal orchestrator use, consider sanitizing the error message.


8. assume_worktree docstring is misleading

shared/egg_container/__init__.py:84-85 — The docstring says:

assume_worktree: If True, always use tmpfs

But the actual behavior when assume_worktree=True is to use /dev/null bind mounts (file-over-file), not tmpfs. The code comment at line 95-99 correctly explains the reasoning. The docstring should be updated to match.


Overall the design is sound — the shared config builder is a meaningful consolidation that eliminates a real category of configuration drift bugs. The state_store.py worktree prune fix correctly addresses a subtle cross-container filesystem visibility issue. Items 1 and 8 need fixing before merge; the rest are improvements.

— Authored by egg

@james-in-a-box

This comment has been minimized.

@jwbron
Copy link
Owner Author

jwbron commented Feb 13, 2026

@james-in-a-box address the above feedback

@james-in-a-box
Copy link
Contributor

Addressed the review feedback. Here's what changed:

1. Bug fix: EGG_PRIVATE_MODE env var mismatch

The config builder now sets both PRIVATE_MODE (for the gateway process) and EGG_PRIVATE_MODE (for sandbox containers) in both private and public modes. This ensures the entrypoint's explicit detection at sandbox/entrypoint.py:902 works correctly instead of falling through to the legacy proxy-presence inference.

2. Bug fix: Duplicate host_uid/host_gid parsing

Hoisted host_uid/host_gid parsing above both the gateway registration try block and the config builder call in container_spawner.py, eliminating the redundant parse at the old line 305.

3. mount_spec_to_cli_args now handles volume mounts

Added volume type support (--mount type=volume,source=...,destination=...) and raises ValueError for genuinely unsupported mount types instead of silently returning []. Added 3 new tests covering volume mount (readonly/readwrite) and the error case.

4. _tee_stderr_to_file bounded

Replaced the unbounded file write with a collections.deque(maxlen=500) ring buffer. Stderr is still tee'd to sys.stderr in real time, but only the last 500 lines are written to disk when the stream ends. This prevents /tmp exhaustion on long-running containers.

5. _SUBPROCESS_STDERR_LOG moved after imports

Moved the module-level constant from between import blocks to after all imports for readability.

6–7. Acknowledged

Items 6 (signal handling / 5s shutdown delay) and 7 (error message sanitization on internal endpoint) are noted. No code changes — the current behavior is acceptable for both.

8. assume_worktree docstring corrected

Updated from "always use tmpfs" to accurately describe the /dev/null bind mount (file-over-file) behavior.

Test results

  • 45 config builder tests pass (3 new)
  • 27 CLI command builder tests pass
  • 31 gateway client tests pass
  • 60 entrypoint tests pass

— Authored by egg

Copy link
Contributor

@egg-reviewer egg-reviewer bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re-review: Unify orchestrator and CLI container launch via shared config builder

All eight items from the previous review have been addressed. The fixes are substantive, not superficial. One new issue found in the cleanup path.


Previous feedback — verification

#1 (Bug: EGG_PRIVATE_MODE mismatch): Fixed. build_sandbox_config() now sets both PRIVATE_MODE and EGG_PRIVATE_MODE for both private and public modes (__init__.py:215-216 and __init__.py:224-225). The entrypoint at entrypoint.py:903 reads EGG_PRIVATE_MODE and will now get the correct value. Good.

#2 (Bug: Duplicate host_uid/host_gid parsing): Fixed. Parsing is hoisted to container_spawner.py:265-266, above both the gateway try block and the config builder call. The redundant parse is removed.

#3 (mount_spec_to_cli_args volume support): Fixed. Volume type is now handled at __init__.py:143-148, and unrecognized types raise ValueError at line 152. Three new tests cover this (test_volume_mount, test_volume_mount_readwrite, test_unsupported_mount_type_raises).

#4 (_tee_stderr_to_file bounded): Fixed. Uses collections.deque(maxlen=500) ring buffer at entrypoint.py:1307. Only the tail is written to disk when the stream ends. The unbounded growth issue is resolved.

#5 (_SUBPROCESS_STDERR_LOG placement): Fixed. Moved to entrypoint.py:37-38, after all imports and before the first section header. Clean.

#6 (Signal handling / shutdown delay): Acknowledged, no change needed. Acceptable.

#7 (Error message sanitization): Acknowledged, no change needed. Acceptable for internal endpoint.

#8 (assume_worktree docstring): Fixed. The git_shadow_mounts docstring at __init__.py:84-87 now correctly describes /dev/null bind mounts (file-over-file) behavior.


New issues

1. Bug: Pipeline cleanup on contract failure does not commit deletion

pipelines.py:365 and pipelines.py:436 — When contract creation fails, the cleanup calls store.delete_pipeline(pipeline.id, commit=False). But create_pipeline() (called just above) commits the pipeline file to the git worktree by default. So the deletion only removes the file from the filesystem — it is not committed to git.

On the next orchestrator restart (or any call that re-reads the worktree from git), the zombie pipeline reappears without a contract. The commit=False argument was presumably intended to avoid noisy commits during an error path, but it defeats the cleanup.

Fix: Pass commit=True (or omit commit to use the default), so the deletion is actually persisted. For issue pipelines in particular, delete_pipeline already commits when commit=True and the pipeline is not local-prefixed.

2. Stale comment: assume_worktree=True (tmpfs) in container_spawner.py

container_spawner.py:248:

# Orchestrator can't stat host paths, so assume_worktree=True (tmpfs).

This is left over from before the docstring fix. assume_worktree=True produces /dev/null bind mounts (file-over-file), not tmpfs. The parenthetical should say (/dev/null bind) or simply be removed.


Item 1 is a correctness bug that should be fixed before merge. Item 2 is minor.

— Authored by egg

@james-in-a-box

This comment has been minimized.

Three bugs caused failures when the orchestrator spawns multiple
containers (coder + reviewer) in the refine phase:

1. Session-container IP binding: register_session() hardcodes
   EGG_CONTAINER_IP but Docker assigns actual IPs dynamically. After
   container start, update the gateway session with the actual IP so
   the gateway accepts git operations from reviewer containers.

2. AgentRole mapping crash: REVIEWER and CHECKER roles have no
   ContractAgentRole equivalent. Changed map_agent_role_to_contract_role
   to return Optional and skip contract updates for unmappable roles,
   preventing KeyError on completion signals.

3. Decision queue race condition: get_decision() and
   get_pending_decisions() loaded pipeline state without holding the
   lock, allowing another thread to overwrite state between
   queue_decision() and wait_for_decision(). Added lock acquisition
   to both methods.

Authored-by: egg
Copy link
Contributor

@egg-reviewer egg-reviewer bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re-review: Unify orchestrator and CLI container launch via shared config builder

The branch has been rebased into a single commit (99ad2f2 Fix multi-agent pipeline bugs for refine phase). Reviewed the full PR diff against main.

Previous feedback — verification

Original review items #1#8 were addressed in the prior round. The second review at ce25386 found two remaining issues. Neither has been fixed.


1. Bug (unfixed from prior review): delete_pipeline(commit=False) leaves zombie pipelines in git

pipelines.py:365 and pipelines.py:436 — When contract creation fails, cleanup calls:

store.delete_pipeline(pipeline.id, commit=False)

create_pipeline() at state_store.py:484 calls save_pipeline(pipeline, message=commit_msg) with default commit=True, which commits the pipeline file to the git worktree. The subsequent delete_pipeline(..., commit=False) only calls path.unlink() — it removes the file from disk but does NOT commit the deletion.

On orchestrator restart (which re-reads state from git), the pipeline file reappears as a zombie without a contract. For issue pipelines specifically, is_local is False and should_commit = commit and (not is_local or force_commit) = False and True = False, confirming the deletion is not committed.

Fix: Pass commit=True (or omit the commit kwarg) so the deletion is persisted to git. The save_pipeline that created the pipeline already committed, so the deletion must also commit to undo it.


2. Stale comment (unfixed from prior review): assume_worktree=True (tmpfs)

container_spawner.py:248:

# Orchestrator can't stat host paths, so assume_worktree=True (tmpfs).

assume_worktree=True produces /dev/null bind mounts (file-over-file), not tmpfs. The git_shadow_mounts docstring at __init__.py:84-87 was corrected in the prior round, but this caller-side comment was not updated.


3. Stale docstring: _spawn_and_wait says "tmpfs" for .git shadow

pipelines.py:1304-1305:

    with .git shadowed by tmpfs to force gateway git operations.

Same issue as #2. The orchestrator path uses assume_worktree=True, which produces /dev/null bind mounts, not tmpfs.


4. complete_agent silently drops outputs for non-contract roles

dispatch.py:189-199 — When contract_role is None (REVIEWER, CHECKER), save_agent_output is skipped:

if contract_role is not None:
    self.contract_orchestrator.complete_agent(...)
    if outputs:
        save_agent_output(self.repo_path, contract_role, outputs)

The save_agent_output call is nested inside the if contract_role is not None block, so any outputs passed for a REVIEWER or CHECKER completion are silently discarded. If this is intentional (reviewers/checkers store verdicts elsewhere), add a comment explaining why. If not, the save_agent_output should be moved outside the if block, though it would need a non-None key to save under.


Item 1 is a correctness bug that should be fixed before merge. Items 2–3 are documentation bugs. Item 4 needs clarification.

— Authored by egg

@jwbron
Copy link
Owner Author

jwbron commented Feb 13, 2026

@james-in-a-box address this feedback: #594 (review)

@james-in-a-box

This comment has been minimized.

@james-in-a-box
Copy link
Contributor

Addressed the review feedback. Here's what changed:

1. Bug fix: delete_pipeline(commit=False) leaves zombie pipelines in git

Changed both contract failure cleanup paths (pipelines.py:365 and pipelines.py:436) from store.delete_pipeline(pipeline.id, commit=False) to store.delete_pipeline(pipeline.id) (default commit=True).

For issue pipelines, create_pipeline() commits the pipeline file to git via save_pipeline(commit=True). With commit=False on deletion, only the filesystem file was removed — on orchestrator restart the pipeline would reappear from git as a zombie without a contract. Now the deletion is committed, matching the creation.

For local pipelines, save_pipeline skips git commit unless force_commit=True, and delete_pipeline with commit=True also skips commit for local pipelines (via the is_local check), so the behavior is correct either way — but using the default is cleaner and consistent.

2. Stale comment: assume_worktree=True (tmpfs) in container_spawner.py

Updated container_spawner.py:248 from "(tmpfs)" to "(/dev/null bind)" to match the actual file-over-file bind mount behavior.

3. Stale docstring: _spawn_and_wait says "tmpfs" for .git shadow

Updated pipelines.py:1305 from "shadowed by tmpfs" to "shadowed by /dev/null bind mounts".

4. complete_agent silently drops outputs for non-contract roles

Intentional behavior. Added a comment at dispatch.py:197-200 explaining why: REVIEWER and CHECKER roles don't have contract counterparts (their map_agent_role_to_contract_role returns None), so there's no contract role key to save outputs under. Their verdicts are stored in the AgentExecution record returned by complete_agent, not as contract outputs. save_agent_output requires a ContractAgentRole which doesn't exist for these roles.

Test results

  • 45 config builder tests pass
  • 30 container spawner tests pass
  • 31 gateway client tests pass
  • 60 entrypoint tests pass (1 skipped)
  • 1 pre-existing failure in test_state_store.py::TestStateValidation::test_load_invalid_schema_fails (unrelated)

— Authored by egg

Copy link
Contributor

@egg-reviewer egg-reviewer bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re-review: Unify orchestrator and CLI container launch via shared config builder

All four items from the previous review have been properly addressed.


Previous feedback — verification

#1 (Bug: delete_pipeline(commit=False) zombie pipelines): Fixed. Both cleanup paths (pipelines.py:365 and pipelines.py:436) now call store.delete_pipeline(pipeline.id) with the default commit=True. For issue pipelines (is_local=False), delete_pipeline evaluates should_commit = True and (not False or False) = True, so the deletion is committed to git, matching the save_pipeline commit at creation. For local pipelines, should_commit evaluates to False, which is consistent with save_pipeline also skipping git commits for local pipelines (both disk-only). No zombie possible in either path.

#2 (Stale comment: assume_worktree=True (tmpfs)): Fixed. container_spawner.py:248 now reads (/dev/null bind).

#3 (Stale docstring: _spawn_and_wait says "tmpfs"): Fixed. pipelines.py:1305 now reads /dev/null bind mounts.

#4 (complete_agent silently drops outputs for non-contract roles): Addressed with explanatory comment at dispatch.py:197-200 explaining that REVIEWER/CHECKER roles don't have contract counterparts and their verdicts are stored in the AgentExecution record. Clear and correct.


New code — no blocking issues found

Examined the full delta. The changes are clean and well-scoped. A few minor advisory notes:

1. _tee_stderr_to_file bare except Exception: pass (entrypoint.py:1319)

The entire tee thread body is wrapped in except Exception: pass. If the tee thread fails silently (e.g., sys.stderr.buffer write fails), stderr output is lost and _SUBPROCESS_STDERR_LOG is never written. The process continues fine, but error signals to the orchestrator will have no stderr context. Acceptable for now given this is best-effort diagnostic data, but a logger.debug in the except clause would aid debugging.

2. to_dockerpy_kwargs does not forward container_ip (__init__.py:274-292)

SandboxContainerConfig carries container_ip but to_dockerpy_kwargs doesn't include it in the returned kwargs. The orchestrator path doesn't currently need static IPs (it uses _get_container_ip after start + update_session), so this is not a bug today. But it's a subtle gap between the CLI path (which passes --ip via build_sandbox_docker_cmd) and the docker-py path. If a future caller passes container_ip through the orchestrator path expecting it to be applied, it would be silently ignored.

— Authored by egg

@james-in-a-box

This comment has been minimized.

@jwbron
Copy link
Owner Author

jwbron commented Feb 13, 2026

@james-in-a-box address the non-blocking issues here #594 (review)

@james-in-a-box
Copy link
Contributor

Addressed the non-blocking advisory notes. Here's what changed:

1. _tee_stderr_to_file bare except Exception: pass (entrypoint.py:1319)

Replaced with except Exception as exc: that prints a debug message to stderr before swallowing the error. The tee thread is still best-effort (it won't crash the container), but failures are no longer completely silent — the exception is logged to stderr for post-mortem diagnosis.

2. to_dockerpy_kwargs now forwards container_ip (__init__.py:293-300)

When config.container_ip is set, to_dockerpy_kwargs now includes a networking_config dict with the Docker Engine API's EndpointsConfig format, specifying the IPv4Address under the config's network. This closes the gap between the CLI path (which uses --ip) and the docker-py path. Two new tests cover the positive and negative cases.

Test results

  • 47 config builder tests pass (2 new)
  • 60 entrypoint tests pass (1 skipped, pre-existing)

— Authored by egg

Copy link
Contributor

@egg-reviewer egg-reviewer bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re-review: Unify orchestrator and CLI container launch via shared config builder

The branch has been rebased into a single squashed commit (7ef3fa5). Reviewed the full PR diff. Both non-blocking suggestions from the previous approval have been addressed.


Previous feedback — verification

#1 (_tee_stderr_to_file bare except Exception: pass): Fixed. The except clause at entrypoint.py:1319-1328 now prints a [DEBUG] message to stderr before swallowing the exception. Best-effort diagnostic failures are no longer completely silent.

#2 (to_dockerpy_kwargs does not forward container_ip): Fixed. When config.container_ip is set, to_dockerpy_kwargs at __init__.py:291-298 now produces a networking_config dict with EndpointsConfig containing the IPv4Address under the config's network name. Two new tests (test_container_ip_produces_networking_config, test_no_container_ip_omits_networking_config) cover both the positive and negative cases. The networking_config passes through docker_client.create_container()'s **kwargs and coexists correctly with the named network parameter in Docker SDK 7.x.


New code — no blocking issues found

Examined the full delta systematically. The implementation is clean.

Advisory notes (non-blocking):

1. _read_subprocess_stderr_tail race if tee thread join times out

In _run_with_stderr_capture (entrypoint.py:1355-1356), process.wait() followed by tee_thread.join(timeout=5) is correct — once the child exits, the write end of the stderr pipe closes, so readline() returns b"" and the thread exits the loop and flushes the ring buffer to disk. However, if the join times out (e.g., the thread is stuck in the disk write on a degraded filesystem), _read_subprocess_stderr_tail(30) at line 1362 may read a partially-written or absent log file. The impact is minor (truncated stderr context in the error signal, not data corruption) and the 5-second timeout is generous. Acceptable as-is.

2. volumes=volumes or {} in docker_client.create_container — vestigial parameter

docker_client.py:188 always passes volumes=volumes or {} to the Docker SDK. The orchestrator path now uses mounts (list format via to_dockerpy_kwargs), not volumes (dict format). Since to_dockerpy_kwargs doesn't include volumes in its kwargs, the SDK receives volumes={} alongside mounts=[...]. This works correctly — volumes={} is a no-op and mounts takes precedence. But the volumes parameter on create_container is now unused by the orchestrator path and could be cleaned up in a follow-up to reduce confusion about which mount API is canonical.

— Authored by egg

@james-in-a-box
Copy link
Contributor

egg review completed. View run logs

4 previous review(s) hidden.

@jwbron jwbron merged commit d9d3e23 into main Feb 13, 2026
15 checks passed
github-actions bot pushed a commit that referenced this pull request Feb 13, 2026
Update documentation to reflect changes from PR #594 which unified
orchestrator and CLI container launch via a shared config builder.

Changes:
- Add egg_container to STRUCTURE.md shared libraries section
- Update shared/README.md with new egg_container functionality:
  - build_sandbox_config() framework-agnostic config builder
  - to_dockerpy_kwargs() adapter for docker-py SDK
  - git_shadow_mounts() for .git mount generation
  - SandboxContainerConfig and MountSpec dataclasses

Triggered by: https://github.com/anthropics/egg/pull/594

Authored-by: egg
jwbron added a commit that referenced this pull request Feb 13, 2026
* docs: Update shared library docs for egg_container

Update documentation to reflect changes from PR #594 which unified
orchestrator and CLI container launch via a shared config builder.

Changes:
- Add egg_container to STRUCTURE.md shared libraries section
- Update shared/README.md with new egg_container functionality:
  - build_sandbox_config() framework-agnostic config builder
  - to_dockerpy_kwargs() adapter for docker-py SDK
  - git_shadow_mounts() for .git mount generation
  - SandboxContainerConfig and MountSpec dataclasses

Triggered by: https://github.com/anthropics/egg/pull/594

Authored-by: egg

* Fix incorrect API usage examples in shared/README.md

- build_sandbox_config(): Remove non-existent params repo_volumes and
  container_id, add required image param
- build_sandbox_docker_cmd(): Use keyword-only args matching actual
  function signature instead of positional arg

---------

Co-authored-by: jwbron <8340608+jwbron@users.noreply.github.com>
Co-authored-by: egg-reviewer[bot] <261018737+egg-reviewer[bot]@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant