Skip to content

Add egg-orchestrator for local SDLC pipeline execution#524

Merged
jwbron merged 43 commits intomainfrom
egg/issue-496
Feb 12, 2026
Merged

Add egg-orchestrator for local SDLC pipeline execution#524
jwbron merged 43 commits intomainfrom
egg/issue-496

Conversation

@jwbron
Copy link
Owner

@jwbron jwbron commented Feb 11, 2026

Summary

Introduces egg-orchestrator, a new Docker container service that manages the
full SDLC pipeline lifecycle locally, replacing GitHub Actions orchestration.
The orchestrator stores state in git (consistent with contracts), manages
sandbox container lifecycle via Docker API, supports multi-agent parallel
execution, and integrates HITL decisions through Claude Code sessions.

This latest commit adds 14 integration tests (all passing) and fixes three
sandbox container spawning bugs that caused immediate failure.

Bug fixes

The orchestrator's _run_pipeline() was spawning sandbox containers that
immediately failed for three reasons:

  1. Missing repo volume mount — Docker socket operates on the host daemon,
    so volume mounts need host paths, not container paths. Added
    EGG_HOST_REPOS_DIR env var to pass the host-side repo path.

  2. Missing GATEWAY_URL — Sandbox entrypoint health check defaulted to
    http://egg-gateway:9848 (DNS name) which doesn't resolve from a container
    on the isolated network. Now passes the IP-based URL from the orchestrator.

  3. No command passed — Without explicit args, the sandbox entrypoint
    launched interactive Claude Code (no TTY attached → immediate error). Now
    builds a claude --print command with phase-specific prompt matching the
    GHA SDLC workflow flags (--verbose, --output-format stream-json,
    --max-turns 200).

Additionally: container logs are now captured before cleanup, so failure
messages include the last 10 lines of sandbox output for debugging.

Integration tests (14 passing)

Test Verifies
test_create_local_pipeline POST creates pipeline with correct mode/prompt/status
test_start_local_pipeline_completes Full lifecycle: create → start → poll → complete
test_local_pipeline_no_pr_phase Local = 3 phases (refine/plan/implement), no PR
test_local_pipeline_container_failure FORCE_FAIL prompt → exit 1, pipeline fails
test_failure_error_includes_container_logs Error message contains sandbox log output
test_sandbox_receives_pipeline_env Pipeline env vars passed (exit 2 if missing)
test_sandbox_receives_gateway_url GATEWAY_URL passed (exit 3 if missing)
test_sandbox_receives_repo_volume Repo volume mounted (exit 4 if missing)
test_start_running_pipeline_returns_409 Cannot start already-running pipeline
test_start_completed_pipeline_returns_409 Cannot start completed pipeline
test_issue_pipeline_includes_pr_phase Issue mode runs 4 phases including PR
test_pipeline_crud Full CRUD lifecycle
test_gateway_local_mode_blocks_push Local session push → 403
test_failed_pipeline_returns_409 Failed pipeline → 409 on restart

Known issues (analysis, not yet fixed)

1. Missing EGG_SESSION_TOKEN for local-mode containers

Symptom: The sandbox agent hits ERROR: EGG_SESSION_TOKEN not set when
trying to use git operations inside the container.

Root cause: container_spawner.py:200 has if mode != "local" and repos:
which explicitly skips gateway session registration for local pipelines. The
original rationale was "local pipelines don't need gateway access," but:

  • The sandbox's git/gh wrapper scripts always require EGG_SESSION_TOKEN
  • The gateway enforces local-mode restrictions (push blocking) at the
    session level — without a session, there's no enforcement
  • Even for local mode, the sandbox needs the session for worktree setup

Fix: Remove the mode != "local" guard and always register sessions.
For local pipelines without an explicit repo, detect the repo from the
mounted volume or pass a default. The gateway already handles mode="local"
sessions correctly (blocks push, allows read).

2. OOM kill on memory-constrained systems (exit code 137)

Symptom: Sandbox container exits with code 137, empty logs, container
already gone before log capture.

Root cause: Exit code 137 = SIGKILL (128 + 9), typically from the kernel
OOM killer. Spawning a full sandbox container (~400MB for Claude Code) on a
system that's already under memory pressure triggers OOM. Observed on a 32GB
system with ~1GB free RAM and swap nearly exhausted (Firefox was consuming
~15GB).

Analysis: The sandbox entrypoint runs 11 setup phases before launching
Claude Code. Most are necessary even for local mode (worktree isolation
prevents stomping on the user's working tree). Reducing peak memory without
skipping critical phases requires investigation into:

  • OOM detection: Check Docker's OOMKilled flag after exit code 137 for
    a clear diagnostic message instead of mysterious "Container exited with
    code 137"
  • Pre-flight memory check: Before spawning, verify available memory and
    fail early with actionable guidance
  • Entrypoint optimization: Identify which setup phases can be deferred or
    made lighter without compromising isolation guarantees

Closes: #496

@james-in-a-box

This comment has been minimized.

@james-in-a-box

This comment has been minimized.

@james-in-a-box

This comment has been minimized.

@james-in-a-box

This comment has been minimized.

Copy link
Contributor

@james-in-a-box james-in-a-box bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No agent-mode design concerns.

This PR introduces the egg-orchestrator service which manages SDLC pipeline lifecycle locally. From an agent-mode design perspective, the architecture aligns well with the principles in docs/guides/agent-mode-design.md:

Good patterns observed:

  • Agent autonomy: Sandboxes are spawned with context (pipeline ID, issue number, agent role) but agents fetch their own data through tools rather than receiving pre-baked diffs or logs
  • No structured output requirements: The orchestrator communicates with agents via simple signals (completion, failure) rather than requiring parsed JSON output
  • Direct action: Agents post reviews, create PRs, and commit code directly via gateway—no post-processing pipelines parsing agent output
  • Objectives over procedures: The /sdlc skill provides the "what" (start a pipeline for issue N) and lets the orchestrator handle orchestration while agents handle execution

Architecture notes:

  • The SandboxConfig passes lightweight metadata (session tokens, role, issue number) which helps orient agents without constraining them
  • Container spawning provides environment setup (proxy, gateway URLs) rather than pre-fetched content
  • The decision queue for HITL provides objectives ("approve this draft") rather than procedural instructions

The design keeps orchestration logic in the orchestrator (where it belongs) and lets agents operate autonomously within their sandboxes.

— Authored by egg

@james-in-a-box

This comment has been minimized.

Copy link
Contributor

@james-in-a-box james-in-a-box bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review: PR #524 - Egg Orchestrator

This is a comprehensive review of the orchestrator implementation. I've analyzed all files systematically for security vulnerabilities, correctness issues, race conditions, and design problems.


🔴 Critical Security Issues

1. Container ID Injection Vulnerability in docker_client.py

Location: orchestrator/docker_client.py - logs(), stop(), remove(), and other methods

The container ID is passed directly to Docker API calls without validation. An attacker who can control the container ID could potentially manipulate Docker operations.

def logs(self, container_id: str, ...) -> str:
    container = self.client.containers.get(container_id)  # No validation

Fix: Add container ID format validation before API calls:

def _validate_container_id(self, container_id: str) -> None:
    if not re.match(r'^[a-zA-Z0-9][a-zA-Z0-9_.-]*$', container_id):
        raise ValueError(f"Invalid container ID format: {container_id}")

2. Path Traversal Risk in state_store.py

Location: orchestrator/state_store.py - _pipeline_path() method

Pipeline IDs are used directly to construct file paths without proper sanitization:

def _pipeline_path(self, pipeline_id: str) -> Path:
    return self.pipelines_dir / f"{pipeline_id}.json"

If pipeline_id contains ../, it could write files outside the intended directory.

Fix: Validate and sanitize pipeline IDs:

def _pipeline_path(self, pipeline_id: str) -> Path:
    if not re.match(r'^issue-[0-9]+$', pipeline_id):
        raise ValueError(f"Invalid pipeline ID format: {pipeline_id}")
    path = self.pipelines_dir / f"{pipeline_id}.json"
    # Ensure resolved path is still under pipelines_dir
    if not path.resolve().is_relative_to(self.pipelines_dir.resolve()):
        raise ValueError("Path traversal detected")
    return path

3. Missing Authentication on API Endpoints

Location: All files in orchestrator/routes/

None of the API endpoints implement authentication or authorization. Any process with network access to the orchestrator can:

  • Create/delete pipelines
  • Spawn containers
  • Resolve HITL decisions
  • Modify pipeline state

Fix: Add authentication middleware using gateway session tokens:

def require_auth(f):
    @wraps(f)
    def decorated(*args, **kwargs):
        token = request.headers.get('X-Session-Token')
        if not token or not validate_session_token(token):
            return jsonify({"error": "Unauthorized"}), 401
        return f(*args, **kwargs)
    return decorated

4. Command Injection in Webhook Handler

Location: orchestrator/webhooks.py - Line ~13450

The issue event handler directly uses issue_number from the webhook payload without validation:

pipeline_id=f"issue-{issue_number}"

While not immediately exploitable, a malformed issue number from a spoofed webhook could cause issues.

Fix: Validate issue_number is an integer before use.


🟠 Correctness Issues

5. Race Condition in Container Spawning

Location: orchestrator/container_spawner.py - spawn_container() method

The method reads state, creates container, then updates state without any locking mechanism. Concurrent spawn requests could result in:

  • Duplicate containers
  • Lost state updates
  • Inconsistent pipeline state

Fix: Add a locking mechanism:

_spawn_locks: dict[str, threading.Lock] = {}

def spawn_container(self, ...):
    lock_key = f"{pipeline_id}:{agent_role}"
    if lock_key not in _spawn_locks:
        _spawn_locks[lock_key] = threading.Lock()
    
    with _spawn_locks[lock_key]:
        # existing spawn logic

6. Race Condition in Phase Transitions

Location: orchestrator/routes/phases.py - advance_phase() endpoint

Multiple concurrent requests to advance phases could result in skipped phases or invalid transitions:

@phases_bp.route("/<pipeline_id>/phases/advance", methods=["POST"])
def advance_phase(pipeline_id: str):
    pipeline = store.load_pipeline(pipeline_id)
    # No lock between load and save
    pipeline.current_phase = next_phase
    store.save_pipeline(pipeline)

Fix: Implement optimistic locking using a version field on the Pipeline model.

7. Missing Error Handling for Docker Connection Failures

Location: orchestrator/docker_client.py - __init__()

The DockerClient constructor doesn't handle connection failures gracefully:

def __init__(self, ...):
    self.client = docker.from_env()  # Can raise DockerException

Fix: Wrap in try/except and provide meaningful error message.

8. Histogram Bucket Counting Bug

Location: orchestrator/metrics.py - Histogram.observe() method

The bucket counting logic appears to count values into ALL buckets where value <= bucket rather than just the appropriate bucket:

for bucket in self.buckets:
    if value <= bucket:
        self._bucket_counts[bucket] += 1  # Increments multiple buckets

This is actually the correct cumulative behavior for Prometheus histograms, but the get_bucket_counts() returns cumulative counts which may confuse users expecting per-bucket counts. Consider adding documentation to clarify.

9. Git Commit Can Fail Silently

Location: orchestrator/state_store.py - _commit()

If git operations fail, the error is logged but not raised, which could lead to data loss:

def _commit(self, message: str) -> None:
    try:
        self._run_git(["add", ...])
        self._run_git(["commit", "-m", message])
    except Exception as e:
        logger.error("Git commit failed", error=str(e))
        # Error is swallowed!

Fix: Re-raise the exception or implement a retry mechanism with eventual consistency guarantees.


🟡 Robustness Issues

10. Missing Request Validation in All Routes

Location: All routes in orchestrator/routes/

Most endpoints don't validate required fields:

@pipelines_bp.route("/", methods=["POST"])
def create_pipeline():
    data = request.get_json()
    issue_number = data.get("issue_number")  # Could be None
    repo = data.get("repo")  # Could be None

Fix: Add Pydantic request validation:

class CreatePipelineRequest(BaseModel):
    issue_number: int
    repo: str
    branch: str

11. Timeout Not Enforced on Container Operations

Location: orchestrator/docker_client.py - multiple methods

Docker operations can hang indefinitely if a container becomes unresponsive:

def logs(self, container_id: str, tail: int = 100) -> str:
    container = self.client.containers.get(container_id)
    return container.logs(tail=tail).decode()  # No timeout

Fix: Add timeout parameters to Docker API calls.

12. Circuit Breaker Has No Persistence

Location: orchestrator/resilience.py - CircuitBreaker class

Circuit breaker state is stored in-memory. If the orchestrator restarts, all circuits reset to CLOSED state, potentially causing cascading failures.

Fix: Persist circuit breaker state to Redis or the state store.

13. HITL Decision Timeout Not Actively Checked

Location: orchestrator/routes/decisions.py

The timeout is stored on decisions but there's no background task to check for and handle timeouts:

@decisions_bp.route("/<pipeline_id>/decisions/check-timeouts", methods=["POST"])
def check_timeouts(pipeline_id: str):

This endpoint exists but requires manual invocation.

Fix: Add a background scheduler (APScheduler or similar) to check timeouts periodically.

14. Multi-Agent Executor Missing Error Recovery

Location: orchestrator/multi_agent.py - MultiAgentExecutor

If one agent in a wave fails, there's no clear recovery strategy:

async def execute_wave(self, wave: AgentWave) -> list[AgentExecution]:
    results = await asyncio.gather(*tasks, return_exceptions=True)

Exceptions are captured but not properly handled - the wave continues without addressing failures.

Fix: Implement proper failure handling with configurable strategies (fail-fast, continue, retry).


🔵 Design Issues

15. Inconsistent Error Response Format

Location: All routes

Some endpoints return {"error": "message"} while others return {"success": false, "message": "..."}:

# containers.py
return jsonify({"error": "Container not found"}), 404

# pipelines.py  
return jsonify({"success": False, "message": "Pipeline not found"}), 404

Fix: Standardize on a single error response format across all endpoints.

16. Circular Import Risk

Location: orchestrator/webhooks.py

The module imports from events inside handler functions to avoid circular imports:

def _handle_issue_event(payload):
    from events import EventType, emit_event  # Late import

This is a code smell indicating the module structure may need refactoring.

Fix: Consider a more modular event system with proper dependency injection.

17. Hardcoded Network Configuration

Location: orchestrator/gateway_client.py and orchestrator/sandbox_template.py

Network CIDRs and IPs are hardcoded:

SANDBOX_NETWORK = "172.32.0.0/24"

Fix: Move to configuration file or environment variables.

18. Missing Structured Logging Context

Location: Multiple files

Many log statements don't include relevant context:

logger.info("Container spawned")  # Missing container_id, pipeline_id

Fix: Always include relevant identifiers in log statements.


📝 Test Coverage Gaps

19. No Integration Tests

The tests are all unit tests with mocked dependencies. There are no integration tests that verify:

  • End-to-end pipeline execution
  • Docker container lifecycle
  • Gateway integration
  • State persistence under concurrent access

20. No Load/Stress Tests

No tests verify behavior under concurrent load or when resources are constrained.

21. Tests Don't Cover Error Paths

Many error scenarios are untested:

  • Docker daemon unavailable
  • Git repository corrupted
  • Network failures to gateway
  • Disk full conditions

✅ Positive Notes

  • The Pydantic models are well-structured with proper validation
  • The event system design is clean and extensible
  • Circuit breaker pattern is correctly implemented
  • Webhook signature verification is properly done with constant-time comparison
  • Test fixtures are well organized

Summary

Severity Count
Critical Security 4
Correctness 5
Robustness 5
Design 4
Test Gaps 3

Recommendation: Request changes. The security issues (container ID validation, path traversal, missing auth) must be addressed before merge. The race conditions should also be fixed as they could cause data corruption in production.

— Authored by egg

@james-in-a-box

This comment has been minimized.

@jwbron
Copy link
Owner Author

jwbron commented Feb 11, 2026

@james-in-a-box address this feedback: #524 (review)

@james-in-a-box
Copy link
Contributor

Addressed Review Feedback

I've pushed changes (7671581) addressing the critical security issues and race conditions from the review:

✅ Fixed: Critical Security Issues

1. Container ID Injection Vulnerability (docker_client.py)

  • Added _validate_container_id() function with regex pattern validation
  • Pattern allows: 12-64 char hex IDs or valid container names (alphanumeric, underscore, hyphen, period)
  • Added validation to all methods accepting container_id: start_container, stop_container, remove_container, get_container_info, get_container_logs, wait_for_container
  • New InvalidContainerIdError exception for invalid formats

2. Path Traversal Risk (state_store.py)

  • Added _validate_pipeline_id() function enforcing issue-{number} format
  • Added secondary check ensuring resolved path stays within pipelines_dir
  • New InvalidPipelineIdError exception for invalid formats or traversal attempts

✅ Fixed: Race Conditions

5. Race Condition in Phase Transitions (routes/phases.py)

  • Added version field to Pipeline model for optimistic locking
  • Modified save_pipeline() to accept expected_version parameter
  • Added VersionConflictError exception (returns 409 Conflict)
  • Updated all phase mutation endpoints (advance_phase, start_phase, complete_phase, fail_phase) to use optimistic locking

✅ Already Addressed: Docker Connection Failures

7. Docker Connection Error Handling (docker_client.py)

  • Verified: The __init__ method already wraps docker.from_env() in try/except and raises DockerClientError with a meaningful message (line 79-85)

Route Updates

All route files updated with proper exception handling for the new error types:

  • routes/containers.py: Handles InvalidContainerIdError (400)
  • routes/pipelines.py: Handles InvalidPipelineIdError (400)
  • routes/phases.py: Handles InvalidPipelineIdError (400) and VersionConflictError (409)
  • routes/decisions.py: Handles InvalidPipelineIdError (400)

Remaining Items (not addressed in this commit)

The following items from the review are noted but not yet implemented:

I recommend addressing the authentication requirement (#3) as a follow-up, since it requires decisions on the auth mechanism and integration with the gateway session tokens.

— Authored by egg

@jwbron
Copy link
Owner Author

jwbron commented Feb 11, 2026

@james-in-a-box pull main into this branch

@james-in-a-box
Copy link
Contributor

Merged main into this branch

Successfully pulled main into egg/issue-496 (commit 9301879).

Changes merged from main:

No merge conflicts occurred. The branch is now up to date with main.

— Authored by egg

@james-in-a-box james-in-a-box bot dismissed their stale review February 11, 2026 20:07

Superseded by new review

Copy link
Contributor

@james-in-a-box james-in-a-box bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No agent-mode design concerns with the delta since my last review.

The changes since commit a1c45ddd address security and race condition issues identified in the base code review, and they do so in a way that aligns well with agent-mode design principles:

Good pattern: Infrastructure-level constraints

  • Container ID validation and pipeline ID validation are enforced in docker_client.py and state_store.py, not via prompt instructions
  • Optimistic locking with version conflicts is handled at the state store level, returning proper 409 errors
  • Path traversal protection uses resolved path validation, not trust in agent behavior

This follows the core principle from the design guide: "The sandbox is the constraint." These security fixes extend the infrastructure boundary rather than relying on prompt-level restrictions that agents could ignore.

No changes to agent prompts, pre-fetching patterns, or output format requirements were introduced in this delta.

— Authored by egg

Copy link
Contributor

@james-in-a-box james-in-a-box bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re-Review: PR #524 - Egg Orchestrator

This re-review examines changes made since commit a1c45ddd.


✅ Issues Properly Addressed

1. Container ID Injection Vulnerability (docker_client.py)

Fixed correctly. The _validate_container_id() function validates container IDs using an appropriate regex pattern that allows:

  • 12-64 character hex IDs (short/full Docker IDs)
  • Valid container names (alphanumeric, underscore, hyphen, period - not starting with hyphen/period)

The validation is applied to all relevant methods: start_container, stop_container, remove_container, get_container_info, get_container_logs, wait_for_container.

2. Path Traversal Risk (state_store.py)

Fixed correctly. The implementation adds:

  • _validate_pipeline_id() with strict ^issue-[0-9]+$ pattern enforcement
  • Secondary is_relative_to() check to catch any bypass attempts
  • Appropriate InvalidPipelineIdError exception

3. Optimistic Locking for Phase Transitions (routes/phases.py)

Fixed correctly. All phase mutation endpoints (advance_phase, start_phase, complete_phase, fail_phase) now:

  • Capture original_version before modifications
  • Pass expected_version to save_pipeline()
  • Handle VersionConflictError with 409 Conflict response

4. Version Field on Pipeline Model (models.py)

Fixed correctly. The version field is added with default=1, ge=1 constraint and proper documentation.


🟠 Remaining Issues from Previous Review (Not Addressed)

These were noted as "not addressed in this commit" in the previous response, which is acceptable, but they should still be tracked:

5. Container Spawner Race Condition (container_spawner.py)

Not addressed. The spawn_agent_container method still has no locking mechanism. Concurrent calls for the same pipeline/role could create duplicate containers. This is a lower priority than the fixed issues, but should be addressed before high-concurrency use.

6. Silent Git Commit Failures (state_store.py:224)

Partially addressed. The _commit_state method now raises GitOperationError if git commands fail, which addresses the concern. However, _run_git catches CalledProcessError and re-raises as GitOperationError, which is appropriate.


🔴 New Issue Found

7. Missing Pipeline ID Validation in routes/signals.py

The signal routes do not validate pipeline IDs before using them. While get_state_store eventually validates in _get_pipeline_path, the error handling doesn't catch InvalidPipelineIdError:

# signals.py:169 - only catches PipelineNotFoundError, not InvalidPipelineIdError
try:
    store = get_state_store(repo_path)
    pipeline = store.load_pipeline(pipeline_id)
    ...
except PipelineNotFoundError:
    return make_error_response(...)

Fix: Add InvalidPipelineIdError to the exception handlers in handle_complete_signal, handle_error_signal, and the batch handler.


🟡 Design Observations

8. Inconsistent Error Response Handling

Routes containers.py, phases.py, pipelines.py, and decisions.py now properly handle InvalidPipelineIdError. However, signals.py was missed. The pattern should be consistently applied across all routes.

9. Route files containers.py, pipelines.py, phases.py, decisions.py

All properly import and handle the new exception types. Good work on consistency.


Summary

Category Status
Container ID Validation ✅ Fixed
Path Traversal ✅ Fixed
Optimistic Locking ✅ Fixed
Pipeline Model Version ✅ Fixed
Container Spawner Race ⚠️ Not addressed (acceptable deferral)
Signals.py Error Handling 🔴 New issue - must fix

Recommendation: Request changes. The signals.py error handling gap needs to be addressed since it's a simple fix (adding one import and one exception handler to each signal handler) and follows the same pattern already applied to other routes.


— Authored by egg

@james-in-a-box

This comment has been minimized.

@james-in-a-box

This comment has been minimized.

@james-in-a-box james-in-a-box bot dismissed their stale review February 11, 2026 20:15

Superseded by new review

@james-in-a-box

This comment has been minimized.

@james-in-a-box

This comment has been minimized.

Copy link
Contributor

@james-in-a-box james-in-a-box bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No agent-mode design concerns with the delta since my last review.

The changes since commit 930187936f02a8f123bc5f63ab782be3fd7f9f1e include a merge from main and security fixes. From an agent-mode design perspective, these changes actually strengthen alignment with the design guidelines:

Good pattern: Infrastructure-level constraints
The security fixes follow the core principle "The sandbox is the constraint":

  • Container ID validation in docker_client.py prevents injection at the API layer
  • Pipeline ID validation in state_store.py prevents path traversal at the persistence layer
  • Optimistic locking with version field handles concurrency at the state management layer

These are infrastructure-level constraints, not prompt-level restrictions the agent could ignore.

No changes to agent prompts or context patterns
The orchestrator continues to pass only lightweight metadata to spawned containers:

  • Pipeline ID, issue number, agent role (orienting context)
  • Session tokens, proxy URLs (infrastructure credentials)
  • No pre-fetched diffs, logs, or file contents

The agents remain free to fetch what they need through their tools.

— Authored by egg

Copy link
Contributor

@james-in-a-box james-in-a-box bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re-review: Security and Race Condition Fixes (Delta since 9301879)

I've reviewed the changes since commit 9301879. The commit f1254c1 addresses several issues from my previous review.


✅ Issues Properly Addressed

1. Container ID Validation (docker_client.py)

The validation is well-implemented:

  • Pattern ^[a-fA-F0-9]{12,64}$|^[a-zA-Z0-9][a-zA-Z0-9_.-]*$ correctly validates both hex IDs (12-64 chars) and container names
  • Validation is applied to start_container, stop_container, remove_container, get_container_info, get_container_logs, and wait_for_container
  • New InvalidContainerIdError exception properly exposed

2. Path Traversal Prevention (state_store.py)

Properly implemented:

  • Pattern ^issue-[0-9]+$ restricts pipeline IDs to expected format
  • Secondary is_relative_to() check ensures resolved path stays within pipelines_dir
  • New InvalidPipelineIdError exception

3. Optimistic Locking (state_store.py, routes/phases.py)

Implemented correctly:

  • version field added to Pipeline model with default=1
  • save_pipeline() accepts expected_version parameter
  • VersionConflictError raised on version mismatch
  • All phase mutation endpoints (advance_phase, start_phase, complete_phase, fail_phase) use optimistic locking

4. Route Error Handling

New exceptions properly caught in routes:

  • routes/containers.py: Handles InvalidContainerIdError (400)
  • routes/pipelines.py: Handles InvalidPipelineIdError (400)
  • routes/phases.py: Handles InvalidPipelineIdError (400) and VersionConflictError (409)
  • routes/decisions.py: Handles InvalidPipelineIdError (400)

🟠 Remaining Issues

1. Missing InvalidContainerIdError Handling in Health Check Endpoint

Location: orchestrator/routes/containers.py:426-447

The /health endpoint calls monitor.check_container_health(container_id), which in turn calls docker_client.get_container_info(container_id). This now raises InvalidContainerIdError, but the route doesn't catch it:

@containers_bp.route("/<pipeline_id>/containers/<container_id>/health", methods=["GET"])
def check_container_health(pipeline_id: str, container_id: str) -> tuple[Response, int]:
    monitor = get_container_monitor()
    health = monitor.check_container_health(container_id)  # Can raise InvalidContainerIdError
    return make_success_response("Health checked", data=health)

Fix: Add the same InvalidContainerIdError handler as in other endpoints, or catch it in ContainerMonitor.check_container_health().

2. No Tests for Security Validation Functions

The new validation functions and exceptions (_validate_container_id, _validate_pipeline_id, InvalidContainerIdError, InvalidPipelineIdError, VersionConflictError) have no corresponding tests. The test files (test_docker_client.py, test_state_store.py) don't test:

  • Invalid container ID formats being rejected
  • Path traversal attempts being blocked
  • Optimistic locking version conflicts
  • Edge cases in the regex patterns

This is a test coverage gap for security-critical code.

3. Container Spawner Race Condition Not Addressed

From my original review (issue #5): The race condition in container spawning (container_spawner.py) was noted but not fixed. The response indicated this would be addressed separately, but it's still a concern for concurrent spawn requests.


📋 Summary

Item Status
Container ID injection fix ✅ Fixed
Path traversal fix ✅ Fixed
Optimistic locking ✅ Fixed
Route exception handling ⚠️ Missing in health check
Test coverage for security fixes ❌ Missing
Container spawner race condition ❌ Not addressed (noted for follow-up)

Recommendation: The health check endpoint bug should be fixed before merge. The test coverage gap is concerning for security code but can be addressed in a follow-up. The container spawner race condition was acknowledged as a separate item.

— Authored by egg

@james-in-a-box

This comment has been minimized.

@james-in-a-box

This comment has been minimized.

james-in-a-box[bot] added 5 commits February 11, 2026 12:25
Copy link
Contributor

@james-in-a-box james-in-a-box bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Contract Verification Re-Review

This is an incremental contract verification for changes since commit a841c27f8b0269e8fa14777df389b7ccfba3b396 (the merge from main).

Delta Since Last Review

One commit has been added:

0945d0c — "Guard start_pipeline against AWAITING_HUMAN and fix worktree cleanup safety"

This commit makes three targeted improvements:

  1. orchestrator/routes/pipelines.py: Added guard against starting pipelines in AWAITING_HUMAN state, returning 409 Conflict. This prevents race conditions where the orchestrator might try to start a pipeline while waiting for human approval at a HITL gate.

  2. sandbox/entrypoint.py: Added working tree validation logic that detects empty worktrees (containing only .git) and re-populates them via git checkout HEAD -- .. This fixes a startup failure mode where containers could start with empty workspaces.

  3. gateway/gateway.py: Added safety net to query Docker containers directly (get_active_docker_containers()) during startup cleanup, ensuring orphaned worktrees are cleaned even when session state is stale.

Contract Compliance Verification

No contract violations found in the delta. The changes are defensive bug fixes that improve resilience without changing the contract scope.

Change Impact on Contract
AWAITING_HUMAN guard Improves ac-12 (phase transition validation)
Worktree re-population Improves sandbox reliability (no AC change)
Docker container safety net Improves ac-8 (orphaned container cleanup)

Acceptance Criteria Status Summary

Based on previous contract reviews and my verification:

Status Count Notes
Verified 23 Core orchestrator functionality implemented
NOT VERIFIED 10 Missing files/features documented below

NOT VERIFIED Criteria (unchanged from previous review)

These acceptance criteria remain unverified due to missing implementations:

  1. ac-17 (HITL sandbox mode): File sandbox/egg_lib/hitl_mode.py does not exist
  2. ac-21 (Gateway orchestrator callback): File gateway/orchestrator_api.py does not exist
  3. ac-24 (Gateway health includes orchestrator status): No gateway health modification detected
  4. ac-25 (egg-deploy status shows both services): Partial — needs verification
  5. ac-27 (sandbox orchestrator_client.py): File sandbox/egg_lib/orchestrator_client.py does not exist
  6. ac-28 (Sandbox reports completion to orchestrator): Not implemented in sandbox entrypoint
  7. ac-29 (egg_orchestrator package in containers): Partial — needs installation verification
  8. ac-32 (Remote orchestrator mode handling): Partial — needs verification
  9. ac-33 (Architecture docs for deployment modes): Documentation missing

Recommendation

Comment — The delta since last review contains only defensive bug fixes with no contract violations. The same 10 acceptance criteria that were NOT VERIFIED in the previous review remain unverified.

The implementation of core orchestrator functionality is substantial and working (23/33 criteria met). The missing items are primarily:

  • Sandbox-side orchestrator integration (hitl_mode.py, orchestrator_client.py)
  • Gateway-orchestrator callback endpoint
  • Documentation for deployment modes

These could be addressed as follow-up work if desired.

— Authored by egg

@james-in-a-box
Copy link
Contributor

egg contract-verification completed. View run logs

21 previous review(s) hidden.

james-in-a-box[bot]
james-in-a-box bot previously approved these changes Feb 12, 2026
Copy link
Contributor

@james-in-a-box james-in-a-box bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re-Review: PR #524 - Egg Orchestrator

This re-review examines the delta since commit a841c27f8b0269e8fa14777df389b7ccfba3b396.


✅ New Changes Reviewed

1. Guard start_pipeline Against AWAITING_HUMAN Status

Location: orchestrator/routes/pipelines.py:1931-1935

if pipeline.status == PipelineStatus.AWAITING_HUMAN:
    return make_error_response(
        f"Pipeline {pipeline_id} is awaiting human approval",
        status_code=409,
    )

Correct implementation. This prevents the race condition where an external caller could try to restart a pipeline while it's blocked waiting for human approval at a HITL gate. The pipeline remains in AWAITING_HUMAN status until the decision is resolved, then transitions back to RUNNING internally (lines 1860-1861).

2. Worktree Cleanup Safety

The worktree cleanup safety is properly ensured by validate_identifier() in gateway/worktree_manager.py:72-91:

def validate_identifier(value: str, name: str) -> None:
    if not value:
        raise ValueError(f"Invalid {name}: cannot be empty")
    if ".." in value:
        raise ValueError(f"Invalid {name}: path traversal not allowed")
    if not re.match(r"^[a-zA-Z0-9][a-zA-Z0-9._-]*$", value):
        raise ValueError(f"Invalid {name}: must be alphanumeric with ._- allowed")

This validation is called in remove_worktree() (lines 430-434) and create_worktree() (lines 144-148) before any filesystem operations, preventing path traversal attacks via container_id or repo_name.


✅ Previously Identified Issues: Verified Fixed

All blocking issues from prior reviews remain properly addressed:

Item Status Location
Container ID validation ✅ Fixed docker_client.py:97-121
Pipeline ID validation ✅ Fixed state_store.py:67-89
Optimistic locking ✅ Fixed state_store.py, routes/phases.py
Exception handling in routes ✅ Fixed All route files
.env file permissions ✅ Fixed compose.py:159-160
Gateway API endpoint paths ✅ Fixed gateway_client.py
Container spawner environment ✅ Fixed Pre-registers session at lines 202-231
delete_session_by_container ✅ Fixed Uses DELETE /api/v1/sessions/by-container/{id}

⚠️ Minor Advisory (Not Blocking)

Test Coverage for AWAITING_HUMAN Guard

The new AWAITING_HUMAN guard at line 1931 does not have an explicit integration test. While the existing TestHITLGate tests cover the HITL gate flow and TestPipelineStartIdempotency covers re-starting running/completed pipelines, there's no test that specifically verifies:

  1. Creating a pipeline
  2. Starting it
  3. Letting it reach AWAITING_HUMAN status
  4. Attempting to start it again via API
  5. Asserting 409 is returned

This would be a straightforward addition to the existing integration test suite but is not blocking.


Summary

Item Status
start_pipeline AWAITING_HUMAN guard ✅ Correctly implemented
Worktree cleanup identifier validation ✅ Correctly implemented
Previous blocking issues ✅ All fixed
Test coverage for AWAITING_HUMAN guard ⚠️ Minor gap (not blocking)

Recommendation: Approve. The new changes correctly guard against starting pipelines in AWAITING_HUMAN state and the worktree cleanup safety is properly enforced via identifier validation. All blocking issues from previous reviews remain properly fixed.


— Authored by egg

@james-in-a-box
Copy link
Contributor

egg review completed. View run logs

21 previous review(s) hidden.

Docker creates the container bind mount before the gateway creates the
git worktree. Docker's tmpfs mount over .git creates a .git DIRECTORY
on the host. When create_worktree() later runs git worktree add, it
fails because .git is a directory (git worktrees need .git to be a
FILE containing a gitdir: pointer).

After the existing shutil.rmtree (which silently fails on active mount
points), explicitly check for and remove a leftover .git directory so
git worktree add can create its .git file.
james-in-a-box[bot]
james-in-a-box bot previously approved these changes Feb 12, 2026
Copy link
Contributor

@james-in-a-box james-in-a-box bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No agent-mode design concerns.

Re-review summary: The branch was rebased since my last review at commit a1c45dd. The new commit dba968c adds a worktree creation fix (removing pre-existing .git directories that Docker creates before the gateway can set up the git worktree).

This change is purely infrastructure-level—it fixes container mount timing, not agent interaction patterns. My previous assessment still applies: the orchestrator passes lightweight metadata (pipeline ID, issue number, agent role) to orient agents without constraining them. Agents fetch their own data through tools rather than receiving pre-baked content.

— Authored by egg

@james-in-a-box
Copy link
Contributor

egg agent-mode-design completed. View run logs

21 previous review(s) hidden.

Verify that POST /start returns 409 when a pipeline is in
awaiting_human status (paused at a HITL gate). The test creates a
pipeline with hitl_gates=True, waits for it to reach awaiting_human,
attempts a second start (expects 409), then resolves the gates so the
pipeline completes cleanly.
Copy link
Contributor

@james-in-a-box james-in-a-box bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re-Review: PR #524 - Egg Orchestrator

This re-review examines changes since commit 0945d0cc9421685f578be4fc91629014700262de.


✅ All Previously Identified Issues Resolved

1. Container Spawner Gateway Environment Bug

Fixed correctly. The spawn_agent_container() method now registers the gateway session and merges gateway environment variables BEFORE container creation (lines 202-225 in container_spawner.py):

# Lines 215-225: Gateway env added BEFORE container creation
gateway_env = self.gateway.get_container_env(
    session_token=session_info.session_token,
    ...
)
# Add gateway environment to container env BEFORE creation
env.update(gateway_env)

The container is then created at line 243 with the complete environment including EGG_SESSION_TOKEN, GATEWAY_URL, and proxy configuration.

2. signals.py InvalidPipelineIdError Handling

Fixed correctly. InvalidPipelineIdError is imported at line 37 and properly handled in both handle_complete_signal (lines 194-198) and handle_error_signal (lines 301-305).

3. containers.py Health Check InvalidContainerIdError Handling

Fixed correctly. The check_container_health endpoint (lines 426-454) now catches InvalidContainerIdError and returns 400.

4. compose.py .env File Permissions

Fixed correctly. Line 160 sets 0o600 permissions on the generated .env file:

env_file.chmod(0o600)  # Owner read/write only

🔵 New Code Review: gateway/worktree_manager.py

This commit adds a worktree manager that handles the Docker/.git directory race condition. The implementation is well-structured:

Security

Path traversal prevention is properly implemented:

  • validate_identifier() function (lines 79-90) validates container_id and repo_name
  • Rejects empty values, path traversal (../), and special characters
  • Pattern: ^[a-zA-Z0-9][a-zA-Z0-9._-]*$
  • Applied consistently before path construction

UID/GID validation:

  • Lines 157-160 validate uid/gid are non-negative integers before use in subprocess
  • Prevents injection via the chown command

Correctness

The Docker .git directory race condition fix:

  • Lines 195-221 detect and handle invalid worktree directories
  • If a .git directory exists (from Docker's tmpfs), it's removed before creating the actual .git file
  • This correctly addresses the issue described in the commit message

Test Coverage

Comprehensive tests in test_worktree_manager.py:

  • TestValidateIdentifier: Path traversal, empty values, special characters
  • TestWorktreeManagerDockerGitDir: Specifically tests the Docker pre-creation scenario
  • Cleanup behavior tests

Summary

Issue Status
Container spawner gateway env ✅ Fixed
signals.py InvalidPipelineIdError ✅ Fixed
containers.py health check ✅ Fixed
compose.py .env permissions ✅ Fixed
worktree_manager.py security ✅ Properly implemented
worktree_manager.py tests ✅ Adequate coverage

All previously identified blocking issues have been properly addressed. The new worktree manager code is well-implemented with appropriate security validation and test coverage.

Recommendation: Approve.


— Authored by egg

Copy link
Contributor

@james-in-a-box james-in-a-box bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Contract Verification Re-Review

This is an incremental contract verification for changes since commit 0945d0cc9421685f578be4fc91629014700262de.

Delta Since Last Review

One commit has been added:

dba968c — "Fix worktree creation when Docker pre-creates .git directory"

This commit addresses an edge case where Docker creates the container bind mount (including a .git directory via tmpfs) before the gateway creates the git worktree. Since git worktree add expects .git to be a file (containing a gitdir: pointer), not a directory, the worktree creation was failing.

Changes:

  • gateway/worktree_manager.py: +7 lines — After the existing shutil.rmtree() (which silently fails on active mount points), explicitly check for and remove a leftover .git directory
  • gateway/tests/test_worktree_manager.py: +68 lines — Two new tests covering the bug fix and normal case scenarios

Contract Compliance Verification

No contract violations found. The change is a defensive bug fix that improves worktree creation reliability without changing any contract scope.

Change Impact on Contract
.git directory cleanup Improves reliability for ac-7 (sandbox configuration)
Test coverage Regression tests for worktree edge case

Code Review Notes

The fix is correct and targeted:

  1. Only runs when worktree_path.exists() after rmtree (indicating bind mount blocked full removal)
  2. Only removes .git if it exists AND is a directory (won't interfere with valid worktrees)
  3. Uses ignore_errors=True for safety
  4. Tests verify both the fix scenario and normal operation

Acceptance Criteria Status Summary

Status Count Notes
Verified 23 Core orchestrator functionality (unchanged)
NOT VERIFIED 10 Missing files/features (unchanged from previous review)

NOT VERIFIED Criteria (unchanged)

These acceptance criteria remain unverified due to missing implementations:

  1. ac-17 — HITL sandbox mode: sandbox/egg_lib/hitl_mode.py does not exist
  2. ac-21 — Gateway orchestrator callback: gateway/orchestrator_api.py does not exist
  3. ac-24 — Gateway health includes orchestrator status: Not implemented
  4. ac-25 — egg-deploy status shows both services: Partial
  5. ac-27 — sandbox orchestrator_client.py: File does not exist
  6. ac-28 — Sandbox reports completion to orchestrator: Not implemented
  7. ac-29 — egg_orchestrator package in containers: Needs verification
  8. ac-32 — Remote orchestrator mode handling: Partial
  9. ac-33 — Architecture docs for deployment modes: Documentation missing

Recommendation

Comment — The delta contains only a targeted bug fix with proper test coverage and no contract violations. The same 10 acceptance criteria that were NOT VERIFIED in the previous review remain unverified. The core orchestrator implementation (23/33 criteria) is working as designed.

— Authored by egg

@james-in-a-box
Copy link
Contributor

egg review completed. View run logs

21 previous review(s) hidden.

@james-in-a-box
Copy link
Contributor

egg contract-verification completed. View run logs

21 previous review(s) hidden.

@jwbron jwbron merged commit d1d4440 into main Feb 12, 2026
33 checks passed
@james-in-a-box james-in-a-box bot deleted the egg/issue-496 branch February 12, 2026 07:30
Copy link
Contributor

@james-in-a-box james-in-a-box bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No agent-mode design concerns with the delta since my last review.

The changes since commit 885e674 introduce several new features:

  1. Multi-reviewer architecture: The pipeline now spawns typed reviewer containers (unified, agent-design, code, contract) for each phase. This aligns well with agent-mode principles — each reviewer gets a focused objective and posts verdicts directly to .egg-state/reviews/.

  2. Review prompts: The _build_review_prompt() function provides context (phase, pipeline ID, reviewer type) without pre-fetching large diffs. Reviewers are told to read draft files themselves, which aligns with guideline #1 (avoid unnecessary pre-fetching).

  3. Checker/autofixer agents: The implement phase now spawns checker and autofixer agents. The autofixer receives a lightweight failure summary (~3-4 lines) but is instructed to read the full results file itself. This is appropriate orienting context, not excessive pre-fetching.

  4. Local pipelines include PR phase: Local-mode pipelines now run all 4 phases (including PR). The gateway mode is switched to public for the PR phase only, enabling push access. This is enforced at the session level in the gateway, not via prompt instructions — correctly following guideline #5 (sandbox enforcement over prompt-level security).

  5. JSON verdict format for reviewers: Reviewers write a small JSON verdict file ({verdict, summary, feedback}). This is appropriate because the verdict is consumed by the orchestrator (machine-readable), not directly by humans. The actual review comments should still be posted in natural language to the PR.

  6. HITL gates: The hitl_gates config option pauses for human approval after refine and plan phases. Decisions are surfaced via the status endpoint, keeping humans in the loop without constraining agent exploration.

The integration tests demonstrate the architecture working end-to-end with mock sandbox containers that write appropriate artifacts.

— Authored by egg

@james-in-a-box
Copy link
Contributor

egg agent-mode-design completed. View run logs

21 previous review(s) hidden.

Copy link
Contributor

@james-in-a-box james-in-a-box bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Incremental Contract Verification Review

Delta reviewed: dba968c..35147ac (1 commit)

New Commit Analysis

35147ac Add integration test for AWAITING_HUMAN start_pipeline guard

The new commit adds a comprehensive integration test that verifies the AWAITING_HUMAN guard in start_pipeline(). The test:

  1. Creates a pipeline with hitl_gates=True
  2. Starts the pipeline and waits for awaiting_human status
  3. Attempts to start again and verifies 409 response
  4. Verifies error message contains "awaiting human"
  5. Resolves gates and waits for clean completion

This test strengthens contract compliance for acceptance criteria related to HITL integration (ac-16 through ac-20), specifically validating the interaction between HITL gates and pipeline state management.

Contract Verification Status

The 33 acceptance criteria remain unverified in the contract file. This is expected — my role as REVIEWER is to verify criteria and mark them verified using egg-contract verify-criterion. However, the egg-contract CLI has an import error in this environment:

ImportError: attempted relative import with no known parent package

Implementation Assessment

Based on code review of the delta:

  • No regressions identified — The new test is additive and doesn't modify any existing logic
  • Test quality is high — Uses proper fixtures, handles cleanup, validates both status codes and error messages
  • No security concerns — The test file contains no vulnerabilities

Recommendation

The incremental change is correct and improves test coverage. Once the egg-contract CLI is functional, the following criteria should be verified based on the existing implementation:

Criterion Status Evidence
ac-16 Ready to verify routes/decisions.py implements POST/GET endpoints
ac-18 Ready to verify routes/decisions.py implements resolve endpoint
ac-4 Ready to verify routes/health.py, routes/pipelines.py exist

The PR continues to make solid progress toward full contract compliance.

— Authored by egg

@james-in-a-box
Copy link
Contributor

egg contract-verification completed. View run logs

21 previous review(s) hidden.

github-actions bot pushed a commit that referenced this pull request Feb 12, 2026
Update documentation to reflect egg-orchestrator component added in d1d4440:
- Add orchestrator/ directory to STRUCTURE.md with detailed file listing
- Add orchestrator to architecture component table
- Update deployment guide network topology diagram
- Add orchestrator to component documentation index
- Include new local_pipeline integration tests

Triggered by: PR #524 (d1d4440)

Authored-by: egg
jwbron added a commit that referenced this pull request Feb 12, 2026
* docs: Add orchestrator to project structure docs

Update documentation to reflect egg-orchestrator component added in d1d4440:
- Add orchestrator/ directory to STRUCTURE.md with detailed file listing
- Add orchestrator to architecture component table
- Update deployment guide network topology diagram
- Add orchestrator to component documentation index
- Include new local_pipeline integration tests

Triggered by: PR #524 (d1d4440)

Authored-by: egg

* docs: Address review feedback on orchestrator documentation

- Add missing requirements.txt to orchestrator structure listing
- Update component docs header text for link consistency

Authored-by: egg

---------

Co-authored-by: jwbron <8340608+jwbron@users.noreply.github.com>
Co-authored-by: james-in-a-box[bot] <246424927+james-in-a-box[bot]@users.noreply.github.com>
james-in-a-box bot pushed a commit that referenced this pull request Feb 14, 2026
The SDLC pipeline has been fully migrated to the local distributed
orchestrator (PR #524). This removes the now-superseded GitHub Actions
implementation: 8 workflow files, 3 supporting scripts, the checks/
directory, 14 action prompt builders, and associated tests. All
documentation is updated to reference the orchestrator equivalents.

Issue: #545
Authored-by: egg
jwbron added a commit that referenced this pull request Feb 14, 2026
* Add analysis for issue #545: Remove GitHub Actions SDLC architecture

* Add implementation plan for issue #545: Remove GitHub Actions SDLC architecture

* Remove GitHub Actions SDLC workflows and supporting scripts

The SDLC pipeline has been fully migrated to the local distributed
orchestrator (PR #524). This removes the now-superseded GitHub Actions
implementation: 8 workflow files, 3 supporting scripts, the checks/
directory, 14 action prompt builders, and associated tests. All
documentation is updated to reference the orchestrator equivalents.

Issue: #545
Authored-by: egg

* Add implement phase check results

* Fix lint errors and stale test for invalid schema validation

- Fix 8 auto-fixable ruff errors: remove unused imports (F401),
  update typing imports to collections.abc (UP035), sort import
  block (I001), remove unused variable (F841)
- Fix 6 E741 ambiguous variable names: rename `l` to `line` in
  list comprehensions across SSE test files
- Fix test_load_invalid_schema_fails: Pipeline model fields became
  optional in #554, so {"id": "issue-9998"} is now valid. Updated
  test to use an invalid enum value for status instead.

Authored-by: egg

* Add implement phase check results for issue-545

* Add unified review verdict for issue-545 implement phase

* Add agent-design review verdict for issue-545 implement phase

* Add code review verdict for issue-545 implement phase

Reviewed security, correctness, robustness, and design across the v2
checkpoint system, session manager changes, workflow removals, and
orchestrator updates. No critical issues introduced by this diff.

Authored-by: egg

* Add contract review verdict for issue-545 implement phase

* Address review feedback: remove stale references and fix docstring

Remove dangling references to deleted files (sdlc-work-loop.yml,
build-sdlc-prompt.sh) from comments and docstrings. Fix redundant
phrasing in _populate_contract_from_plan docstring. PR description
updated separately to correct inaccurate claims about action/ and
config/ directory removal.

* Restore PR-operational checks, contract validator, and prompt builder

Restores files that were incorrectly removed as part of the SDLC cleanup:
- .github/scripts/checks/ directory (check_fixer, lint_check, test_check,
  merge_conflict_check, draft_validation_check, plan_yaml_check, base, run_check)
- .github/workflows/on-pull-request-contract-verify.yml
- action/build-contract-verification-prompt.sh
- tests/scripts/test_checks.py

Removes deployment_check from run_check.py registry (SDLC-only, correctly
deleted). Updates docs, README, and test-action.yml shellcheck to include
restored files.

* Remove stale check-deployment definition from phase defaults

The deployment_check.py script was deleted during merge conflict resolution
(it depended on deleted .github/scripts/checks/ base infrastructure from the
GHA SDLC removal). The CheckDefinition referencing it remained as a dangling
reference from PR #653. Remove it to avoid a potential ValueError if
run_check.py tries to dispatch an unknown 'deployment' check.

* Address review feedback: fix EDITOR handling, add subprocess timeouts, add reconnection backoff

- Fix _launch_editor to use shlex.split for multi-word $EDITOR values
  (e.g. "code --wait", "vim +10")
- Add timeout=30 to all subprocess.run calls in _commit_statefiles_to_worktree
  to prevent indefinite hangs on git lock contention
- Add exponential backoff (1s-30s) and max retry limit (20) to
  watch_pipeline reconnection loop to prevent resource exhaustion
- Use consistent word-boundary regex matching in _detect_phase for all
  phase keywords instead of mixing substring and regex strategies
- Narrow _find_repo_path exception handler from bare Exception to
  specific subprocess/OS error types

Authored-by: egg

---------

Co-authored-by: egg <egg@localhost>
Co-authored-by: egg-reviewer[bot] <261018737+egg-reviewer[bot]@users.noreply.github.com>
Co-authored-by: jwbron <8340608+jwbron@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

sdlc:pr SDLC pipeline: PR in review

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Create a dedicated orchestration system for the sdlc pipeline

1 participant