
refactor: saga-based validation for per-hook commits and observable progress #67

@rorybyrne


Context

ValidationService.run_hooks() currently executes all hooks sequentially within a single unit-of-work (UOW) scope. Every run_repo.save() call writes to the same session, which commits only once, at scope exit. This means:

  1. No observable progress: Intermediate state transitions (PENDING → RUNNING, per-hook results) are invisible to external observers until the entire pipeline completes
  2. All-or-nothing failure: If the worker crashes mid-pipeline, all progress is lost and the run stays PENDING with no partial results
  3. No granular commits: The intermediate save() calls document intent but commit nothing on their own; nothing is persisted until the scope exits

This isn't a bug: the current approach is functionally correct. But because hooks are expensive OCI containers (seconds to minutes each), observable progress matters for UX.

Design

Refactor validation into a saga / outbox-driven state machine where each hook execution is its own UOW:

State machine

ValidationRun gains a current_hook_index: int field tracking pipeline position.

Each handler invocation:

  1. Runs one hook
  2. Saves result + advances current_hook_index
  3. Emits either:
    • RunNextHook (self-loop to continue pipeline)
    • ValidationCompleted / ValidationFailed (terminal)
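
A minimal sketch of one handler invocation, to make the three steps concrete. Only ValidationRun, current_hook_index, RunNextHook, ValidationCompleted and ValidationFailed come from this issue; RunStatus, hook_names, results, execute_hook and the handler name are hypothetical. It also assumes that a failing hook ends the pipeline with ValidationFailed, which this issue does not specify.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable, Union


class RunStatus(Enum):  # hypothetical status enum
    PENDING = "pending"
    RUNNING = "running"
    COMPLETED = "completed"
    FAILED = "failed"


@dataclass
class ValidationRun:
    hook_names: list[str]  # hypothetical: ordered hooks for this run
    status: RunStatus = RunStatus.PENDING
    current_hook_index: int = 0  # pipeline position, as proposed above
    results: list[bool] = field(default_factory=list)


@dataclass(frozen=True)
class RunNextHook:
    hook_index: int


@dataclass(frozen=True)
class ValidationCompleted:
    pass


@dataclass(frozen=True)
class ValidationFailed:
    failed_hook: str


Event = Union[RunNextHook, ValidationCompleted, ValidationFailed]


def handle_run_next_hook(
    run: ValidationRun, execute_hook: Callable[[str], bool]
) -> Event:
    """Run exactly one hook and return the event to emit next.

    The caller is expected to save `run` and enqueue the returned event
    inside the same unit of work, so both commit together.
    """
    if run.current_hook_index >= len(run.hook_names):
        # Zero-hooks case (or nothing left to run): instant pass.
        run.status = RunStatus.COMPLETED
        return ValidationCompleted()

    run.status = RunStatus.RUNNING
    name = run.hook_names[run.current_hook_index]
    passed = execute_hook(name)  # 1. run one hook

    run.results.append(passed)  # 2. save result + advance position
    run.current_hook_index += 1

    if not passed:  # 3. emit terminal or self-loop event
        run.status = RunStatus.FAILED  # assumption: first failure is terminal
        return ValidationFailed(failed_hook=name)
    if run.current_hook_index < len(run.hook_names):
        return RunNextHook(hook_index=run.current_hook_index)
    run.status = RunStatus.COMPLETED
    return ValidationCompleted()
```

Keeping the handler a pure function of the run plus a hook executor means the state machine can be unit-tested without containers or a database.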

Event flow

DepositionSubmitted
  → ValidateDeposition (creates run, emits RunNextHook for hook 0)
    → RunNextHook (runs hook 0, saves result, emits RunNextHook for hook 1)
      → RunNextHook (runs hook 1, saves result, emits ValidationCompleted)
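
The flow above can be simulated with an in-memory queue standing in for the outbox. This is an illustrative sketch only: simulate_flow, the tuple-based event encoding and the dict-based run state are hypothetical, and the real WorkerPool dispatch is not modelled.

```python
from collections import deque


def simulate_flow(hooks):
    """Drain a queue that stands in for the outbox.

    `hooks` is a list of zero-argument callables returning True on pass.
    Returns the names of the processed events, in order.
    """
    trace = []
    run = {"current_hook_index": 0, "results": []}
    queue = deque([("DepositionSubmitted", None)])

    while queue:
        name, index = queue.popleft()
        trace.append(name if index is None else f"{name}({index})")

        if name == "DepositionSubmitted":
            queue.append(("ValidateDeposition", None))
        elif name == "ValidateDeposition":
            # Creates the run; with zero hooks it passes immediately.
            if hooks:
                queue.append(("RunNextHook", 0))
            else:
                queue.append(("ValidationCompleted", None))
        elif name == "RunNextHook":
            passed = hooks[index]()
            run["results"].append(passed)
            run["current_hook_index"] = index + 1
            if not passed:
                # Assumption: the first failing hook ends the run.
                queue.append(("ValidationFailed", None))
            elif index + 1 < len(hooks):
                queue.append(("RunNextHook", index + 1))
            else:
                queue.append(("ValidationCompleted", None))
        # ValidationCompleted / ValidationFailed are terminal: nothing enqueued.

    return trace
```

With two passing hooks, simulate_flow([lambda: True, lambda: True]) returns DepositionSubmitted, ValidateDeposition, RunNextHook(0), RunNextHook(1), ValidationCompleted, matching the diagram.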

Benefits

  • Each step commits independently → progress visible to UI polling
  • Worker crash mid-hook → the stale claim mechanism retries the event, resuming from the last committed state
  • Natural fit with existing outbox + worker infrastructure
  • Overhead of extra DB round-trip per hook is negligible vs OCI container execution time
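
To illustrate the crash-recovery benefit, here is a rough sketch with an in-memory store standing in for the database. InMemoryStore, WorkerCrash and run_one_hook are hypothetical names, not part of the codebase; the stale claim retry itself is reduced to simply calling the handler again.

```python
import copy


class InMemoryStore:
    """Stand-in for the database: only commit() makes state durable."""

    def __init__(self, state):
        self._committed = copy.deepcopy(state)

    def load(self):
        return copy.deepcopy(self._committed)

    def commit(self, state):
        self._committed = copy.deepcopy(state)


class WorkerCrash(Exception):
    """Simulates the worker dying while a hook is executing."""


def run_one_hook(store, hooks):
    """One unit of work: load, run the current hook, commit.

    Returns True if more hooks remain. If the hook raises, nothing is
    committed, so a retry starts again from the same hook index.
    """
    state = store.load()
    index = state["current_hook_index"]
    result = hooks[index]()  # may raise WorkerCrash before any commit
    state["results"].append(result)
    state["current_hook_index"] = index + 1
    store.commit(state)
    return state["current_hook_index"] < len(hooks)
```

If the second hook crashes, the store still holds the first hook's result and current_hook_index == 1, so the retry re-runs only the second hook. That hook may therefore execute more than once, so this design implicitly relies on hooks being safe to re-run.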

Trade-offs

  • More events and handler invocations per validation run (one per hook instead of one total)
  • ValidationRun aggregate becomes slightly more complex (tracks position)
  • Need to handle the zero-hooks case (instant pass, same as today)

References

  • osa/domain/validation/service/validation.py — current run_hooks() implementation
  • osa/infrastructure/event/worker.py — WorkerPool / stale claim mechanism

Labels

  • design-needed: Needs architectural discussion before implementation
  • refactor: Internal restructuring, no behavior change
