feat(coordinator): add distributed coordination system for multi-instance Fess #3101

Merged
marevol merged 3 commits into master from feat/coordinator-helper on Mar 28, 2026

marevol commented Mar 27, 2026

Summary

Add CoordinatorHelper to enable inter-instance coordination via OpenSearch, preventing concurrent execution of maintenance operations across multiple Fess instances connected to the same cluster.

Changes Made

  • CoordinatorHelper (new): Core helper with heartbeat, operation state management, and event system
    • Instance heartbeat registration with TTL-based liveness detection
    • Distributed mutex using OpenSearch op_type=create for atomic lock acquisition
    • Ownership-verified lock release with if_seq_no/if_primary_term optimistic concurrency
    • Event publishing/consumption for inter-instance notifications
    • Periodic polling loop (60s) for heartbeat updates, event consumption, and expired document cleanup
  • SystemHelper.getInstanceId(): Process-unique instance ID (hostname + PID)
  • AdminMaintenanceAction: Lock control for reindex, config rebuild, reload, and crawler index clear
  • fess_config.coordinator index: Single-shard OpenSearch index for atomicity guarantee
  • Error messages: errors.operation_already_running in 17 languages
  • Configuration: coordinator.poll.interval, coordinator.heartbeat.ttl, coordinator.operation.ttl, coordinator.event.ttl
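
The lock acquisition and ownership-verified release listed above correspond to two OpenSearch document requests. The following is a minimal sketch of how those requests could be shaped, not the code in this PR: it builds them with the JDK's `java.net.http` API instead of Fess's own HTTP helper, and the host and port, the `operation.reindex` document ID, and the `instanceId` body field are hypothetical.

```java
import java.net.URI;
import java.net.http.HttpRequest;

class CoordinatorLockSketch {
    // Index name is from the PR description; host and port are assumptions.
    static final String DOC_URL = "http://localhost:9200/fess_config.coordinator/_doc/";

    // Acquire: op_type=create rejects the write if the document already exists,
    // so only one instance can create the lock document.
    static HttpRequest acquire(String operationId, String instanceId) {
        String body = "{\"instanceId\":\"" + instanceId + "\"}"; // hypothetical field name
        return HttpRequest.newBuilder(URI.create(DOC_URL + operationId + "?op_type=create"))
                .header("Content-Type", "application/json")
                .PUT(HttpRequest.BodyPublishers.ofString(body))
                .build();
    }

    // Release: the delete succeeds only if the document still has the sequence
    // number and primary term that the owner observed when it last read the lock.
    static HttpRequest release(String operationId, long seqNo, long primaryTerm) {
        String query = "?if_seq_no=" + seqNo + "&if_primary_term=" + primaryTerm;
        return HttpRequest.newBuilder(URI.create(DOC_URL + operationId + query))
                .DELETE()
                .build();
    }

    // OpenSearch answers 201 when the document was created and 409 on a conflict.
    static boolean lockAcquired(int httpStatus) {
        return httpStatus == 201;
    }

    public static void main(String[] args) {
        HttpRequest create = acquire("operation.reindex", "host-a@4242");
        HttpRequest delete = release("operation.reindex", 7L, 1L);
        System.out.println(create.method() + " " + create.uri());
        System.out.println(delete.method() + " " + delete.uri());
    }
}
```

A 409 response to the create request is the case that the new `errors.operation_already_running` message would report to the admin user.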

Testing

  • 134 unit tests covering CoordinatorHelper (JSON utils, data classes, event handlers, document structure parsing), SystemHelper.getInstanceId(), and AdminMaintenanceAction
  • Codex review gate passed (architecture + diff reviews, 6 blocking issues found and fixed)

Breaking Changes

  • None. New feature addition only.

Additional Notes

  • The coordinator index uses number_of_shards: 1 to guarantee op_type=create atomicity on the same primary shard
  • Future extensibility: config change notifications, dictionary update propagation, admin UI instance monitoring

…ance Fess

Add CoordinatorHelper to enable inter-instance coordination via OpenSearch,
preventing concurrent execution of maintenance operations (reindex, config
rebuild, etc.) across multiple Fess instances connected to the same cluster.

Key features:
- Instance heartbeat registration with TTL-based liveness detection
- Operation state management using op_type=create for distributed mutex
- Event publishing/consumption for inter-instance notifications
- Periodic polling loop (60s) for heartbeat, event consumption, and cleanup
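
TTL-based liveness detection reduces to comparing the age of each instance's last heartbeat against the configured TTL. The sketch below illustrates that check in isolation; the class name, the in-memory map standing in for heartbeat documents, and the 180-second TTL are all assumptions rather than values taken from this PR.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class HeartbeatLivenessSketch {
    // An instance counts as live while its last heartbeat is no older than the TTL.
    static boolean isAlive(long lastHeartbeatMillis, long nowMillis, long ttlMillis) {
        return nowMillis - lastHeartbeatMillis <= ttlMillis;
    }

    // Returns the IDs of live instances, sorted by ID for a stable result.
    static List<String> liveInstances(Map<String, Long> heartbeats, long nowMillis, long ttlMillis) {
        List<String> live = new ArrayList<>();
        for (Map.Entry<String, Long> entry : new TreeMap<>(heartbeats).entrySet()) {
            if (isAlive(entry.getValue(), nowMillis, ttlMillis)) {
                live.add(entry.getKey());
            }
        }
        return live;
    }

    public static void main(String[] args) {
        long now = 1_000_000L;
        long ttl = 180_000L; // hypothetical TTL: three 60s poll intervals
        Map<String, Long> heartbeats = Map.of(
                "node-a@101", now - 60_000L,   // refreshed one poll ago
                "node-b@202", now - 600_000L); // silent for ten minutes
        System.out.println(liveInstances(heartbeats, now, ttl)); // prints [node-a@101]
    }
}
```

Choosing a TTL that spans several poll intervals means one delayed heartbeat does not mark an instance as dead.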

Also adds:
- SystemHelper.getInstanceId() with hostname+PID for process uniqueness
- fess_config.coordinator index with single-shard atomicity guarantee
- Lock control in AdminMaintenanceAction for all maintenance operations
- Ownership-verified lock release with optimistic concurrency control
- Error messages in 17 languages
- Comprehensive unit tests (134 tests)
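
A process-unique ID built from hostname and PID can be derived with standard JDK calls. The following is a rough sketch of the idea, not the body of `SystemHelper.getInstanceId()`: the `@` separator and the `unknown` fallback for an unresolvable hostname are assumptions.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

class InstanceIdSketch {
    // Combines hostname and PID so two processes on the same host get distinct IDs.
    static String instanceId() {
        String host;
        try {
            host = InetAddress.getLocalHost().getHostName();
        } catch (UnknownHostException e) {
            host = "unknown"; // assumed fallback when the hostname cannot be resolved
        }
        long pid = ProcessHandle.current().pid();
        return host + "@" + pid; // separator is hypothetical
    }

    public static void main(String[] args) {
        System.out.println(instanceId());
    }
}
```

The ID changes when the process restarts, so a lock left behind by a crashed instance is not mistaken for one owned by its replacement.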

marevol self-assigned this Mar 27, 2026
marevol added this to the 15.6.0 milestone Mar 27, 2026
marevol added 2 commits March 28, 2026 08:48
- Add retry limit (coordinator.operation.retry) to prevent infinite
  recursion in tryCleanupAndRetry, configurable via fess_config.properties
- Unify lock release to use completeOperation in all code paths for
  idempotent and safe double-call behavior
- Add refresh=true to sendHeartbeat for immediate search visibility
- Use typed FessConfig accessors instead of raw string key access
- Fix event time tracking to avoid same-millisecond event loss
- Remove unused createdBy field from coordinator index mapping
- Add Javadoc to all public/protected methods and data classes
- Add 23 unit tests covering retry logic, lock safety, poll loop,
  config accessors, and event time advancement
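
The retry limit and the idempotent release from the commit above can be illustrated with an in-memory stand-in for the coordinator index. This is a sketch under stated assumptions, not the PR's implementation: it uses a loop where the commit message describes bounding recursion in `tryCleanupAndRetry`, a `ConcurrentHashMap` replaces OpenSearch, and all class, method, and field names are hypothetical.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

class BoundedRetrySketch {
    // Stand-in for a lock document: who owns it and when it expires.
    static final class Lock {
        final String owner;
        final long expiresAt;
        Lock(String owner, long expiresAt) {
            this.owner = owner;
            this.expiresAt = expiresAt;
        }
    }

    final ConcurrentMap<String, Lock> locks = new ConcurrentHashMap<>();
    final int maxRetries; // plays the role of coordinator.operation.retry

    BoundedRetrySketch(int maxRetries) {
        this.maxRetries = maxRetries;
    }

    // Tries to take the lock; an expired lock is cleaned up and the attempt
    // repeated, but never more than maxRetries times.
    boolean tryStart(String operation, String owner, long now, long ttl) {
        for (int attempt = 0; attempt <= maxRetries; attempt++) {
            Lock existing = locks.putIfAbsent(operation, new Lock(owner, now + ttl));
            if (existing == null) {
                return true; // created the lock, analogous to op_type=create succeeding
            }
            if (existing.expiresAt > now) {
                return false; // held by an owner whose lock has not expired
            }
            locks.remove(operation, existing); // drop the expired lock only if unchanged
        }
        return false; // retry budget exhausted
    }

    // Releases the lock only if this owner holds it; a second call is a harmless no-op.
    boolean complete(String operation, String owner) {
        Lock current = locks.get(operation);
        if (current == null || !current.owner.equals(owner)) {
            return false;
        }
        return locks.remove(operation, current);
    }

    public static void main(String[] args) {
        BoundedRetrySketch coordinator = new BoundedRetrySketch(1);
        System.out.println(coordinator.tryStart("reindex", "a", 0L, 100L));   // true
        System.out.println(coordinator.tryStart("reindex", "b", 50L, 100L));  // false
        System.out.println(coordinator.tryStart("reindex", "b", 200L, 100L)); // true
        System.out.println(coordinator.complete("reindex", "a"));             // false
        System.out.println(coordinator.complete("reindex", "b"));             // true
        System.out.println(coordinator.complete("reindex", "b"));             // false
    }
}
```

With a retry limit of 0 the third call would clean up the expired lock but return false, leaving acquisition to a later attempt.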

Remove the failOperation method, which simply delegated to completeOperation,
reorder app.xml components alphabetically, and replace 22 test cases in
CoordinatorHelperTest that were not actually testing CoordinatorHelper
(they tested Java arithmetic or FessConfig constants, or duplicated logic)
with proper tests that use a mocked CurlHelper to verify actual method behavior.
marevol merged commit 95da178 into master Mar 28, 2026
1 check passed