fix: digest and prune stops permanently on database timeouts #1191
Root Cause:
- Every 1000 blocks, the digest extension reloads its config from the database table `digest_config`.
- If the database query failed (timeout, network issue), the error was ignored.
- The ignored error returned `enabled=false` by default.
- The scheduler detected a config change (true → false) and stopped permanently.
- The next reload also failed, but no change was detected (false == false), so the scheduler never restarted until the node was restarted.

Production Impact:
- Oct 4-5: the digest stopped with "tn_digest stopped due to config disabled".
- The database showed `enabled=true`, but the scheduler had actually stopped.
- No digest runs occurred until a manual node restart.

Solution:
1. Retry logic: 15 attempts with 1-minute backoff (configurable).
2. Fail-safe: when all retries fail, keep the current config (don't stop the scheduler).
3. Context-aware: respects context cancellation during retries.
4. Better logging: shows config changes, retry attempts, and errors.

resolves: trufnetwork/truf-network#1256
Bug Report Checklist
@MicBun, please use git blame and link the commit that introduced this bug.
Actionable comments posted: 1
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
- extensions/tn_digest/README.md (1 hunk)
- extensions/tn_digest/extension.go (3 hunks)
- extensions/tn_digest/leader_reload_test.go (3 hunks)
- extensions/tn_digest/tn_digest.go (4 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
extensions/tn_digest/tn_digest.go (1)
extensions/tn_digest/constants.go (2)
ExtensionName (4-4), DefaultDigestSchedule (5-5)
extensions/tn_digest/leader_reload_test.go (1)
extensions/tn_digest/internal/engine_ops.go (1)
NewEngineOperations (32-34)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: acceptance-test
@pr-time-tracker bug commit not caused by previous commit
Actionable comments posted: 1
🧹 Nitpick comments (1)
extensions/tn_digest/tn_digest.go (1)
79-107: Consider logging config parse failures.

The configuration parsing silently ignores errors from `fmt.Sscan` (lines 85, 92, 100). While failing safely to defaults is appropriate, operators would benefit from warning logs when config values fail to parse, helping them detect typos or invalid values in their TOML configuration.

Example for `reload_interval_blocks`:

```diff
 if v, ok2 := m["reload_interval_blocks"]; ok2 && v != "" {
 	var parsed int64
-	_, _ = fmt.Sscan(v, &parsed)
+	_, err := fmt.Sscan(v, &parsed)
 	if parsed > 0 {
 		ext.SetReloadIntervalBlocks(parsed)
+	} else if err != nil {
+		ext.Logger().Warn("failed to parse reload_interval_blocks, using default", "value", v, "error", err)
 	}
 }
```
📜 Review details
📒 Files selected for processing (3)
- extensions/tn_digest/extension.go (3 hunks)
- extensions/tn_digest/leader_reload_test.go (3 hunks)
- extensions/tn_digest/tn_digest.go (4 hunks)
🧰 Additional context used
🧬 Code graph analysis (3)
extensions/tn_digest/tn_digest.go (1)
extensions/tn_digest/constants.go (2)
ExtensionName (4-4), DefaultDigestSchedule (5-5)
extensions/tn_digest/extension.go (1)
extensions/tn_digest/constants.go (1)
DefaultDigestSchedule (5-5)
extensions/tn_digest/leader_reload_test.go (1)
extensions/tn_digest/internal/engine_ops.go (1)
NewEngineOperations (32-34)
🔇 Additional comments (14)
extensions/tn_digest/leader_reload_test.go (4)
42-57: LGTM! Effective transient failure simulation. The `fakeDB` enhancement correctly simulates transient database failures for testing the retry logic. The implementation is clear and allows tests to control when failures occur and when they succeed.

289-333: LGTM! Test validates transient failure recovery. The test correctly verifies that transient database failures are recovered through background retries, and the scheduler is successfully restarted with the new configuration. The assertions appropriately check both the schedule update and the retry attempt count.

335-374: LGTM! Critical fail-safe behavior validated. This test validates the key fix for the original issue: when all config reload retries fail, the scheduler continues running with the current configuration instead of stopping. This prevents the production incident where failed DB queries caused the digest to stop permanently.

376-417: LGTM! Cancellation behavior validated. The test correctly verifies that the retry worker respects context cancellation and exits gracefully, preventing goroutine leaks during extension shutdown.
extensions/tn_digest/tn_digest.go (4)
109-111: LGTM! Appropriate default fallback. Setting a sensible default reload interval when not configured ensures the extension functions correctly even without explicit configuration.

116-117: LGTM! Appropriate worker lifecycle management. Starting the retry worker in the engine ready hook ensures it's available throughout the extension's lifetime for handling config reload failures.

195-203: LGTM! Non-blocking approach correctly implemented. The single immediate attempt in the end-block hook, with background retry signaling on failure, correctly addresses the previous blocking concern. This ensures consensus progress is never stalled by database retry timers.

Note: This implementation resolves the issue flagged in the past review comment about blocking the consensus goroutine.

208-239: LGTM! Comprehensive config change handling with good observability. The enhanced logging provides excellent traceability for config changes, including the specific values that changed and the current leadership status. The state transition logic correctly handles enable/disable scenarios and scheduler lifecycle.
extensions/tn_digest/extension.go (6)
41-48: LGTM! Appropriate fields for retry worker state. The added fields correctly support the retry worker implementation. The buffered channel for `retrySignal` appropriately prevents blocking when signaling retries.
111-124: LGTM! Sensible default values for retry behavior. The defaults (1-minute backoff, 15 max retries) provide reasonable resilience against transient database issues while preventing indefinite retry loops.

131-147: LGTM! Clean worker lifecycle management. The start/stop implementation correctly manages the background goroutine lifecycle with proper idempotency checks and context cancellation.

149-167: LGTM! Correct signal handling and worker loop. The worker loop correctly waits for retry signals or context cancellation, and the non-blocking signal implementation prevents potential deadlocks.
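As a standalone illustration of the non-blocking, coalescing signal pattern the review describes (the `worker`/`signalRetry` names here are hypothetical, not the extension's API):

```go
package main

import "fmt"

// worker carries a retry signal channel buffered with capacity 1, mirroring
// the coalescing pattern described in the review.
type worker struct {
	retrySignal chan struct{}
}

func newWorker() *worker {
	return &worker{retrySignal: make(chan struct{}, 1)}
}

// signalRetry requests a background reload without ever blocking the caller:
// if a retry is already pending, the extra signal is dropped (coalesced).
func (w *worker) signalRetry() bool {
	select {
	case w.retrySignal <- struct{}{}:
		return true // enqueued
	default:
		return false // already pending; coalesce
	}
}

func main() {
	w := newWorker()
	fmt.Println(w.signalRetry()) // true: first signal enqueued
	fmt.Println(w.signalRetry()) // false: coalesced with the pending one
	<-w.retrySignal              // the worker loop drains the signal
	fmt.Println(w.signalRetry()) // true: channel has room again
}
```

The capacity-1 buffer is what lets the end-block hook fire and forget: at most one retry request is ever queued, and duplicates collapse into it.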
169-197: LGTM! Retry logic correctly implements fail-safe behavior. The retry implementation correctly:
- Respects context cancellation during backoff
- Applies the new config on successful reload
- Preserves the current config when all retries fail (preventing the original production issue)
237-243: LGTM! Proper cleanup in `Close()`. Stopping the retry worker during extension closure correctly prevents goroutine leaks.
Actionable comments posted: 0
🧹 Nitpick comments (4)
extensions/tn_digest/tn_digest.go (3)
83-89: Verify error handling in config parsing.

The `fmt.Sscan` calls ignore parse errors, which means invalid config values (e.g., `reload_interval_blocks = "invalid"`) silently fall back to zero and then to defaults. While this provides resilience, it could mask configuration mistakes.

Consider logging a warning when parsing fails:

```diff
 if v, ok2 := m["reload_interval_blocks"]; ok2 && v != "" {
 	var parsed int64
-	_, _ = fmt.Sscan(v, &parsed)
+	if _, err := fmt.Sscan(v, &parsed); err != nil {
+		logger.Warn("failed to parse reload_interval_blocks, using default", "value", v, "error", err)
+	}
 	if parsed > 0 {
 		ext.SetReloadIntervalBlocks(parsed)
 	}
 }
```
91-97: Same parsing concern applies to `reload_retry_backoff_seconds`. Consider adding parse error logging similar to the previous comment.

99-105: Same parsing concern applies to `reload_max_retries`. Consider adding parse error logging similar to the previous comment.
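A runnable sketch of the suggested parse-with-warning fallback, with an illustrative `parsePositiveInt` helper (not the extension's actual code):

```go
package main

import (
	"fmt"
	"log"
)

// parsePositiveInt sketches the suggested behavior: parse with fmt.Sscan,
// warn instead of silently ignoring the error, and fall back to def for
// malformed or non-positive values.
func parsePositiveInt(key, raw string, def int64) int64 {
	var parsed int64
	if _, err := fmt.Sscan(raw, &parsed); err != nil {
		log.Printf("failed to parse %s=%q, using default %d: %v", key, raw, def, err)
		return def
	}
	if parsed <= 0 {
		return def
	}
	return parsed
}

func main() {
	fmt.Println(parsePositiveInt("reload_interval_blocks", "2000", 1000))    // 2000
	fmt.Println(parsePositiveInt("reload_interval_blocks", "invalid", 1000)) // 1000, plus a warning
}
```

The same helper shape would cover all three keys (`reload_interval_blocks`, `reload_retry_backoff_seconds`, `reload_max_retries`) without triplicating the warn-and-default logic.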
extensions/tn_digest/extension.go (1)
185-197: Verify error handling for context cancellation during LoadDigestConfig.

If `retryWorkerCtx` is cancelled while `LoadDigestConfig` is executing, it may return an error (likely `context.Canceled`). The current code logs this as a failed attempt and continues retrying (line 193), but it should probably recognize cancellation and exit early.

Add an explicit context cancellation check after `LoadDigestConfig` fails:

```diff
 enabled, schedule, err := e.EngineOps().LoadDigestConfig(e.retryWorkerCtx)
 if err == nil {
 	// Success! Update config (app=nil since we're in background, service already cached)
 	e.Logger().Info("config reload succeeded in background", "attempt", attempt, "enabled", enabled, "schedule", schedule)
 	e.applyConfigChangeWithLock(e.retryWorkerCtx, enabled, schedule, nil)
 	return
 }
+
+// Check if context was cancelled
+if e.retryWorkerCtx.Err() != nil {
+	e.Logger().Info("retry worker cancelled during config reload")
+	return
+}
 e.Logger().Warn("config reload attempt failed", "attempt", attempt, "error", err)
```
📜 Review details
📒 Files selected for processing (2)
- extensions/tn_digest/extension.go (3 hunks)
- extensions/tn_digest/tn_digest.go (3 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
extensions/tn_digest/extension.go (1)
extensions/tn_digest/constants.go (1)
DefaultDigestSchedule (5-5)
extensions/tn_digest/tn_digest.go (1)
extensions/tn_digest/constants.go (1)
ExtensionName (4-4)
🔇 Additional comments (10)
extensions/tn_digest/tn_digest.go (3)
108-111: LGTM: Proper default handling. The default interval (1000 blocks) is applied only when not configured or set to zero, which is the correct fail-safe behavior.

116-117: LGTM: Background retry worker properly initialized. Starting the retry worker during engine ready ensures it's available before any config reload failures can occur, which is the correct initialization sequence.
195-208: ✅ Critical issue resolved: Non-blocking approach implemented. The previous blocking retry loop has been replaced with:
- Single immediate LoadDigestConfig attempt (lines 197-203)
- On failure: log warning, signal background worker, return immediately
- On success: apply config via centralized applyConfigChangeWithLock
This design ensures the end-block hook never sleeps or blocks consensus progress, resolving the critical issue from the previous review.
Based on past review comments.
extensions/tn_digest/extension.go (7)
41-48: LGTM: Appropriate state for retry mechanism. The new fields properly support the retry worker:

- `reloadRetryBackoff` and `reloadMaxRetries` for configuration
- `retryWorkerCtx` and `retryWorkerCancel` for lifecycle management
- the `retrySignal` buffered channel prevents blocking on signal
- `retryMu` protects config changes from races
111-124: LGTM: Inline defaults prevent zero-value issues. The getters return sensible defaults (1-minute backoff, 15 retries) when fields are zero, ensuring the retry mechanism always has valid configuration.

158-167: LGTM: Worker loop properly handles cancellation. The loop correctly prioritizes context cancellation (line 161) over retry signals, ensuring clean shutdown. The signal is non-blocking due to the buffered channel (line 137).
174-183: LGTM: Context-aware backoff respects cancellation. The backoff implementation (lines 177-182) properly checks `retryWorkerCtx.Done()` before sleeping, allowing the worker to terminate promptly during shutdown.
199-242: ✅ Critical issue resolved: Race condition eliminated and duplication removed. The new `applyConfigChangeWithLock` method addresses both critical issues from the previous review:

- Race condition fixed: `retryMu` (line 202) protects config changes from concurrent access by both the end-block path (tn_digest.go:206) and the retry worker (extension.go:189).
- Code duplication eliminated: the ~40 lines of config change logic are now centralized in this single method, called from both paths.
The implementation correctly handles:
- Default schedule fallback (lines 205-207)
- Config change detection (line 209)
- Scheduler lifecycle (stop on disable, restart on enable with leadership)
- Service updates from app context (lines 222-227)
Based on past review comments.
244-249: LGTM: Proper cleanup on Close. The `stopRetryWorker` call (line 246) ensures the background goroutine is cancelled and cleaned up before stopping the scheduler, preventing resource leaks.
132-139: No need to clear `retrySignal` in `stopRetryWorker`. `startRetryWorker` always reinitializes `retrySignal`, and a send on a nil channel inside a `select` with a `default` case safely falls through to `default`, so resetting it is unnecessary.

Likely an incorrect or invalid review comment.
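The nil-channel behavior this verdict relies on can be demonstrated directly (a standalone sketch, not project code):

```go
package main

import "fmt"

// trySignal shows why clearing the channel is unnecessary: a send on a nil
// channel blocks forever, so inside a select with a default case the default
// branch is taken and nothing panics or deadlocks.
func trySignal(ch chan struct{}) bool {
	select {
	case ch <- struct{}{}:
		return true
	default:
		return false
	}
}

func main() {
	var nilCh chan struct{} // nil channel, as after a hypothetical reset
	fmt.Println(trySignal(nilCh))                  // false: nil send never proceeds
	fmt.Println(trySignal(make(chan struct{}, 1))) // true: buffered send proceeds
}
```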
Force-pushed from 90eb389 to 1b4d658