fix: digest and prune stops permanently on database timeouts#1191

Merged
MicBun merged 2 commits into main from continueDigestAndPrune on Oct 6, 2025
Conversation


@MicBun MicBun commented Oct 6, 2025

Root Cause:

  • Every 1000 blocks, the digest reloads its config from the database table digest_config
  • If the database query fails (timeout, network issue), the error was ignored
  • The ignored error returns enabled=false by default
  • The scheduler detects a config change (true→false) and stops permanently
  • The next reload also fails, but no change is detected (false==false)
  • The scheduler never restarts until the node is restarted

Production Impact:

  • Oct 4-5: digest stopped with "tn_digest stopped due to config disabled"
  • The database shows enabled=true, but the scheduler had actually stopped
  • No digest ran until a manual node restart

Solution:

  1. Retry logic: 15 attempts with a 1-minute backoff (configurable)
  2. Fail-safe: if all retries fail, keep the current config (don't stop the scheduler)
  3. Context-aware: respects context cancellation during retries
  4. Better logging: shows config changes, retry attempts, and errors

resolves: https://github.com/trufnetwork/truf-network/issues/1256

Summary by CodeRabbit

  • New Features

    • Added resilient config reload with automatic retries and backoff.
    • New settings: reload_retry_backoff_seconds and reload_max_retries; existing reload interval retained.
    • Defaults applied when not configured; on persistent failures, the last valid config is preserved and scheduling continues.
  • Documentation

    • Updated README with “Config reload resilience” explaining retry behavior and new settings.
  • Tests

    • Added tests covering transient failures, max-retry exhaustion (preserve current config), and graceful shutdown on cancellation.

@MicBun MicBun requested a review from outerlook October 6, 2025 05:57
@MicBun MicBun self-assigned this Oct 6, 2025
@MicBun MicBun added the type: fix fixing something label Oct 6, 2025
@holdex

holdex bot commented Oct 6, 2025

Bug Report Checklist

| Status | Commit Link | Bug Author |
| --- | --- | --- |
| ❌ Not Submitted | | |

@MicBun, please use git blame and specify the link to the commit that introduced this bug.

Send the following message in this PR: `@pr-time-tracker bug commit [link](url) && bug author @name`

@holdex

holdex bot commented Oct 6, 2025

Time Submission Status

| Member | Status | Time | Action | Last Update |
| --- | --- | --- | --- | --- |
| MicBun | ✅ Submitted | 6h | Update time | Oct 6, 2025, 12:50 PM |
| @outerlook | ❌ Missing | - | ⚠️ Submit time | - |

@coderabbitai

coderabbitai bot commented Oct 6, 2025

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 40.00%, below the required threshold of 80.00%. | Run `@coderabbitai generate docstrings` to improve docstring coverage. |
✅ Passed checks (2 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit's high-level summary is enabled. |
| Title Check | ✅ Passed | The title correctly highlights the fix for the digest scheduler stopping permanently on database timeouts, which aligns with the core changes adding retry logic and fail-safe behavior, but it erroneously references "prune", which is not modified in this PR. |

📜 Recent review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 90eb389 and 1b4d658.

📒 Files selected for processing (3)
  • extensions/tn_digest/extension.go (3 hunks)
  • extensions/tn_digest/leader_reload_test.go (3 hunks)
  • extensions/tn_digest/tn_digest.go (3 hunks)
🧰 Additional context used
🧬 Code graph analysis (3)
extensions/tn_digest/extension.go (1)
extensions/tn_digest/constants.go (1)
  • DefaultDigestSchedule (5-5)
extensions/tn_digest/tn_digest.go (1)
extensions/tn_digest/constants.go (1)
  • ExtensionName (4-4)
extensions/tn_digest/leader_reload_test.go (1)
extensions/tn_digest/internal/engine_ops.go (1)
  • NewEngineOperations (32-34)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: acceptance-test



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

📜 Review details


📥 Commits

Reviewing files that changed from the base of the PR and between 9a857b4 and 18413c5.

📒 Files selected for processing (4)
  • extensions/tn_digest/README.md (1 hunks)
  • extensions/tn_digest/extension.go (3 hunks)
  • extensions/tn_digest/leader_reload_test.go (3 hunks)
  • extensions/tn_digest/tn_digest.go (4 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
extensions/tn_digest/tn_digest.go (1)
extensions/tn_digest/constants.go (2)
  • ExtensionName (4-4)
  • DefaultDigestSchedule (5-5)
extensions/tn_digest/leader_reload_test.go (1)
extensions/tn_digest/internal/engine_ops.go (1)
  • NewEngineOperations (32-34)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: acceptance-test


MicBun commented Oct 6, 2025

@pr-time-tracker bug commit not caused by previous commit


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (1)
extensions/tn_digest/tn_digest.go (1)

79-107: Consider logging config parse failures.

The configuration parsing silently ignores errors from fmt.Sscan (lines 85, 92, 100). While failing safely to defaults is appropriate, operators would benefit from warning logs when config values fail to parse, helping them detect typos or invalid values in their TOML configuration.

Example for reload_interval_blocks:

 if v, ok2 := m["reload_interval_blocks"]; ok2 && v != "" {
   var parsed int64
-  _, _ = fmt.Sscan(v, &parsed)
+  _, err := fmt.Sscan(v, &parsed)
   if parsed > 0 {
     ext.SetReloadIntervalBlocks(parsed)
+  } else if err != nil {
+    ext.Logger().Warn("failed to parse reload_interval_blocks, using default", "value", v, "error", err)
   }
 }
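
The suggestion above can be exercised standalone; in this sketch the `parseInt64` helper is hypothetical (not the extension's actual code) and simply shows how surfacing the `fmt.Sscan` error catches typos such as `reload_interval_blocks = "abc"`:

```go
package main

import "fmt"

// parseInt64 mirrors the review suggestion: parse a config value and
// report a warning instead of silently ignoring the Sscan error.
// It returns the parsed value and whether it is usable (positive).
func parseInt64(key, v string) (int64, bool) {
	var parsed int64
	if _, err := fmt.Sscan(v, &parsed); err != nil {
		// Surfacing the error helps operators spot TOML typos.
		fmt.Printf("warn: failed to parse %s, using default: %v\n", key, err)
		return 0, false
	}
	return parsed, parsed > 0
}

func main() {
	if v, ok := parseInt64("reload_interval_blocks", "1000"); ok {
		fmt.Println("parsed:", v) // → parsed: 1000
	}
	_, ok := parseInt64("reload_interval_blocks", "abc") // logs a warning
	fmt.Println("ok:", ok)                               // → ok: false
}
```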
📜 Review details


📥 Commits

Reviewing files that changed from the base of the PR and between 18413c5 and babadc7.

📒 Files selected for processing (3)
  • extensions/tn_digest/extension.go (3 hunks)
  • extensions/tn_digest/leader_reload_test.go (3 hunks)
  • extensions/tn_digest/tn_digest.go (4 hunks)
🧰 Additional context used
🧬 Code graph analysis (3)
extensions/tn_digest/tn_digest.go (1)
extensions/tn_digest/constants.go (2)
  • ExtensionName (4-4)
  • DefaultDigestSchedule (5-5)
extensions/tn_digest/extension.go (1)
extensions/tn_digest/constants.go (1)
  • DefaultDigestSchedule (5-5)
extensions/tn_digest/leader_reload_test.go (1)
extensions/tn_digest/internal/engine_ops.go (1)
  • NewEngineOperations (32-34)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: acceptance-test
🔇 Additional comments (14)
extensions/tn_digest/leader_reload_test.go (4)

42-57: LGTM! Effective transient failure simulation.

The fakeDB enhancement correctly simulates transient database failures for testing the retry logic. The implementation is clear and allows tests to control when failures occur and when they succeed.


289-333: LGTM! Test validates transient failure recovery.

The test correctly verifies that transient database failures are recovered through background retries, and the scheduler is successfully restarted with the new configuration. The assertions appropriately check both the schedule update and the retry attempt count.


335-374: LGTM! Critical fail-safe behavior validated.

This test validates the key fix for the original issue: when all config reload retries fail, the scheduler continues running with the current configuration instead of stopping. This prevents the production incident where failed DB queries caused the digest to stop permanently.


376-417: LGTM! Cancellation behavior validated.

The test correctly verifies that the retry worker respects context cancellation and exits gracefully, preventing goroutine leaks during extension shutdown.

extensions/tn_digest/tn_digest.go (4)

109-111: LGTM! Appropriate default fallback.

Setting a sensible default reload interval when not configured ensures the extension functions correctly even without explicit configuration.


116-117: LGTM! Appropriate worker lifecycle management.

Starting the retry worker in the engine ready hook ensures it's available throughout the extension's lifetime for handling config reload failures.


195-203: LGTM! Non-blocking approach correctly implemented.

The single immediate attempt in the end-block hook, with background retry signaling on failure, correctly addresses the previous blocking concern. This ensures consensus progress is never stalled by database retry timers.

Note: This implementation resolves the issue flagged in the past review comment about blocking the consensus goroutine.


208-239: LGTM! Comprehensive config change handling with good observability.

The enhanced logging provides excellent traceability for config changes, including the specific values that changed and the current leadership status. The state transition logic correctly handles enable/disable scenarios and scheduler lifecycle.

extensions/tn_digest/extension.go (6)

41-48: LGTM! Appropriate fields for retry worker state.

The added fields correctly support the retry worker implementation. The buffered channel for retrySignal appropriately prevents blocking when signaling retries.


111-124: LGTM! Sensible default values for retry behavior.

The defaults (1 minute backoff, 15 max retries) provide reasonable resilience against transient database issues while preventing indefinite retry loops.


131-147: LGTM! Clean worker lifecycle management.

The start/stop implementation correctly manages the background goroutine lifecycle with proper idempotency checks and context cancellation.


149-167: LGTM! Correct signal handling and worker loop.

The worker loop correctly waits for retry signals or context cancellation, and the non-blocking signal implementation prevents potential deadlocks.
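The non-blocking signal pattern described here is a standard Go idiom; a minimal standalone sketch (`signalRetry` is an illustrative name, not the extension's API):

```go
package main

import "fmt"

// signalRetry requests a background retry without ever blocking the
// caller: the channel has capacity 1, and if a signal is already
// pending, the default case drops the duplicate (signals coalesce).
func signalRetry(ch chan struct{}) bool {
	select {
	case ch <- struct{}{}:
		return true // signal buffered for the worker
	default:
		return false // a retry is already pending; nothing to do
	}
}

func main() {
	ch := make(chan struct{}, 1) // buffered: at most one pending signal
	fmt.Println(signalRetry(ch)) // → true (buffered)
	fmt.Println(signalRetry(ch)) // → false (coalesced with pending signal)
	<-ch                         // the worker consumes the signal
	fmt.Println(signalRetry(ch)) // → true (buffer free again)
}
```

Because the send can never block, the end-block hook can fire a signal from the consensus path without risking a deadlock against the worker goroutine.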


169-197: LGTM! Retry logic correctly implements fail-safe behavior.

The retry implementation correctly:

  • Respects context cancellation during backoff
  • Applies the new config on successful reload
  • Preserves the current config when all retries fail (preventing the original production issue)

237-243: LGTM! Proper cleanup in Close().

Stopping the retry worker during extension closure correctly prevents goroutine leaks.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (4)
extensions/tn_digest/tn_digest.go (3)

83-89: Verify error handling in config parsing.

The fmt.Sscan calls ignore parse errors, which means invalid config values (e.g., reload_interval_blocks = "invalid") silently fall back to zero and then to defaults. While this provides resilience, it could mask configuration mistakes.

Consider logging a warning when parsing fails:

 if v, ok2 := m["reload_interval_blocks"]; ok2 && v != "" {
   var parsed int64
-  _, _ = fmt.Sscan(v, &parsed)
+  if _, err := fmt.Sscan(v, &parsed); err != nil {
+    logger.Warn("failed to parse reload_interval_blocks, using default", "value", v, "error", err)
+  }
   if parsed > 0 {
     ext.SetReloadIntervalBlocks(parsed)
   }
 }

91-97: Same parsing concern applies to reload_retry_backoff_seconds.

Consider adding parse error logging similar to the previous comment.


99-105: Same parsing concern applies to reload_max_retries.

Consider adding parse error logging similar to the previous comment.

extensions/tn_digest/extension.go (1)

185-197: Verify error handling for context cancellation during LoadDigestConfig.

If retryWorkerCtx is cancelled while LoadDigestConfig is executing, it may return an error (likely a context.Canceled error). The current code logs this as a failed attempt and continues retrying (line 193), but it should probably recognize cancellation and exit early.

Add explicit context cancellation check after LoadDigestConfig fails:

 enabled, schedule, err := e.EngineOps().LoadDigestConfig(e.retryWorkerCtx)
 if err == nil {
   // Success! Update config (app=nil since we're in background, service already cached)
   e.Logger().Info("config reload succeeded in background", "attempt", attempt, "enabled", enabled, "schedule", schedule)
   e.applyConfigChangeWithLock(e.retryWorkerCtx, enabled, schedule, nil)
   return
 }
+
+// Check if context was cancelled
+if e.retryWorkerCtx.Err() != nil {
+  e.Logger().Info("retry worker cancelled during config reload")
+  return
+}

 e.Logger().Warn("config reload attempt failed", "attempt", attempt, "error", err)
📜 Review details


📥 Commits

Reviewing files that changed from the base of the PR and between babadc7 and 90eb389.

📒 Files selected for processing (2)
  • extensions/tn_digest/extension.go (3 hunks)
  • extensions/tn_digest/tn_digest.go (3 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
extensions/tn_digest/extension.go (1)
extensions/tn_digest/constants.go (1)
  • DefaultDigestSchedule (5-5)
extensions/tn_digest/tn_digest.go (1)
extensions/tn_digest/constants.go (1)
  • ExtensionName (4-4)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: acceptance-test
🔇 Additional comments (10)
extensions/tn_digest/tn_digest.go (3)

108-111: LGTM: Proper default handling.

The default interval (1000 blocks) is applied only when not configured or set to zero, which is the correct fail-safe behavior.


116-117: LGTM: Background retry worker properly initialized.

Starting the retry worker during engine ready ensures it's available before any config reload failures can occur, which is the correct initialization sequence.


195-208: ✅ Critical issue resolved: Non-blocking approach implemented.

The previous blocking retry loop has been replaced with:

  1. Single immediate LoadDigestConfig attempt (lines 197-203)
  2. On failure: log warning, signal background worker, return immediately
  3. On success: apply config via centralized applyConfigChangeWithLock

This design ensures the end-block hook never sleeps or blocks consensus progress, resolving the critical issue from the previous review.

Based on past review comments.

extensions/tn_digest/extension.go (7)

41-48: LGTM: Appropriate state for retry mechanism.

The new fields properly support the retry worker:

  • reloadRetryBackoff and reloadMaxRetries for configuration
  • retryWorkerCtx and retryWorkerCancel for lifecycle management
  • retrySignal buffered channel prevents blocking on signal
  • retryMu protects config changes from races

111-124: LGTM: Inline defaults prevent zero-value issues.

The getters return sensible defaults (1 minute backoff, 15 retries) when fields are zero, ensuring the retry mechanism always has valid configuration.


158-167: LGTM: Worker loop properly handles cancellation.

The loop correctly prioritizes context cancellation (line 161) over retry signals, ensuring clean shutdown. The signal is non-blocking due to the buffered channel (line 137).


174-183: LGTM: Context-aware backoff respects cancellation.

The backoff implementation (lines 177-182) properly checks retryWorkerCtx.Done() before sleeping, allowing the worker to terminate promptly during shutdown.


199-242: ✅ Critical issue resolved: Race condition eliminated and duplication removed.

The new applyConfigChangeWithLock method addresses both critical issues from the previous review:

  1. Race condition fixed: retryMu (line 202) protects config changes from concurrent access by both the end-block path (tn_digest.go:206) and the retry worker (extension.go:189).

  2. Code duplication eliminated: The ~40 lines of config change logic are now centralized in this single method, called from both paths.

The implementation correctly handles:

  • Default schedule fallback (lines 205-207)
  • Config change detection (line 209)
  • Scheduler lifecycle (stop on disable, restart on enable with leadership)
  • Service updates from app context (lines 222-227)

Based on past review comments.


244-249: LGTM: Proper cleanup on Close.

The stopRetryWorker call (line 246) ensures the background goroutine is cancelled and cleaned up before stopping the scheduler, preventing resource leaks.


132-139: No need to clear retrySignal in stopRetryWorker
startRetryWorker always reinitializes retrySignal, and sending on a nil channel in a select safely falls through to default, so resetting it is unnecessary.

Likely an incorrect or invalid review comment.
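
The nil-channel claim above matches Go's select semantics: a send on a nil channel blocks forever, so a select with a default case simply takes the default. A minimal demonstration:

```go
package main

import "fmt"

func main() {
	var ch chan struct{} // nil channel: a bare send would block forever
	select {
	case ch <- struct{}{}:
		fmt.Println("sent") // never taken when ch is nil
	default:
		fmt.Println("default") // → default (select falls through safely)
	}
}
```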

@MicBun MicBun force-pushed the continueDigestAndPrune branch from 90eb389 to 1b4d658 on October 6, 2025 07:03
@MicBun MicBun requested review from outerlook and removed request for outerlook October 6, 2025 12:21
@MicBun MicBun merged commit e5be54a into main Oct 6, 2025
7 of 8 checks passed
@MicBun MicBun deleted the continueDigestAndPrune branch October 6, 2025 12:56

Labels

type: fix fixing something

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants