fix: digest and prune stops permanently on database timeouts#1191

Merged
MicBun merged 2 commits into main from continueDigestAndPrune on Oct 6, 2025
Conversation


@MicBun MicBun commented Oct 6, 2025

Root Cause:

  • Every 1000 blocks, the digest reloads its config from the database table digest_config
  • If the database query fails (timeout, network issue), the error was ignored
  • The ignored error returns enabled=false by default
  • The scheduler detects a config change (true→false) and stops permanently
  • The next reload also fails, but no change is detected (false==false)
  • The scheduler never restarts until the node is restarted

Production Impact:

  • Oct 4-5: digest stopped with "tn_digest stopped due to config disabled"
  • The database shows enabled=true, but the scheduler had actually stopped
  • No digest ran until a manual node restart

Solution:

  1. Retry logic: 15 attempts with a 1-minute backoff (configurable)
  2. Fail-safe: if all retries fail, keep the current config (don't stop the scheduler)
  3. Context-aware: respects context cancellation during retries
  4. Better logging: shows config changes, retry attempts, and errors

resolves: https://github.com/trufnetwork/truf-network/issues/1256

Summary by CodeRabbit

  • New Features

    • Added resilient config reload with automatic retries and backoff.
    • New settings: reload_retry_backoff_seconds and reload_max_retries; existing reload interval retained.
    • Defaults applied when not configured; on persistent failures, the last valid config is preserved and scheduling continues.
  • Documentation

    • Updated README with “Config reload resilience” explaining retry behavior and new settings.
  • Tests

    • Added tests covering transient failures, max-retry exhaustion (preserve current config), and graceful shutdown on cancellation.

@MicBun MicBun requested a review from outerlook October 6, 2025 05:57
@MicBun MicBun self-assigned this Oct 6, 2025
@MicBun MicBun added the type: fix fixing something label Oct 6, 2025
@holdex

holdex bot commented Oct 6, 2025

Bug Report Checklist

| Status | Commit Link | Bug Author |
| --- | --- | --- |
| ❌ Not Submitted | | |

@MicBun, please use git blame and specify the link to the commit that introduced this bug.

Send the following message in this PR: `@pr-time-tracker bug commit [link](url) && bug author @name`

@holdex

holdex bot commented Oct 6, 2025

Time Submission Status

| Member | Status | Time | Action | Last Update |
| --- | --- | --- | --- | --- |
| MicBun | ✅ Submitted | 6h | Update time | Oct 6, 2025, 12:50 PM |
| @outerlook | ❌ Missing | - | ⚠️ Submit time | - |

@coderabbitai

coderabbitai bot commented Oct 6, 2025

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 40.00%, below the required threshold of 80.00%. | Run `@coderabbitai generate docstrings` to improve docstring coverage. |
✅ Passed checks (2 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit's high-level summary is enabled. |
| Title Check | ✅ Passed | The title correctly highlights the fix for the digest scheduler stopping permanently on database timeouts, which aligns with the core changes adding retry logic and fail-safe behavior, but it erroneously references "prune", which is not modified in this PR. |

📜 Recent review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 90eb389 and 1b4d658.

📒 Files selected for processing (3)
  • extensions/tn_digest/extension.go (3 hunks)
  • extensions/tn_digest/leader_reload_test.go (3 hunks)
  • extensions/tn_digest/tn_digest.go (3 hunks)
🧰 Additional context used
🧬 Code graph analysis (3)
extensions/tn_digest/extension.go (1)
extensions/tn_digest/constants.go (1)
  • DefaultDigestSchedule (5-5)
extensions/tn_digest/tn_digest.go (1)
extensions/tn_digest/constants.go (1)
  • ExtensionName (4-4)
extensions/tn_digest/leader_reload_test.go (1)
extensions/tn_digest/internal/engine_ops.go (1)
  • NewEngineOperations (32-34)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: acceptance-test



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

📜 Review details


📥 Commits

Reviewing files that changed from the base of the PR and between 9a857b4 and 18413c5.

📒 Files selected for processing (4)
  • extensions/tn_digest/README.md (1 hunks)
  • extensions/tn_digest/extension.go (3 hunks)
  • extensions/tn_digest/leader_reload_test.go (3 hunks)
  • extensions/tn_digest/tn_digest.go (4 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
extensions/tn_digest/tn_digest.go (1)
extensions/tn_digest/constants.go (2)
  • ExtensionName (4-4)
  • DefaultDigestSchedule (5-5)
extensions/tn_digest/leader_reload_test.go (1)
extensions/tn_digest/internal/engine_ops.go (1)
  • NewEngineOperations (32-34)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: acceptance-test


MicBun commented Oct 6, 2025

@pr-time-tracker bug commit not caused by previous commit


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (1)
extensions/tn_digest/tn_digest.go (1)

79-107: Consider logging config parse failures.

The configuration parsing silently ignores errors from fmt.Sscan (lines 85, 92, 100). While failing safely to defaults is appropriate, operators would benefit from warning logs when config values fail to parse, helping them detect typos or invalid values in their TOML configuration.

Example for reload_interval_blocks:

 if v, ok2 := m["reload_interval_blocks"]; ok2 && v != "" {
   var parsed int64
-  _, _ = fmt.Sscan(v, &parsed)
+  _, err := fmt.Sscan(v, &parsed)
   if parsed > 0 {
     ext.SetReloadIntervalBlocks(parsed)
+  } else if err != nil {
+    ext.Logger().Warn("failed to parse reload_interval_blocks, using default", "value", v, "error", err)
   }
 }
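
The suggestion above can be exercised standalone; in this sketch the `parseInt64` helper is hypothetical (not the extension's actual code) and simply shows how surfacing the `fmt.Sscan` error catches typos such as `reload_interval_blocks = "abc"`:

```go
package main

import "fmt"

// parseInt64 mirrors the review suggestion: parse a config value and
// report a warning instead of silently ignoring the Sscan error.
// It returns the parsed value and whether it is usable (positive).
func parseInt64(key, v string) (int64, bool) {
	var parsed int64
	if _, err := fmt.Sscan(v, &parsed); err != nil {
		// Surfacing the error helps operators spot TOML typos.
		fmt.Printf("warn: failed to parse %s, using default: %v\n", key, err)
		return 0, false
	}
	return parsed, parsed > 0
}

func main() {
	if v, ok := parseInt64("reload_interval_blocks", "1000"); ok {
		fmt.Println("parsed:", v) // → parsed: 1000
	}
	_, ok := parseInt64("reload_interval_blocks", "abc") // logs a warning
	fmt.Println("ok:", ok)                               // → ok: false
}
```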
📜 Review details


📥 Commits

Reviewing files that changed from the base of the PR and between 18413c5 and babadc7.

📒 Files selected for processing (3)
  • extensions/tn_digest/extension.go (3 hunks)
  • extensions/tn_digest/leader_reload_test.go (3 hunks)
  • extensions/tn_digest/tn_digest.go (4 hunks)
🧰 Additional context used
🧬 Code graph analysis (3)
extensions/tn_digest/tn_digest.go (1)
extensions/tn_digest/constants.go (2)
  • ExtensionName (4-4)
  • DefaultDigestSchedule (5-5)
extensions/tn_digest/extension.go (1)
extensions/tn_digest/constants.go (1)
  • DefaultDigestSchedule (5-5)
extensions/tn_digest/leader_reload_test.go (1)
extensions/tn_digest/internal/engine_ops.go (1)
  • NewEngineOperations (32-34)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: acceptance-test
🔇 Additional comments (14)
extensions/tn_digest/leader_reload_test.go (4)

42-57: LGTM! Effective transient failure simulation.

The fakeDB enhancement correctly simulates transient database failures for testing the retry logic. The implementation is clear and allows tests to control when failures occur and when they succeed.


289-333: LGTM! Test validates transient failure recovery.

The test correctly verifies that transient database failures are recovered through background retries, and the scheduler is successfully restarted with the new configuration. The assertions appropriately check both the schedule update and the retry attempt count.


335-374: LGTM! Critical fail-safe behavior validated.

This test validates the key fix for the original issue: when all config reload retries fail, the scheduler continues running with the current configuration instead of stopping. This prevents the production incident where failed DB queries caused the digest to stop permanently.


376-417: LGTM! Cancellation behavior validated.

The test correctly verifies that the retry worker respects context cancellation and exits gracefully, preventing goroutine leaks during extension shutdown.

extensions/tn_digest/tn_digest.go (4)

109-111: LGTM! Appropriate default fallback.

Setting a sensible default reload interval when not configured ensures the extension functions correctly even without explicit configuration.


116-117: LGTM! Appropriate worker lifecycle management.

Starting the retry worker in the engine ready hook ensures it's available throughout the extension's lifetime for handling config reload failures.


195-203: LGTM! Non-blocking approach correctly implemented.

The single immediate attempt in the end-block hook, with background retry signaling on failure, correctly addresses the previous blocking concern. This ensures consensus progress is never stalled by database retry timers.

Note: This implementation resolves the issue flagged in the past review comment about blocking the consensus goroutine.


208-239: LGTM! Comprehensive config change handling with good observability.

The enhanced logging provides excellent traceability for config changes, including the specific values that changed and the current leadership status. The state transition logic correctly handles enable/disable scenarios and scheduler lifecycle.

extensions/tn_digest/extension.go (6)

41-48: LGTM! Appropriate fields for retry worker state.

The added fields correctly support the retry worker implementation. The buffered channel for retrySignal appropriately prevents blocking when signaling retries.


111-124: LGTM! Sensible default values for retry behavior.

The defaults (1 minute backoff, 15 max retries) provide reasonable resilience against transient database issues while preventing indefinite retry loops.


131-147: LGTM! Clean worker lifecycle management.

The start/stop implementation correctly manages the background goroutine lifecycle with proper idempotency checks and context cancellation.


149-167: LGTM! Correct signal handling and worker loop.

The worker loop correctly waits for retry signals or context cancellation, and the non-blocking signal implementation prevents potential deadlocks.
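The non-blocking signal pattern described here is a standard Go idiom; a minimal standalone sketch (`signalRetry` is an illustrative name, not the extension's API):

```go
package main

import "fmt"

// signalRetry requests a background retry without ever blocking the
// caller: the channel has capacity 1, and if a signal is already
// pending, the default case drops the duplicate (signals coalesce).
func signalRetry(ch chan struct{}) bool {
	select {
	case ch <- struct{}{}:
		return true // signal buffered for the worker
	default:
		return false // a retry is already pending; nothing to do
	}
}

func main() {
	ch := make(chan struct{}, 1) // buffered: at most one pending signal
	fmt.Println(signalRetry(ch)) // → true (buffered)
	fmt.Println(signalRetry(ch)) // → false (coalesced with pending signal)
	<-ch                         // the worker consumes the signal
	fmt.Println(signalRetry(ch)) // → true (buffer free again)
}
```

Because the send can never block, the end-block hook can fire a signal from the consensus path without risking a deadlock against the worker goroutine.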


169-197: LGTM! Retry logic correctly implements fail-safe behavior.

The retry implementation correctly:

  • Respects context cancellation during backoff
  • Applies the new config on successful reload
  • Preserves the current config when all retries fail (preventing the original production issue)

237-243: LGTM! Proper cleanup in Close().

Stopping the retry worker during extension closure correctly prevents goroutine leaks.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (4)
extensions/tn_digest/tn_digest.go (3)

83-89: Verify error handling in config parsing.

The fmt.Sscan calls ignore parse errors, which means invalid config values (e.g., reload_interval_blocks = "invalid") silently fall back to zero and then to defaults. While this provides resilience, it could mask configuration mistakes.

Consider logging a warning when parsing fails:

 if v, ok2 := m["reload_interval_blocks"]; ok2 && v != "" {
   var parsed int64
-  _, _ = fmt.Sscan(v, &parsed)
+  if _, err := fmt.Sscan(v, &parsed); err != nil {
+    logger.Warn("failed to parse reload_interval_blocks, using default", "value", v, "error", err)
+  }
   if parsed > 0 {
     ext.SetReloadIntervalBlocks(parsed)
   }
 }

91-97: Same parsing concern applies to reload_retry_backoff_seconds.

Consider adding parse error logging similar to the previous comment.


99-105: Same parsing concern applies to reload_max_retries.

Consider adding parse error logging similar to the previous comment.

extensions/tn_digest/extension.go (1)

185-197: Verify error handling for context cancellation during LoadDigestConfig.

If retryWorkerCtx is cancelled while LoadDigestConfig is executing, it may return an error (likely a context.Canceled error). The current code logs this as a failed attempt and continues retrying (line 193), but it should probably recognize cancellation and exit early.

Add explicit context cancellation check after LoadDigestConfig fails:

 enabled, schedule, err := e.EngineOps().LoadDigestConfig(e.retryWorkerCtx)
 if err == nil {
   // Success! Update config (app=nil since we're in background, service already cached)
   e.Logger().Info("config reload succeeded in background", "attempt", attempt, "enabled", enabled, "schedule", schedule)
   e.applyConfigChangeWithLock(e.retryWorkerCtx, enabled, schedule, nil)
   return
 }
+
+// Check if context was cancelled
+if e.retryWorkerCtx.Err() != nil {
+  e.Logger().Info("retry worker cancelled during config reload")
+  return
+}

 e.Logger().Warn("config reload attempt failed", "attempt", attempt, "error", err)
📜 Review details


📥 Commits

Reviewing files that changed from the base of the PR and between babadc7 and 90eb389.

📒 Files selected for processing (2)
  • extensions/tn_digest/extension.go (3 hunks)
  • extensions/tn_digest/tn_digest.go (3 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
extensions/tn_digest/extension.go (1)
extensions/tn_digest/constants.go (1)
  • DefaultDigestSchedule (5-5)
extensions/tn_digest/tn_digest.go (1)
extensions/tn_digest/constants.go (1)
  • ExtensionName (4-4)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: acceptance-test
🔇 Additional comments (10)
extensions/tn_digest/tn_digest.go (3)

108-111: LGTM: Proper default handling.

The default interval (1000 blocks) is applied only when not configured or set to zero, which is the correct fail-safe behavior.


116-117: LGTM: Background retry worker properly initialized.

Starting the retry worker during engine ready ensures it's available before any config reload failures can occur, which is the correct initialization sequence.


195-208: ✅ Critical issue resolved: Non-blocking approach implemented.

The previous blocking retry loop has been replaced with:

  1. Single immediate LoadDigestConfig attempt (lines 197-203)
  2. On failure: log warning, signal background worker, return immediately
  3. On success: apply config via centralized applyConfigChangeWithLock

This design ensures the end-block hook never sleeps or blocks consensus progress, resolving the critical issue from the previous review.

Based on past review comments.

extensions/tn_digest/extension.go (7)

41-48: LGTM: Appropriate state for retry mechanism.

The new fields properly support the retry worker:

  • reloadRetryBackoff and reloadMaxRetries for configuration
  • retryWorkerCtx and retryWorkerCancel for lifecycle management
  • retrySignal buffered channel prevents blocking on signal
  • retryMu protects config changes from races

111-124: LGTM: Inline defaults prevent zero-value issues.

The getters return sensible defaults (1 minute backoff, 15 retries) when fields are zero, ensuring the retry mechanism always has valid configuration.


158-167: LGTM: Worker loop properly handles cancellation.

The loop correctly prioritizes context cancellation (line 161) over retry signals, ensuring clean shutdown. The signal is non-blocking due to the buffered channel (line 137).


174-183: LGTM: Context-aware backoff respects cancellation.

The backoff implementation (lines 177-182) properly checks retryWorkerCtx.Done() before sleeping, allowing the worker to terminate promptly during shutdown.


199-242: ✅ Critical issue resolved: Race condition eliminated and duplication removed.

The new applyConfigChangeWithLock method addresses both critical issues from the previous review:

  1. Race condition fixed: retryMu (line 202) protects config changes from concurrent access by both the end-block path (tn_digest.go:206) and the retry worker (extension.go:189).

  2. Code duplication eliminated: The ~40 lines of config change logic are now centralized in this single method, called from both paths.

The implementation correctly handles:

  • Default schedule fallback (lines 205-207)
  • Config change detection (line 209)
  • Scheduler lifecycle (stop on disable, restart on enable with leadership)
  • Service updates from app context (lines 222-227)

Based on past review comments.


244-249: LGTM: Proper cleanup on Close.

The stopRetryWorker call (line 246) ensures the background goroutine is cancelled and cleaned up before stopping the scheduler, preventing resource leaks.


132-139: No need to clear retrySignal in stopRetryWorker
startRetryWorker always reinitializes retrySignal, and sending on a nil channel in a select safely falls through to default, so resetting it is unnecessary.

Likely an incorrect or invalid review comment.
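
The nil-channel claim above matches Go's select semantics: a send on a nil channel blocks forever, so a select with a default case simply takes the default. A minimal demonstration:

```go
package main

import "fmt"

func main() {
	var ch chan struct{} // nil channel: a bare send would block forever
	select {
	case ch <- struct{}{}:
		fmt.Println("sent") // never taken when ch is nil
	default:
		fmt.Println("default") // → default (select falls through safely)
	}
}
```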

@MicBun MicBun force-pushed the continueDigestAndPrune branch from 90eb389 to 1b4d658 on October 6, 2025 07:03
@MicBun MicBun requested review from outerlook and removed request for outerlook October 6, 2025 12:21
@MicBun MicBun merged commit e5be54a into main Oct 6, 2025
7 of 8 checks passed
@MicBun MicBun deleted the continueDigestAndPrune branch October 6, 2025 12:56

Labels

type: fix fixing something

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants