fix: digest failing to prune pending days due to nonce errors #1188
Conversation
Resolves an issue where digest operations failed to complete, leaving 596K+ pending prune days unprocessed. The digest scheduler halted midway through processing due to transaction nonce collisions during retries.

**Problem:**
- Digest stopped processing on Sept 30, leaving days 20353-20361+ pending
- 596,375 days accumulated in the pending_prune_days table
- Root cause: broadcast timeout → retry reused the same nonce → "transaction already exists"
- Concurrent transactions from the same account caused nonce conflicts
- System required manual intervention to resume

**Solution:**
Implemented stateless retry logic that always refetches a fresh nonce from the database on each attempt. This automatically handles:
- Network timeouts
- Concurrent transaction activity
- Nonce gaps and database state changes

**Implementation:**
- Added `BroadcastAutoDigestWithArgsAndRetry()` with exponential backoff (5s → 60s max)
- Retry on error
- Fresh nonce query before each broadcast attempt
- Context-aware cancellation support
- Maximum 3 retries per digest run

**Testing:**
- Added 6 unit tests covering retry scenarios
- Verified fresh nonce refetch on each attempt
- Tests for timeout, cancellation, max retries, and transaction failures
- Added build tag `//go:build kwiltest` to integration test
- All 19 tests passing

**Files Changed:**
- `extensions/tn_digest/internal/engine_ops.go` - Core retry logic
- `extensions/tn_digest/scheduler/scheduler.go` - Scheduler integration
- `extensions/tn_digest/internal/engine_ops_test.go` - Comprehensive test coverage
- `extensions/tn_digest/engine_ops_integration_test.go` - Build tag fix

This ensures digest operations continue reliably even during network congestion or concurrent transaction activity, eliminating the nonce collision failure mode observed in production.

resolves: trufnetwork/truf-network#1241
**Walkthrough**
Adds a retry-capable broadcast path for auto_digest transactions that fetches a fresh nonce per attempt, integrates the retry call into the scheduler, and guards an integration test with a build tag.
**Sequence Diagram(s)**

```mermaid
sequenceDiagram
  autonumber
  participant S as Scheduler
  participant E as EngineOperations
  participant A as Accounts (nonce)
  participant B as Broadcaster (fn)
  participant N as Network
  S->>E: BroadcastAutoDigestWithArgsAndRetry(ctx, dsn, signer, broadcastFn, ...)
  loop retry attempts (<= max) and ctx not done
    E->>A: Request fresh nonce
    A-->>E: Nonce
    E->>E: Build & sign tx with nonce + args
    E->>B: call broadcastFn(ctx, tx, chainID)
    alt broadcast success (OK code)
      B-->>E: (hash, txResult, nil)
      E->>E: Validate code, parse digest from logs
      E-->>S: DigestTxResult (success)
    else broadcast failure / timeout / wrong code
      B-->>E: (hash?, txResult?, err)
      E-->>E: Log attempt, backoff, prepare next attempt
    end
    opt ctx cancelled
      E-->>S: return ctx error
    end
  end
  alt exceeded max retries
    E-->>S: return last error
  end
```

**Estimated code review effort:** 🎯 4 (Complex) | ⏱️ ~45 minutes
Actionable comments posted: 1
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
- extensions/tn_digest/engine_ops_integration_test.go (1 hunks)
- extensions/tn_digest/internal/engine_ops.go (2 hunks)
- extensions/tn_digest/internal/engine_ops_test.go (2 hunks)
- extensions/tn_digest/scheduler/scheduler.go (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
extensions/tn_digest/scheduler/scheduler.go (1)
extensions/tn_digest/scheduler/constants.go (3)
- DigestDeleteCap (7-7)
- DigestExpectedRecordsPerStream (8-8)
- DigestPreservePastDays (9-9)
extensions/tn_digest/internal/engine_ops_test.go (1)
extensions/tn_digest/internal/engine_ops.go (1)
- EngineOperations (25-30)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: acceptance-test
- GitHub Check: lint