fix: digest failing to prune pending days due to nonce errors #1188

Merged
MicBun merged 2 commits into main from retryTnDigest
Oct 2, 2025

MicBun commented Oct 2, 2025

Resolves issue where digest operations failed to complete, leaving 596K+ pending prune days unprocessed. The digest scheduler halted midway through processing due to transaction nonce collisions during retries.

Problem:

  • Digest stopped processing on Sept 30, leaving days 20353-20361+ pending
  • 596,375 days accumulated in pending_prune_days table
  • Root cause: broadcast timeout → retry used same nonce → "transaction already exists"
  • Concurrent transactions from same account caused nonce conflicts
  • System required manual intervention to resume

Solution:
Implemented stateless retry logic that always refetches a fresh nonce from the database on each attempt. This automatically handles:

  • Network timeouts
  • Concurrent transaction activity
  • Nonce gaps and database state changes

Implementation:

  • Added BroadcastAutoDigestWithArgsAndRetry() with exponential backoff (5s → 60s max)
  • Retries on any broadcast failure, timeout, or non-OK result code
  • Fresh nonce query before each broadcast attempt
  • Context-aware cancellation support
  • Maximum 3 retries per digest run
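
The retry behaviour listed above can be sketched roughly as follows. This is a minimal illustration, not the code from engine_ops.go: the names backoffDelay, retryWithFreshNonce, sleepCtx, fetchNonce and attempt are hypothetical, the doubling schedule (5s, 10s, 20s, 40s, then the 60s cap) is an assumption about what "exponential backoff" means here, and "3 retries" is read as one initial attempt plus up to three more.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// Values taken from the PR description; the constant names are illustrative.
const (
	initialBackoff = 5 * time.Second
	maxBackoff     = 60 * time.Second
)

// backoffDelay returns the wait before retry number `retry` (0-based),
// doubling from initialBackoff and capping at maxBackoff (assumed schedule).
func backoffDelay(retry int) time.Duration {
	d := initialBackoff
	for i := 0; i < retry; i++ {
		d *= 2
		if d >= maxBackoff {
			return maxBackoff
		}
	}
	return d
}

// sleepCtx waits for d, or returns ctx.Err() early if the context is cancelled.
func sleepCtx(ctx context.Context, d time.Duration) error {
	t := time.NewTimer(d)
	defer t.Stop()
	select {
	case <-ctx.Done():
		return ctx.Err()
	case <-t.C:
		return nil
	}
}

// retryWithFreshNonce keeps no nonce between attempts: it calls fetchNonce
// before every attempt, so a timed-out or concurrent transaction cannot
// leave a stale nonce behind.
func retryWithFreshNonce(
	ctx context.Context,
	maxRetries int,
	fetchNonce func(context.Context) (uint64, error),
	attempt func(context.Context, uint64) error,
	sleep func(context.Context, time.Duration) error,
) error {
	var lastErr error
	for i := 0; i <= maxRetries; i++ {
		if i > 0 {
			// Back off before every retry; cancellation aborts the wait.
			if err := sleep(ctx, backoffDelay(i-1)); err != nil {
				return err
			}
		}
		nonce, err := fetchNonce(ctx)
		if err == nil {
			err = attempt(ctx, nonce)
		}
		if err == nil {
			return nil
		}
		lastErr = err
	}
	return fmt.Errorf("giving up after %d retries: %w", maxRetries, lastErr)
}

func main() {
	var chainNonce uint64 = 7 // pretend account nonce stored by the node
	calls := 0
	fetch := func(context.Context) (uint64, error) { return chainNonce + 1, nil }
	attempt := func(_ context.Context, nonce uint64) error {
		calls++
		fmt.Printf("attempt %d with nonce %d\n", calls, nonce)
		if calls < 3 {
			chainNonce++ // another tx from the same account landed meanwhile
			return errors.New("broadcast timeout")
		}
		return nil
	}
	// Skip real waiting in this demo; production code would pass sleepCtx.
	noSleep := func(context.Context, time.Duration) error { return nil }
	err := retryWithFreshNonce(context.Background(), 3, fetch, attempt, noSleep)
	fmt.Println("result:", err)
}
```

Injecting the sleep function keeps the demo instant and makes the loop easy to unit test; passing sleepCtx instead gives the real, cancellable wait.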

Testing:

  • Added 6 unit tests covering retry scenarios
  • Verified fresh nonce refetch on each attempt
  • Tests for timeout, cancellation, max retries, and transaction failures
  • Added build tag //go:build kwiltest to integration test
  • All 19 tests passing
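
For readers unfamiliar with build tags, the gating of the integration test might look roughly like the fragment below. The package name and test name are placeholders, not taken from the PR; only the `//go:build kwiltest` line comes from the description above.

```go
//go:build kwiltest

// Hypothetical contents of an integration test file. The build constraint
// must appear before the package clause and be followed by a blank line.
package tn_digest

import "testing"

// TestAutoDigestIntegration is a placeholder name. The whole file is
// compiled only when the kwiltest tag is supplied to the go command.
func TestAutoDigestIntegration(t *testing.T) {
	t.Log("compiled only when the kwiltest tag is set")
}
```

With this in place, a plain `go test ./extensions/tn_digest/...` skips the file, while `go test -tags kwiltest ./extensions/tn_digest/...` includes it.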

Files Changed:

  • extensions/tn_digest/internal/engine_ops.go - Core retry logic
  • extensions/tn_digest/scheduler/scheduler.go - Scheduler integration
  • extensions/tn_digest/internal/engine_ops_test.go - Comprehensive test coverage
  • extensions/tn_digest/engine_ops_integration_test.go - Build tag fix

This ensures digest operations continue reliably even during network congestion or concurrent transaction activity, eliminating the nonce collision failure mode observed in production.

resolves: https://github.com/trufnetwork/truf-network/issues/1241

Summary by CodeRabbit

  • New Features
    • Auto-digest broadcasts now retry on failure with exponential backoff and a fresh nonce each attempt, respecting cancellation. Scheduler uses the retry-enabled flow with a fixed retry limit and clearer post-retry error messages for improved robustness.
  • Tests
    • Added comprehensive tests covering immediate success, retry paths, max-retry handling, context cancellation, nonce management, and failure scenarios. Test file now requires an opt-in build tag for integration runs.

MicBun requested a review from outerlook on October 2, 2025 07:36
MicBun self-assigned this on Oct 2, 2025

coderabbitai bot commented Oct 2, 2025

Walkthrough

Adds a retry-capable broadcast path for auto_digest transactions that fetches a fresh nonce per attempt, integrates the retry call into the scheduler, guards an integration test with a kwiltest build tag, and adds unit tests exercising retry, nonce handling, and context cancellation.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Retry-enabled auto_digest broadcast (extensions/tn_digest/internal/engine_ops.go) | Adds EngineOperations.BroadcastAutoDigestWithArgsAndRetry with exponential backoff, context cancellation handling, retry loop and logging. Introduces internal broadcastAutoDigestWithFreshNonce helper to fetch fresh nonce, build/sign, broadcast, validate result code, and parse digest logs. |
| Scheduler integration (extensions/tn_digest/scheduler/scheduler.go) | Replaces BroadcastAutoDigestWithArgsAndParse with BroadcastAutoDigestWithArgsAndRetry, supplies a retry limit (3), and updates the error log message to reflect retries. |
| Unit tests for retry flow (extensions/tn_digest/internal/engine_ops_test.go) | Adds mocks and extensive tests for BroadcastAutoDigestWithArgsAndRetry: immediate success, retry-on-error, max-retries exceeded, fresh-nonce per attempt, context cancellation, and transaction failure parsing. |
| Build tag for integration test (extensions/tn_digest/engine_ops_integration_test.go) | Adds //go:build kwiltest build tag to include the integration test only when the kwiltest tag is used; no behavioral changes. |

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant S as Scheduler
  participant E as EngineOperations
  participant A as Accounts (nonce)
  participant B as Broadcaster (fn)
  participant N as Network

  S->>E: BroadcastAutoDigestWithArgsAndRetry(ctx, dsn, signer, broadcastFn, ...)
  loop retry attempts (<= max) and ctx not done
    E->>A: Request fresh nonce
    A-->>E: Nonce
    E->>E: Build & sign tx with nonce + args
    E->>B: call broadcastFn(ctx, tx, chainID)
    alt broadcast success (OK code)
      B-->>E: (hash, txResult, nil)
      E->>E: Validate code, parse digest from logs
      E-->>S: DigestTxResult (success)
    else broadcast failure / timeout / wrong code
      B-->>E: (hash?, txResult?, err)
      E-->>E: Log attempt, backoff, prepare next attempt
    end
    opt ctx cancelled
      E-->>S: return ctx error
    end
  end
  alt exceeded max retries
    E-->>S: return last error
  end
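
One pass through the loop in the diagram above could be sketched like this. It is only an illustration under stated assumptions: txResult, attemptDeps, attemptOnce and fakeDeps are hypothetical names, result code 0 is assumed to mean OK, and log parsing is injected as a function because the real log format is not shown in this conversation.

```go
package main

import (
	"context"
	"fmt"
)

// txResult is a hypothetical stand-in for the node's transaction result.
type txResult struct {
	Code uint32 // 0 is assumed to mean OK
	Log  string
}

// attemptDeps groups the injected steps from the diagram; names are illustrative.
type attemptDeps struct {
	fetchNonce   func(ctx context.Context) (uint64, error)
	buildAndSign func(nonce uint64) ([]byte, error)
	broadcast    func(ctx context.Context, tx []byte) (txResult, error)
	parseLog     func(log string) (string, error)
}

// attemptOnce performs one pass: nonce -> sign -> broadcast -> validate -> parse.
// Any returned error tells the surrounding retry loop to back off and try again.
func attemptOnce(ctx context.Context, d attemptDeps) (string, error) {
	nonce, err := d.fetchNonce(ctx)
	if err != nil {
		return "", fmt.Errorf("fetch nonce: %w", err)
	}
	tx, err := d.buildAndSign(nonce)
	if err != nil {
		return "", fmt.Errorf("build/sign (nonce %d): %w", nonce, err)
	}
	res, err := d.broadcast(ctx, tx)
	if err != nil {
		return "", fmt.Errorf("broadcast (nonce %d): %w", nonce, err)
	}
	if res.Code != 0 {
		// A non-OK code is treated like a failed attempt, as in the diagram.
		return "", fmt.Errorf("tx rejected with code %d: %s", res.Code, res.Log)
	}
	return d.parseLog(res.Log)
}

// fakeDeps builds test doubles whose broadcast returns the given code and log.
func fakeDeps(code uint32, log string) attemptDeps {
	return attemptDeps{
		fetchNonce: func(context.Context) (uint64, error) { return 42, nil },
		buildAndSign: func(nonce uint64) ([]byte, error) {
			return []byte(fmt.Sprintf("tx-%d", nonce)), nil
		},
		broadcast: func(context.Context, []byte) (txResult, error) {
			return txResult{Code: code, Log: log}, nil
		},
		parseLog: func(log string) (string, error) { return log, nil },
	}
}

func main() {
	ctx := context.Background()
	out, err := attemptOnce(ctx, fakeDeps(0, "digest ok"))
	fmt.Println("success case:", out, err)
	_, err = attemptOnce(ctx, fakeDeps(1, "transaction already exists"))
	fmt.Println("failure case:", err)
}
```

Because the helper holds no state of its own, each call starts from whatever nonce the account currently has, which is the property the PR relies on.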

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Suggested reviewers

  • outerlook

Poem

I twitch my whiskers at each retry,
Fresh nonce hops in, we give it a try.
Backoff drumbeat, patient and spry,
Logs tell the tale, we parse the sky.
Three brave hops — if not, we sigh. 🐇✨

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title Check | ✅ Passed | The title clearly summarizes the primary fix (addressing digest prune failures caused by nonce errors) and directly reflects the core change in the PR, making it specific and relevant to the changeset. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%. |

📜 Recent review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 03ed687 and 78c9489.

📒 Files selected for processing (1)
  • extensions/tn_digest/internal/engine_ops.go (2 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: lint
  • GitHub Check: acceptance-test

holdex bot commented Oct 2, 2025

Time Submission Status

| Member | Status | Time | Action | Last Update |
| --- | --- | --- | --- | --- |
| MicBun | ✅ Submitted | 6h | Update time | Oct 2, 2025, 10:55 AM |
| @outerlook | ❌ Missing | - | ⚠️ Submit time | - |

holdex bot commented Oct 2, 2025

Bug Report Checklist

| Status | Commit Link | Bug Author |
| --- | --- | --- |
| ✅ Submitted | commit link | @outerlook |

MicBun commented Oct 2, 2025

@pr-time-tracker bug commit 0eaef43 && bug author @outerlook

coderabbitai bot left a comment

Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 12405f9 and 03ed687.

📒 Files selected for processing (4)
  • extensions/tn_digest/engine_ops_integration_test.go (1 hunks)
  • extensions/tn_digest/internal/engine_ops.go (2 hunks)
  • extensions/tn_digest/internal/engine_ops_test.go (2 hunks)
  • extensions/tn_digest/scheduler/scheduler.go (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
extensions/tn_digest/scheduler/scheduler.go (1)
extensions/tn_digest/scheduler/constants.go (3)
  • DigestDeleteCap (7-7)
  • DigestExpectedRecordsPerStream (8-8)
  • DigestPreservePastDays (9-9)
extensions/tn_digest/internal/engine_ops_test.go (1)
extensions/tn_digest/internal/engine_ops.go (1)
  • EngineOperations (25-30)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: acceptance-test
  • GitHub Check: lint

MicBun merged commit 8942193 into main on Oct 2, 2025
7 of 8 checks passed
MicBun deleted the retryTnDigest branch on October 2, 2025 11:44