
fix(query): deleteOutdatedRecords transaction handling#5656

Open
agusayerza wants to merge 2 commits into master from agus/nan-5045-improve-delete-outdated-record-query

Conversation


@agusayerza agusayerza commented Mar 17, 2026

Fix deleteOutdatedRecords transaction handling

Fixes NAN-5045

Problem

We've been seeing a major increase in persist errors ("Failed to save record" / "Failed to delete record") rooted in DB timeouts, with spikes correlating with periods of high DB stress (CPU >85%, block read times of 10-20 min, IO:DataFileRead wait events).

The main culprit, identified via Datadog APM, is the DELETE .../outdated endpoint (deleteOutdatedRecords): 57.8k requests, 4.43s p95 latency spiking to 24s, and a 0.4% error rate.

The issue is structural:

  1. Single transaction wrapping all batches: for a large customer with 100k outdated records, that's 20 batches of 5000 rows all within one transaction. The DB connection is held for the entire 4-24s duration, and all index updates (4 indexes per row) generate sustained IO/WAL pressure that competes with normal sync operations.
  2. Large batch size (5000) — amplifies all of the above per batch.
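The problematic shape can be sketched as follows (hypothetical helper names; the real code lives in packages/records/lib/models/records.ts and uses Knex):

```typescript
// Illustrative sketch of the structural problem, with made-up helper names:
// a single transaction wraps the whole batch loop, so the DB connection and
// all row locks are held until every batch has finished.
async function deleteOutdatedSingleTxn(
    ids: number[],
    batchSize: number,
    withTransaction: (fn: () => Promise<void>) => Promise<void>,
    deleteBatch: (batch: number[]) => Promise<void>
): Promise<number> {
    let batches = 0;
    await withTransaction(async () => {
        for (let i = 0; i < ids.length; i += batchSize) {
            // Each deleted row also touches 4 indexes, all inside one txn.
            await deleteBatch(ids.slice(i, i + batchSize));
            batches++;
        }
    });
    return batches;
}
```

With 100k outdated records and a batch size of 5000, that is 20 batches' worth of row deletions and index updates before the single commit releases anything.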

Other things to consider:

  1. No advisory lock: both upsert() and deleteRecords() acquire pg_advisory_xact_lock, but this function did not. Since we are only removing old records, I'm not sure this is strictly necessary, to be honest.
  2. No deadlock retry: unlike upsert(), which retries on PostgreSQL deadlock errors (40P01), a deadlock here would fail the entire operation (again, not critical and not something I've seen happen, but it would fail the sync).
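For reference, the advisory-lock key pattern the PR aligns with can be sketched like this (a hedged illustration; Nango's actual newLockId implementation may differ):

```typescript
import { createHash } from "node:crypto";

// Sketch of a newLockId-style helper (the real Nango implementation may
// differ): derive a stable signed 64-bit key from (connectionId, model) so
// that upsert(), deleteRecords(), and deleteOutdatedRecords() all serialize
// on the same pg_advisory_xact_lock for a given connection/model pair.
function newLockId(connectionId: number, model: string): bigint {
    const digest = createHash("sha256").update(`${connectionId}:${model}`).digest();
    // pg_advisory_xact_lock takes a signed 64-bit integer; use the first 8 bytes.
    return digest.readBigInt64BE(0);
}
```

The lock is then taken inside the transaction (e.g. via trx.raw with the key as a binding) and is released automatically at commit or rollback, which is what makes per-batch transactions release it between batches.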

Fix

packages/records/lib/models/records.ts deleteOutdatedRecords:

  • Move the transaction inside the batch loop: each batch of records is now its own independent transaction that acquires the lock, updates rows, updates counts, and commits. This releases the DB connection and all locks between batches, allowing other queries to interleave naturally.
  • Add pg_advisory_xact_lock using the same newLockId(connectionId, model) as upsert() and deleteRecords(), ensuring clean serialization.
  • Add deadlock retry (3 attempts, 500ms delay) matching the upsert() pattern.
  • Reduce default batch size from 5000 to 1000.
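Putting the bullets above together, the new per-batch shape can be sketched as follows (hypothetical helper names; constants taken from the description, not the diff):

```typescript
// Sketch of the fixed shape (hypothetical helper names): each batch runs in
// its own transaction, and a batch is retried up to 3 times on a PostgreSQL
// deadlock (SQLSTATE 40P01) with a 500ms delay, matching the upsert() pattern.
const DEADLOCK = "40P01";
const MAX_ATTEMPTS = 3;
const RETRY_DELAY_MS = 500;

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

async function runBatchWithRetry<T>(runBatch: () => Promise<T>): Promise<T> {
    for (let attempt = 1; ; attempt++) {
        try {
            // In the real code this opens a fresh transaction, takes
            // pg_advisory_xact_lock, deletes one batch of rows, updates
            // counts, and commits before the next batch starts.
            return await runBatch();
        } catch (err) {
            const code =
                typeof err === "object" && err !== null
                    ? (err as { code?: unknown }).code
                    : undefined;
            if (code !== DEADLOCK || attempt >= MAX_ATTEMPTS) throw err;
            await sleep(RETRY_DELAY_MS);
        }
    }
}
```

Because the lock and the connection are released at each commit, other queries (normal sync upserts in particular) can interleave between batches instead of queueing for the full 4-24s.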

The update also introduces partial-progress handling so that if a later batch fails, already-deleted records can still be returned successfully with appropriate logging.


This summary was automatically generated by @propel-code-bot


linear bot commented Mar 17, 2026

    return db.transaction(async (trx) => {
        const now = trx.fn.now(6);
        // Lock to prevent concurrent modifications with upserts and deletes
        await trx.raw(`SELECT pg_advisory_xact_lock(?) as lock_records_outdated`, [newLockId(connectionId, model)]);
agusayerza (Contributor, Author) commented:
Unsure if the lock is really needed to delete stale records


@propel-code-bot propel-code-bot bot left a comment


Suggested guarding error checks to avoid TypeError and preserve retry logic.

Status: Changes Suggested | Risk: Medium

Issues Identified & Suggestions
  • Guard error object before 'code' in err check to avoid TypeError: packages/records/lib/models/records.ts
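The suggested guard can be sketched as follows (a hedged illustration; the bot's exact suggested diff isn't shown in full here):

```typescript
// Sketch of the suggested guard: narrow `err` to a non-null object before
// using the `in` operator, so that `'code' in err` cannot throw a TypeError
// when err is a primitive, null, or undefined.
function isDeadlockError(err: unknown): boolean {
    return (
        typeof err === "object" &&
        err !== null &&
        "code" in err &&
        (err as { code?: unknown }).code === "40P01"
    );
}
```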
Review Details

📁 1 file reviewed | 💬 1 comment

Instruction Files
└── .claude/
    ├── agents/
    │   └── nango-docs-migrator.md
    └── skills/
        ├── agent-builder-skill/
        │   ├── EXAMPLES.md
        │   └── SKILL.md
        ├── creating-integration-docs/
        │   └── SKILL.md
        └── creating-skills-skill/
            └── SKILL.md


@agusayerza agusayerza requested a review from a team March 17, 2026 21:13
    span.setTag('error', err);
    if (deletedIds.length > 0) {
        logger.error(`${e.message}. Partial progress: ${deletedIds.length} records deleted before failure.`);
        return Ok(deletedIds);
A reviewer (Collaborator) commented:
that's problematic. The sync will be considered successful and the webhook sent to the customer accordingly, even though some of the records that are outdated haven't been deleted.

    const e = new Error(`Failed to mark previous generation records as deleted for connection ${connectionId}, model ${model}, generation ${generation}`, {
        cause: err
    });
    span.setTag('error', err);
A reviewer (Collaborator) commented:
I realize here we are passing err instead of e

