Update pg_buffercache_pages.c to avoid crashes #217
Closed
+29
−3
This update avoids crashes when the database is under heavy buffer-mapping contention. The current code does not account for heavy buffer usage: under heavy contention, acquiring the buffer header spinlock can take long enough to trigger a PANIC and crash the server. The change adds error handling with a timeout, plus exponential backoff so that each failed attempt lowers the rate of requests for the buffer header.
Key improvements in this version:
Added a timeout mechanism using MAX_SPIN_ATTEMPTS
Uses TryLockBufHdr instead of spinning indefinitely
Implements exponential backoff up to 1ms when retrying lock acquisition
Yields to other processes every 1000 buffers
Includes better error reporting with buffer details
Checks for interrupts periodically to allow query cancellation
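To illustrate the timeout and backoff items above, here is a minimal standalone sketch, not the code from this patch. It uses a C11 atomic_flag as a stand-in for the buffer header lock. The names DemoBufHdr, demo_try_lock, demo_lock_with_timeout and backoff_delay_us are hypothetical, the MAX_SPIN_ATTEMPTS value is a placeholder, and the doubling schedule (starting at 10μs, capped at 1ms) is an assumption about how the backoff is meant to behave.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L   /* for nanosleep() */
#endif

#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <time.h>

/* Placeholder tunables; the real values would need to match the patch. */
#define MAX_SPIN_ATTEMPTS 1000
#define BASE_SLEEP_US     10L
#define MAX_SLEEP_US      1000L   /* 1ms cap */

/* Hypothetical stand-in for a buffer header that carries a lock bit. */
typedef struct DemoBufHdr
{
    atomic_flag locked;
} DemoBufHdr;

/* Single non-blocking attempt: returns true if the lock was acquired. */
static bool
demo_try_lock(DemoBufHdr *hdr)
{
    return !atomic_flag_test_and_set_explicit(&hdr->locked,
                                              memory_order_acquire);
}

static void
demo_unlock(DemoBufHdr *hdr)
{
    atomic_flag_clear_explicit(&hdr->locked, memory_order_release);
}

/* Delay before the next retry: 10us doubled per attempt, capped at 1ms. */
static long
backoff_delay_us(int attempt)
{
    long delay = BASE_SLEEP_US;

    for (int i = 0; i < attempt && delay < MAX_SLEEP_US; i++)
        delay *= 2;
    return delay > MAX_SLEEP_US ? MAX_SLEEP_US : delay;
}

static void
sleep_us(long us)
{
    struct timespec ts;

    ts.tv_sec = us / 1000000L;
    ts.tv_nsec = (us % 1000000L) * 1000L;
    nanosleep(&ts, NULL);
}

/*
 * Try to take the lock at most max_attempts times, sleeping with
 * exponential backoff between attempts. Returns false on timeout so the
 * caller can report an error instead of spinning forever.
 */
static bool
demo_lock_with_timeout(DemoBufHdr *hdr, int max_attempts)
{
    assert(max_attempts > 0);

    for (int attempt = 0; attempt < max_attempts; attempt++)
    {
        if (demo_try_lock(hdr))
            return true;
        sleep_us(backoff_delay_us(attempt));
    }
    return false;
}
```

A caller would pass MAX_SPIN_ATTEMPTS as the limit and, on a false return, raise an error that includes the buffer details rather than continuing to wait.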
Some notes:
The MAX_SPIN_ATTEMPTS value might need adjustment based on your specific workload
The sleep durations (10μs per attempt, capped at 1ms) might need tuning
The yield interval (1000 buffers) could be adjusted based on system characteristics
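For anyone tuning the yield interval, the rough shape of the scan loop is sketched below. This is a standalone illustration under assumptions, not the patch itself: demo_scan_buffers, DemoScanResult and demo_cancel_requested are hypothetical names, sched_yield() stands in for whatever yield the patch performs, and the cancel flag stands in for PostgreSQL's CHECK_FOR_INTERRUPTS().

```c
#include <sched.h>
#include <signal.h>
#include <stdbool.h>

/* Placeholder for the yield interval mentioned above. */
#define YIELD_INTERVAL 1000

/* Hypothetical cancel flag; a real backend would check for interrupts. */
static volatile sig_atomic_t demo_cancel_requested = 0;

typedef struct DemoScanResult
{
    int  visited;    /* buffers processed */
    int  yields;     /* times the CPU was yielded */
    bool cancelled;  /* scan stopped early on a cancel request */
} DemoScanResult;

/*
 * Walk nbuffers buffers. Every yield_interval buffers, check for a cancel
 * request and yield the CPU so other processes can make progress.
 */
static DemoScanResult
demo_scan_buffers(int nbuffers, int yield_interval)
{
    DemoScanResult result = {0, 0, false};

    for (int i = 0; i < nbuffers; i++)
    {
        if (i > 0 && i % yield_interval == 0)
        {
            if (demo_cancel_requested)
            {
                result.cancelled = true;
                break;
            }
            sched_yield();
            result.yields++;
        }

        /* Per-buffer work (lock header, copy fields, unlock) goes here. */
        result.visited++;
    }
    return result;
}
```

A smaller interval makes the scan more responsive to cancellation and gentler on other backends, at the cost of more yields per scan; a larger interval does the opposite.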
This version is more resilient to spinlock contention and less likely to cause system-wide problems under heavy load. When a buffer header lock cannot be acquired within MAX_SPIN_ATTEMPTS, it reports an error with the buffer details instead of spinning indefinitely.