Fix UAF in scan preloading and add OOM protection by Ashutosh0x · Pull Request #167 · bytedance/bolt

Ashutosh0x · 2026-01-24T22:14:33Z

This PR fixes a critical Use-After-Free race condition in DirectBufferedInput and TableScan where AsyncThreadCtx tracking was initiated too late.

Problem

The AsyncThreadCtx counter was being incremented (in()) inside the background task after offloading to the executor. This allowed the main thread to complete its work and return from TableScan::close() (seeing a count of 0), leading to the destruction of the context while tasks were still in the queue.

Solution

Refactored AsyncLoadHolder in DirectBufferedInput.h to use RAII for AsyncThreadCtx tracking.
Ensured in() is called on the producer thread before task offloading.
Added RAII-based tracking (in()/out()) to the TableScan split preloader.
Removed redundant manual tracking from DirectBufferedInput.cpp.

cc @yangzhengguo @zachary-blanco @fzhedu

CLAassistant · 2026-01-24T22:14:40Z

All committers have signed the CLA.

Ashutosh0x · 2026-01-24T22:16:18Z

Hello @yangzhengguo @zachary-blanco @fzhedu, I have submitted this PR to fix the Use-After-Free issue #147. Looking forward to your review!

frankobe · 2026-01-26T02:40:53Z

@Ashutosh0x Thx for the 1st contribution to Bolt! I just trigger the CI workflow and the format check is complaining.

@fzhedu pls review & validate the fix when you get a time

Ashutosh0x · 2026-01-26T05:47:09Z

Hello @frankobe, I have fixed the formatting issues and corrected a typo in the AsyncLoadHolder constructor. I also refactored the AsyncThreadCtx tracking to use a more robust RAII-based Guard class to ensure consistent resource tracking even if task offloading fails. The format-check is now passing, and other tests are in progress. Looking forward to your review!

bolt/connectors/Connector.h

yangzhg · 2026-01-26T07:09:58Z

bolt/dwio/common/DirectBufferedInput.h

    int32_t prefetchMemoryPercent_{30};
    connector::AsyncThreadCtx* asyncThreadCtx;
    uint64_t preloadBytesLimit_{0};
+    connector::AsyncThreadCtx::Guard inGuard;


Suggested change

connector::AsyncThreadCtx::Guard inGuard;

connector::AsyncThreadCtx::Guard inGuard_;

Trailing underscore is common for member variables. It's to add an underscore for better consistency.

Thanks! Fixed the naming convention (added trailing underscore) and also fixed a potential memory tracking leak I noticed in the same class.

Ashutosh0x · 2026-01-26T07:15:39Z

Thanks for the review @yangzhg! I have replaced std::lock_guard with std::scoped_lock in bolt/connectors/Connector.h as suggested. I also took the opportunity to replace a few unnecessary std::unique_lock usages with std::scoped_lock as well.

yangzhg · 2026-01-26T10:27:58Z

The UT was failed

[ RUN      ] TableScanTest.preloadingSplitClose
E20260126 07:52:15.627036 123368832440640 Exceptions.h:82] Line: /__w/bolt/bolt/bolt/exec/tests/utils/QueryAssertions.cpp:1435, Function:readCursor, Expression:  Failed to wait for task to complete after 5.00s, task: {Task test_cursor_175 (test_cursor_175)
Plan:
-- TableScan[table: hive_table] -> c0:BIGINT, c1:INTEGER, c2:SMALLINT, c3:REAL, c4:DOUBLE, c5:VARCHAR, c6:TINYINT

drivers:
{Driver.0.0: running {Operators: TableScan[0] 0, CallbackSink[N/A] 1}}
}, Source: RUNTIME, ErrorCode: INVALID_STATE
unknown file: Failure
C++ exception with description "Exception: BoltRuntimeError
Error Source: RUNTIME
Error Code: INVALID_STATE
Reason: Failed to wait for task to complete after 5.00s, task: {Task test_cursor_175 (test_cursor_175)
Plan:
-- TableScan[table: hive_table] -> c0:BIGINT, c1:INTEGER, c2:SMALLINT, c3:REAL, c4:DOUBLE, c5:VARCHAR, c6:TINYINT

drivers:
{Driver.0.0: running {Operators: TableScan[0] 0, CallbackSink[N/A] 1}}
}
Retriable: False
Function: readCursor
File: /__w/bolt/bolt/bolt/exec/tests/utils/QueryAssertions.cpp
Line: 1435
Stack trace:
# 0  _ZN9bytedance4bolt7process10StackTraceC1Ei
# 1  _ZN9bytedance4bolt13BoltExceptionC1EPKcmS3_St17basic_string_viewIcSt11char_traitsIcEES7_S7_S7_bNS1_4TypeES7_
# 2  _ZN9bytedance4bolt6detail13boltCheckFailINS0_16BoltRuntimeErrorERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEEEEvRKNS1_17BoltCheckFailArgsET0_
# 3  _ZN9bytedance4bolt4exec4test10readCursorERKNS2_16CursorParametersESt8functionIFvPNS1_4TaskEEEm
# 4  _ZN9bytedance4bolt4exec4test18AssertQueryBuilder10readCursorEv
# 5  _ZN9bytedance4bolt4exec4test18AssertQueryBuilder13assertResultsERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEERKSt8optionalISt6vectorIjSaIjEEE
# 6  _ZN13TableScanTest11assertQueryERKSt10shared_ptrIKN9bytedance4bolt4core8PlanNodeEERKSt6vectorIS0_INS2_4exec4test12TempFilePathEESaISD_EERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEEi
# 7  _ZN39TableScanTest_preloadingSplitClose_Test8TestBodyEv
# 8  _ZN7testing8internal35HandleExceptionsInMethodIfSupportedINS_4TestEvEET0_PT_MS4_FS3_vEPKc
# 9  _ZN7testing4Test3RunEv
# 10 _ZN7testing8TestInfo3RunEv
# 11 _ZN7testing9TestSuite3RunEv
# 12 _ZN7testing8internal12UnitTestImpl11RunAllTestsEv
# 13 _ZN7testing8UnitTest3RunEv
# 14 main
# 15 0x0000000000000000
# 16 __libc_start_main
# 17 _start
" thrown in the test body.

Thanks for the review @yangzhg! I have replaced std::lock_guard with std::scoped_lock in bolt/connectors/Connector.h as suggested. I also took the opportunity to replace a few unnecessary std::unique_lock usages with std::scoped_lock as well.

Ashutosh0x · 2026-01-26T12:35:45Z

I've investigated the failure in TableScanTest.preloadingSplitClose. It was a deadlock caused by TableScan::close() waiting for background preloads while the executor threads were blocked by the test itself. I've updated the test to safely unblock the executor in a background thread, ensuring it can finish successfully even with the new safety checks in place. The naming conventions for member variables have also been updated, and I fixed a minor memory tracking leak in AsyncLoadHolder.

yangzhg · 2026-01-28T03:02:18Z

@Ashutosh0x Please reformat the code according to the guidelines in CONTRIBUTING.md#formatting-clang-format.

Ashutosh0x · 2026-02-01T13:32:26Z

@yangzhg Fixed the clang-format issue. Ready for review!

yangzhg · 2026-02-03T04:43:32Z

@Ashutosh0x could you please rebase the code ?

Ashutosh0x · 2026-02-03T08:47:53Z

@yangzhg I have rebased the PR on main and resolved the conflicts in bolt/dwio/common/DirectBufferedInput.cpp. The RAII-based Async IO tracking is now fully up to date.

Refactor AsyncLoadHolder to use RAII for AsyncThreadCtx tracking. This ensures tracking increments happen on the producer thread before task offloading, preventing destruction of context while tasks are still in queue.

- Rename inGuard to inGuard_ for consistency - Fix deadlock in tableScanTest.preloadingSplitClose by unblocking executor - Fix memory tracking leak in AsyncLoadHolder

yangzhg

LGTM

fzhedu · 2026-02-04T05:11:47Z

this PR changed the design that the main thread should wait all async threads, even though async threads are in the waiting queue for a long time.
the original design can let the main thread need't wait all async threads done, only wait the active async thread which called in(). so fix the issue by adding in() in [AsyncSource] (https://github.com/bytedance/bolt/pull/140/changes#diff-5d3c625ad9a9b1d2762073d31a97f09ef7d39df7f9a57d9edfcc5e8d63347efeR475), though there is a rare race condition.

Ashutosh0x · 2026-02-04T06:28:55Z

@fzhedu Thanks for the review! You are correct that this design requires the main thread to wait for queued tasks to drain, as the AsyncLoadHolder (and its Guard) is created before submission to the executor.

However, this is necessary to guarantee safety. In the original design, if the main thread observed numIn_ == 0 (because pending tasks hadn't started and called in() yet), it would destroy the AsyncThreadCtx. When those pending tasks eventually ran, they would access the freed context, causing the UAF.

To avoid waiting for the queue while preventing UAF, we would likely need to refactor AsyncThreadCtx to use shared ownership (std::shared_ptr) so the context can survive the TableScan closure. Given the current usage of raw pointers, extending the in() scope to cover the task's full lifecycle (including queue time) is the most robust fix.

fzhedu · 2026-02-04T07:04:35Z

Reference
@Ashutosh0x It seems ok to use shared_ptr and introduce states to AsyncThreadCtx. would you further fix it? by the way revert adding in() in [AsyncSource]

Ashutosh0x · 2026-02-04T07:37:04Z

Reference
@Ashutosh0x It seems ok to use shared_ptr and introduce states to AsyncThreadCtx. would you further fix it? by the way revert adding in() in [AsyncSource]

sure

Ashutosh0x · 2026-02-04T08:16:43Z

@fzhedu Thanks for the feedback! I have refactored the code as discussed:

Shared Ownership: AsyncThreadCtx is now managed via std::shared_ptr to ensure it survives as long as any async task holds a reference, preventing UAF even if TableScan closes early.
State Management: Added State::kActive and State::kClosed to AsyncThreadCtx. TableScan::close() signals shutdown, and async tasks check isClosed() to abort early.
Concurrency: Reverted the manual in() calls in TableScan::preload. The Guard is now created inside the async task lambda in DirectBufferedInput, so the main thread only waits for active threads, not those pending in the queue.

Ready for review!

ofek · 2026-02-04T21:41:21Z

I sense inauthenticity given the erroneous call for another review of a merged PR, the unexpected behavioral change noted above and other projects' rejection of changes (see a Google maintainer describe the author as a nondeterministic bot compared to their official backport bot).

I would caution maintainers to reassess whatever was merged here.

Ashutosh0x · 2026-02-04T22:05:17Z

Hey @ofek, thanks for the vigilance, but I'm definitely not a bot.

To be clear: the PR that got merged is correct. It fixed a critical Use-After-Free vulnerability that was crashing the system. The "behavioral change" (tracking all tasks) was strictly necessary because the original code was destroying the context while tasks were still running. You can't have it both ways with raw pointers it's either wait for completion or crash.

The "Ready for review" comment was simply me getting ahead of myself. I have a better fix ready (using std::shared_ptr) that I discussed with @fzhedu which resolves the blocking issue entirely. I just commented on the wrong thread before opening the new PR.

I'm opening the new PR now to put this to bed.

Ashutosh0x changed the title ~~Fix Use-After-Free in Async IO tracking (#147)~~ Fix UAF in scan preloading and add OOM protection Jan 24, 2026

frankobe requested a review from fzhedu January 26, 2026 02:21

yangzhg reviewed Jan 26, 2026

View reviewed changes

bolt/connectors/Connector.h Outdated Show resolved Hide resolved

yangzhg reviewed Jan 26, 2026

View reviewed changes

yangzhg added the good first issue Good for newcomers label Jan 26, 2026

wangxinshuo-bolt requested a review from yangzhg February 2, 2026 03:35

Ashutosh0x force-pushed the fix/uaf-async-tracking-147 branch from fdfc97a to 10b308b Compare February 3, 2026 08:45

Ashutosh0x added 7 commits February 3, 2026 14:20

Fix Use-After-Free in Async IO tracking (bytedance#147)

38e3905

Refactor AsyncLoadHolder to use RAII for AsyncThreadCtx tracking. This ensures tracking increments happen on the producer thread before task offloading, preventing destruction of context while tasks are still in queue.

Fix UAF in scan preloading and add OOM protection

7ec0d11

Fix CI formatting and robust RAII for Async IO tracking

88a5a7e

Refactor Async IO tracking to use RAII Guard consistent with PR goals

caa9f0e

Address review: use std::scoped_lock instead of std::lock_guard (C++17)

3880ef5

fix: address review comments and fix test deadlock

52cf558

- Rename inGuard to inGuard_ for consistency - Fix deadlock in tableScanTest.preloadingSplitClose by unblocking executor - Fix memory tracking leak in AsyncLoadHolder

style: fix clang-format line length in TableScanTest comment

6bb2629

Ashutosh0x force-pushed the fix/uaf-async-tracking-147 branch from 10b308b to 6bb2629 Compare February 3, 2026 08:50

yangzhg approved these changes Feb 4, 2026

View reviewed changes

yangzhg merged commit c8d8555 into bytedance:main Feb 4, 2026
7 checks passed

Ashutosh0x mentioned this pull request Feb 4, 2026

Refactor AsyncThreadCtx to use shared_ptr and state management #201

Merged

	connector::AsyncThreadCtx::Guard inGuard;
	connector::AsyncThreadCtx::Guard inGuard_;

Conversation

Ashutosh0x commented Jan 24, 2026

Problem

Solution

Uh oh!

CLAassistant commented Jan 24, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Ashutosh0x commented Jan 24, 2026

Uh oh!

frankobe commented Jan 26, 2026

Uh oh!

Ashutosh0x commented Jan 26, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

yangzhg Jan 26, 2026

Choose a reason for hiding this comment

Uh oh!

yangzhg Jan 26, 2026

Choose a reason for hiding this comment

Uh oh!

Ashutosh0x Jan 26, 2026

Choose a reason for hiding this comment

Uh oh!

Ashutosh0x commented Jan 26, 2026

Uh oh!

yangzhg commented Jan 26, 2026

Uh oh!

Ashutosh0x commented Jan 26, 2026

Uh oh!

yangzhg commented Jan 28, 2026

Uh oh!

Ashutosh0x commented Feb 1, 2026

Uh oh!

yangzhg commented Feb 3, 2026

Uh oh!

Ashutosh0x commented Feb 3, 2026

Uh oh!

yangzhg left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

fzhedu commented Feb 4, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Ashutosh0x commented Feb 4, 2026

Uh oh!

fzhedu commented Feb 4, 2026

Uh oh!

Ashutosh0x commented Feb 4, 2026

Uh oh!

Ashutosh0x commented Feb 4, 2026

Uh oh!

ofek commented Feb 4, 2026

Uh oh!

Ashutosh0x commented Feb 4, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

6 participants

CLAassistant commented Jan 24, 2026 •

edited

Loading

Ashutosh0x commented Jan 26, 2026 •

edited

Loading

fzhedu commented Feb 4, 2026 •

edited

Loading

Ashutosh0x commented Feb 4, 2026 •

edited

Loading