Fixes hangs when Python future notifier is enabled #590

Merged
jameslamb merged 3 commits into rapidsai:main from pentschev:python-313-fix
Feb 13, 2026

@pentschev
Member

The test_many_servers_many_clients test has occasionally been hanging for a long time when the Python future notifier is enabled. Since Python 3.13.12, it hangs in almost 100% of runs. After some investigation, the hang appears to come from the notifier thread in notifier_thread.py:

  1. With Python future enabled, completed UCX operations are finished by the notifier: the notifier thread schedules _run_request_notifier(worker) on the event loop via run_coroutine_threadsafe. That coroutine calls worker.run_request_notifier(), which sets the asyncio futures that ep.send()/ep.recv() are awaiting.
  2. The notifier thread then calls task.result(0.01) and, on timeout, calls task.cancel(). So whenever the coroutine doesn’t finish within 10 ms, it is cancelled.
  3. In Python 3.13.12, asyncio’s callback ordering changed (see python/cpython#105836, "asyncio.run_coroutine_threadsafe leaves underlying cancelled asyncio task running", and its fix, python/cpython#141696, "gh-105836: Fix asyncio.run_coroutine_threadsafe leaving underlying cancelled asyncio task running"). The event loop now often runs the scheduled coroutine later, so the 10 ms timeout is hit more often. When that happens, the notifier thread cancels the task, run_request_notifier() never runs, the asyncio futures for completed UCX requests are never set, and the test hangs in send/recv or in wait_listener_client_handlers.
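
For illustration, here is a minimal sketch of the schedule, wait, and cancel-on-timeout pattern from steps 1 and 2. It is not the code in notifier_thread.py: notify_once, run_demo, and fake_run_request_notifier are hypothetical names, and the blocked event loop is simulated with time.sleep. Whether the coroutine body still runs after task.cancel() depends on the Python version, so the sketch only reports whether the wait timed out.

```python
import asyncio
import concurrent.futures
import threading
import time


def notify_once(loop, coro_fn, timeout):
    """Hypothetical stand-in for one notifier-thread iteration.

    Schedules the coroutine on `loop` from a non-loop thread, waits up to
    `timeout` seconds, and cancels on timeout. Returns True if the coroutine
    finished in time, False if it was cancelled.
    """
    task = asyncio.run_coroutine_threadsafe(coro_fn(), loop)
    try:
        task.result(timeout)
        return True
    except concurrent.futures.TimeoutError:
        # After this cancel, whether the coroutine body still gets to run
        # depends on the asyncio version in use.
        task.cancel()
        return False


def run_demo(loop_busy_for, timeout=0.01):
    """Run notify_once against a loop that is blocked for `loop_busy_for` seconds."""
    outcome = {}

    async def fake_run_request_notifier():
        # Stand-in for the step that sets the futures awaited by send/recv.
        outcome["futures_set"] = True

    async def main():
        loop = asyncio.get_running_loop()

        def notifier_thread():
            outcome["finished_in_time"] = notify_once(
                loop, fake_run_request_notifier, timeout
            )

        thread = threading.Thread(target=notifier_thread)
        thread.start()
        # Block the event loop so the scheduled coroutine cannot start yet.
        time.sleep(loop_busy_for)
        # Hand control back to the loop until the notifier thread is done.
        while thread.is_alive():
            await asyncio.sleep(0.001)
        # Give any still-scheduled callbacks a chance to run.
        await asyncio.sleep(0.05)

    asyncio.run(main())
    return outcome


if __name__ == "__main__":
    print(run_demo(loop_busy_for=0.0, timeout=1.0))   # loop is responsive
    print(run_demo(loop_busy_for=0.2, timeout=0.01))  # loop is late, task is cancelled
```

When the loop is responsive, the coroutine completes and the stand-in futures are set. When the loop is busy for longer than the timeout, the wait times out and the task is cancelled, which is the path that leads to the hang described in step 3.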

This change resolves the persistent hang in Python 3.13.12, and may also resolve the hangs that previously occurred occasionally in those tests.

Closes #586

@pentschev pentschev self-assigned this Feb 13, 2026
@pentschev pentschev requested review from a team as code owners February 13, 2026 14:09
@pentschev pentschev requested a review from jameslamb February 13, 2026 14:09
@pentschev pentschev added the bug, non-breaking, and ucxx labels Feb 13, 2026
Member

@jameslamb jameslamb left a comment

Great investigation and very clear description, thanks!

@jameslamb
Member

I'll admin-merge this for you since only check-nightly-ci is failing. I'll work on getting that fixed today: rapidsai/shared-actions#94

@jameslamb jameslamb merged commit a0f47f8 into rapidsai:main Feb 13, 2026
77 of 79 checks passed
@pentschev
Member Author

Thanks @wence- for review, and @jameslamb for review and also admin-merging, appreciate it!


Linked issue

Python future notifier hangs in Python 3.13.12