Hi team! I'd like to list a few of the things I'm seeing when scaling Ray Data LLM up to batches of 10 million+ prompts (each prompt itself is also quite long, oftentimes over 10k tokens). I thought this would serve as a helpful place for discussion on the resiliency of Ray Data LLM at large scale.
- Map(_postprocess)->Write: Tasks: 0; Actors: 0; Queued blocks: 0; Resources: 0.0 CPU, 0.0B object store: 100%|██████████| 252/252 [59:42<00:00, 14.2s/ row]
(MapWorker(MapBatches(vLLMEngineStageUDF)) pid=74562, ip=192.168.92.185) frame #4: c10d::ProcessGroupNCCL::Watchdog::runLoop() + 0x978 (0x7f6c58695f48 in /home/ray/anaconda3/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
(MapWorker(MapBatches(vLLMEngineStageUDF)) pid=74562, ip=192.168.92.185) frame #5: c10d::ProcessGroupNCCL::Watchdog::run() + 0xd2 (0x7f6c58697ec2 in /home/ray/anaconda3/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
(MapWorker(MapBatches(vLLMEngineStageUDF)) pid=74562, ip=192.168.92.185) frame #6: <unknown function> + 0xd8198 (0x7f6cb7589198 in /home/ray/anaconda3/bin/../lib/libstdc++.so.6)
(MapWorker(MapBatches(vLLMEngineStageUDF)) pid=74562, ip=192.168.92.185) frame #7: <unknown function> + 0x94ac3 (0x7f6cb99d2ac3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
(MapWorker(MapBatches(vLLMEngineStageUDF)) pid=74562, ip=192.168.92.185) frame #8: clone + 0x44 (0x7f6cb9a63a04 in /usr/lib/x86_64-linux-gnu/libc.so.6)
(MapWorker(MapBatches(vLLMEngineStageUDF)) pid=74562, ip=192.168.92.185)
(MapWorker(MapBatches(vLLMEngineStageUDF)) pid=74562, ip=192.168.92.185) Exception raised from run at /pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:2074 (most recent call first):
(MapWorker(MapBatches(vLLMEngineStageUDF)) pid=74562, ip=192.168.92.185) frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x80 (0x7f6cb517eeb0 in /home/ray/anaconda3/lib/python3.11/site-packages/torch/lib/libc10.so)
(MapWorker(MapBatches(vLLMEngineStageUDF)) pid=74562, ip=192.168.92.185) frame #1: <unknown function> + 0xe1c1a1 (0x7f6c5866f1a1 in /home/ray/anaconda3/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
(MapWorker(MapBatches(vLLMEngineStageUDF)) pid=74562, ip=192.168.92.185) frame #2: <unknown function> + 0x9468e6 (0x7f6c581998e6 in /home/ray/anaconda3/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
(MapWorker(MapBatches(vLLMEngineStageUDF)) pid=74562, ip=192.168.92.185) frame #3: <unknown function> + 0xd8198 (0x7f6cb7589198 in /home/ray/anaconda3/bin/../lib/libstdc++.so.6)
(MapWorker(MapBatches(vLLMEngineStageUDF)) pid=74562, ip=192.168.92.185) frame #4: <unknown function> + 0x94ac3 (0x7f6cb99d2ac3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
(MapWorker(MapBatches(vLLMEngineStageUDF)) pid=74562, ip=192.168.92.185) frame #5: clone + 0x44 (0x7f6cb9a63a04 in /usr/lib/x86_64-linux-gnu/libc.so.6)
(MapWorker(MapBatches(vLLMEngineStageUDF)) pid=74562, ip=192.168.92.185)
(MapWorker(MapBatches(vLLMEngineStageUDF)) pid=74562, ip=192.168.92.185) [rank4]:[E1215 14:21:16.386873484 ProcessGroupNCCL.cpp:2068] [PG ID 2 PG GUID 3 Rank 4] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
(MapWorker(MapBatches(vLLMEngineStageUDF)) pid=74562, ip=192.168.92.185) CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
(MapWorker(MapBatches(vLLMEngineStageUDF)) pid=74562, ip=192.168.92.185) For debugging consider passing CUDA_LAUNCH_BLOCKING=1
(MapWorker(MapBatches(vLLMEngineStageUDF)) pid=74562, ip=192.168.92.185) Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
(MapWorker(MapBatches(vLLMEngineStageUDF)) pid=74562, ip=192.168.92.185)
(MapWorker(MapBatches(vLLMEngineStageUDF)) pid=74562, ip=192.168.92.185) Exception raised from c10_cuda_check_implementation at /pytorch/c10/cuda/CUDAException.cpp:42 (most recent call first):
(MapWorker(MapBatches(vLLMEngineStageUDF)) pid=74562, ip=192.168.92.185) frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x80 (0x7fd66cad9eb0 in /home/ray/anaconda3/lib/python3.11/site-packages/torch/lib/libc10.so)
(MapWorker(MapBatches(vLLMEngineStageUDF)) pid=74562, ip=192.168.92.185) frame #1: <unknown function> + 0x111c7 (0x7fd66cb6c1c7 in /home/ray/anaconda3/lib/python3.11/site-packages/torch/lib/libc10_cuda.so)
(MapWorker(MapBatches(vLLMEngineStageUDF)) pid=74562, ip=192.168.92.185) frame #2: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x50 (0x7fd610083640 in /home/ray/anaconda3/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
(MapWorker(MapBatches(vLLMEngineStageUDF)) pid=74562, ip=192.168.92.185) frame #3: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x68 (0x7fd610092e28 in /home/ray/anaconda3/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
(MapWorker(MapBatches(vLLMEngineStageUDF)) pid=74562, ip=192.168.92.185) frame #4: c10d::ProcessGroupNCCL::Watchdog::runLoop() + 0x978 (0x7fd610095f48 in /home/ray/anaconda3/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
(MapWorker(MapBatches(vLLMEngineStageUDF)) pid=74562, ip=192.168.92.185) frame #5: c10d::ProcessGroupNCCL::Watchdog::run() + 0xd2 (0x7fd610097ec2 in /home/ray/anaconda3/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
(MapWorker(MapBatches(vLLMEngineStageUDF)) pid=74562, ip=192.168.92.185) frame #6: <unknown function> + 0xd8198 (0x7fd66ef3d198 in /home/ray/anaconda3/bin/../lib/libstdc++.so.6)
(MapWorker(MapBatches(vLLMEngineStageUDF)) pid=74562, ip=192.168.92.185) frame #7: <unknown function> + 0x94ac3 (0x7fd671386ac3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
(MapWorker(MapBatches(vLLMEngineStageUDF)) pid=74562, ip=192.168.92.185) frame #8: clone + 0x44 (0x7fd671417a04 in /usr/lib/x86_64-linux-gnu/libc.so.6)
(MapWorker(MapBatches(vLLMEngineStageUDF)) pid=74562, ip=192.168.92.185)
(MapWorker(MapBatches(vLLMEngineStageUDF)) pid=74562, ip=192.168.92.185) terminate called after throwing an instance of 'c10::DistBackendError'
(MapWorker(MapBatches(vLLMEngineStageUDF)) pid=74562, ip=192.168.92.185) what(): [PG ID 2 PG GUID 3 Rank 4] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
(MapWorker(MapBatches(vLLMEngineStageUDF)) pid=74562, ip=192.168.92.185) CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
(MapWorker(MapBatches(vLLMEngineStageUDF)) pid=74562, ip=192.168.92.185) For debugging consider passing CUDA_LAUNCH_BLOCKING=1
(MapWorker(MapBatches(vLLMEngineStageUDF)) pid=74562, ip=192.168.92.185) Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
(MapWorker(MapBatches(vLLMEngineStageUDF)) pid=74562, ip=192.168.92.185)
(MapWorker(MapBatches(vLLMEngineStageUDF)) pid=74562, ip=192.168.92.185) Exception raised from c10_cuda_check_implementation at /pytorch/c10/cuda/CUDAException.cpp:42 (most recent call first):
(MapWorker(MapBatches(vLLMEngineStageUDF)) pid=74562, ip=192.168.92.185) frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x80 (0x7fd66cad9eb0 in /home/ray/anaconda3/lib/python3.11/site-packages/torch/lib/libc10.so)
(MapWorker(MapBatches(vLLMEngineStageUDF)) pid=74562, ip=192.168.92.185) frame #1: <unknown function> + 0x111c7 (0x7fd66cb6c1c7 in /home/ray/anaconda3/lib/python3.11/site-packages/torch/lib/libc10_cuda.so)
(MapWorker(MapBatches(vLLMEngineStageUDF)) pid=74562, ip=192.168.92.185) frame #2: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x50 (0x7fd610083640 in /home/ray/anaconda3/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
(MapWorker(MapBatches(vLLMEngineStageUDF)) pid=74562, ip=192.168.92.185) frame #3: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x68 (0x7fd610092e28 in /home/ray/anaconda3/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
(MapWorker(MapBatches(vLLMEngineStageUDF)) pid=74562, ip=192.168.92.185) frame #4: c10d::ProcessGroupNCCL::Watchdog::runLoop() + 0x978 (0x7fd610095f48 in /home/ray/anaconda3/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
(MapWorker(MapBatches(vLLMEngineStageUDF)) pid=74562, ip=192.168.92.185) frame #5: c10d::ProcessGroupNCCL::Watchdog::run() + 0xd2 (0x7fd610097ec2 in /home/ray/anaconda3/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
(MapWorker(MapBatches(vLLMEngineStageUDF)) pid=74562, ip=192.168.92.185) frame #6: <unknown function> + 0xd8198 (0x7fd66ef3d198 in /home/ray/anaconda3/bin/../lib/libstdc++.so.6)
(MapWorker(MapBatches(vLLMEngineStageUDF)) pid=74562, ip=192.168.92.185) frame #7: <unknown function> + 0x94ac3 (0x7fd671386ac3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
(MapWorker(MapBatches(vLLMEngineStageUDF)) pid=74562, ip=192.168.92.185) frame #8: clone + 0x44 (0x7fd671417a04 in /usr/lib/x86_64-linux-gnu/libc.so.6)
(MapWorker(MapBatches(vLLMEngineStageUDF)) pid=74562, ip=192.168.92.185)
(MapWorker(MapBatches(vLLMEngineStageUDF)) pid=74562, ip=192.168.92.185) Exception raised from run at /pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:2074 (most recent call first):
(MapWorker(MapBatches(vLLMEngineStageUDF)) pid=74562, ip=192.168.92.185) frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x80 (0x7fd66cad9eb0 in /home/ray/anaconda3/lib/python3.11/site-packages/torch/lib/libc10.so)
(MapWorker(MapBatches(vLLMEngineStageUDF)) pid=74562, ip=192.168.92.185) frame #1: <unknown function> + 0xe1c1a1 (0x7fd61006f1a1 in /home/ray/anaconda3/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
(MapWorker(MapBatches(vLLMEngineStageUDF)) pid=74562, ip=192.168.92.185) frame #2: <unknown function> + 0x9468e6 (0x7fd60fb998e6 in /home/ray/anaconda3/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
(MapWorker(MapBatches(vLLMEngineStageUDF)) pid=74562, ip=192.168.92.185) frame #3: <unknown function> + 0xd8198 (0x7fd66ef3d198 in /home/ray/anaconda3/bin/../lib/libstdc++.so.6)
(MapWorker(MapBatches(vLLMEngineStageUDF)) pid=74562, ip=192.168.92.185) frame #4: <unknown function> + 0x94ac3 (0x7fd671386ac3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
(MapWorker(MapBatches(vLLMEngineStageUDF)) pid=74562, ip=192.168.92.185) frame #5: clone + 0x44 (0x7fd671417a04 in /usr/lib/x86_64-linux-gnu/libc.so.6)
(MapWorker(MapBatches(vLLMEngineStageUDF)) pid=74562, ip=192.168.92.185)
2025-12-15 14:21:17,179 ERROR exceptions.py:63 -- Exception occurred in user code, with the abbreviated stack trace below. By default, the Ray Data internal stack trace is omitted from stdout, and only written to the Ray Data log files at `/tmp/ray/session_2025-12-15_13-18-54_437845_1/logs/ray-data`. To output the full stack trace to stdout, set `DataContext.log_internal_stack_trace_to_stdout` to True.
Batch inference failed on attempt 1/3: ray::MapWorker(MapBatches(vLLMEngineStageUDF)).submit() (pid=74562, ip=192.168.92.185, actor_id=6f12dc41d214b60dc4fa2cae02000000, repr=MapWorker(MapBatches(vLLMEngineStageUDF)))
File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/llm/_internal/batch/stages/base.py", line 166, in __call__
async for output in self.udf(inputs):
File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/llm/_internal/batch/stages/vllm_engine_stage.py", line 569, in udf
request, output, time_taken_llm = await resp
^^^^^^^^^^
File "/home/ray/anaconda3/lib/python3.11/asyncio/tasks.py", line 615, in _wait_for_one
return f.result() # May raise f.exception().
^^^^^^^^^^
File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/llm/_internal/batch/stages/vllm_engine_stage.py", line 325, in generate_async
output = await self._generate_async(request)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/llm/_internal/batch/stages/vllm_engine_stage.py", line 409, in generate_async_v1
async for request_output in stream:
File "/home/ray/anaconda3/lib/python3.11/site-packages/vllm/v1/engine/async_llm.py", line 370, in generate
q = await self.add_request(
^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ray/anaconda3/lib/python3.11/site-packages/vllm/v1/engine/async_llm.py", line 276, in add_request
raise EngineDeadError()
vllm.v1.engine.exceptions.EngineDeadError: EngineCore encountered an issue. See stack trace (above) for the root cause.
The above exception was the direct cause of the following exception:
ray::MapWorker(MapBatches(vLLMEngineStageUDF)).submit() (pid=74562, ip=192.168.92.185, actor_id=6f12dc41d214b60dc4fa2cae02000000, repr=MapWorker(MapBatches(vLLMEngineStageUDF)))
File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/execution/operators/actor_pool_map_operator.py", line 552, in submit
yield from _map_task(
File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/execution/operators/map_operator.py", line 564, in _map_task
for b_out in map_transformer.apply_transform(iter(blocks), ctx):
File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/execution/operators/map_transformer.py", line 201, in _udf_timed_iter
output = next(input)
^^^^^^^^^^^
File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/execution/operators/map_transformer.py", line 102, in __call__
yield from self._post_process(results)
File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/execution/operators/map_transformer.py", line 84, in _shape_blocks
for result in results:
File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/execution/operators/map_transformer.py", line 361, in _apply_transform
yield from self._batch_fn(batches, ctx)
File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/planner/plan_udf_map_op.py", line 810, in _transform
raise items
File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/planner/plan_udf_map_op.py", line 731, in _reorder
output_queue.put(await next_task)
^^^^^^^^^^^^^^^
File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/planner/plan_udf_map_op.py", line 651, in _apply_udf
return [out async for out in gen]
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/planner/plan_udf_map_op.py", line 651, in <listcomp>
return [out async for out in gen]
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/planner/plan_udf_map_op.py", line 402, in _wrapped_udf_map_fn
_try_wrap_udf_exception(e, item)
File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/planner/plan_udf_map_op.py", line 443, in _try_wrap_udf_exception
raise UserCodeException("UDF failed to process a data block.") from e
ray.exceptions.UserCodeException: UDF failed to process a data block.
2025-12-12 12:16:09,975 ERROR exceptions.py:73 -- Exception occurred in Ray Data or Ray Core internal code. If you continue to see this error, please open an issue on the Ray project GitHub page with the full stack trace below: https://github.com/ray-project/ray/issues/new/choose
2025-12-12 12:16:09,975 ERROR exceptions.py:81 -- Full stack trace:
Traceback (most recent call last):
File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/exceptions.py", line 49, in handle_trace
return fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^
File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/plan.py", line 533, in execute
blocks = execute_to_legacy_block_list(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/execution/legacy_compat.py", line 127, in execute_to_legacy_block_list
block_list = _bundles_to_block_list(bundles)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/execution/legacy_compat.py", line 175, in _bundles_to_block_list
bundle_list = list(bundles)
^^^^^^^^^^^^^
File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/execution/interfaces/executor.py", line 34, in __next__
return self.get_next()
^^^^^^^^^^^^^^^
File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/execution/streaming_executor.py", line 680, in get_next
bundle = state.get_output_blocking(output_split_idx)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/execution/streaming_executor_state.py", line 375, in get_output_blocking
raise self._exception
File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/execution/streaming_executor.py", line 331, in run
continue_sched = self._scheduling_loop_step(self._topology)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/execution/streaming_executor.py", line 436, in _scheduling_loop_step
num_errored_blocks = process_completed_tasks(
^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/execution/streaming_executor_state.py", line 557, in process_completed_tasks
raise e from None
File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/execution/streaming_executor_state.py", line 524, in process_completed_tasks
bytes_read = task.on_data_ready(
^^^^^^^^^^^^^^^^^^^
File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/execution/interfaces/physical_operator.py", line 157, in on_data_ready
raise ex from None
File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/execution/interfaces/physical_operator.py", line 153, in on_data_ready
ray.get(block_ref)
File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/_private/auto_init_hook.py", line 22, in auto_init_wrapper
return fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^
File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/_private/client_mode_hook.py", line 104, in wrapper
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/_private/worker.py", line 2962, in get
values, debugger_breakpoint = worker.get_objects(
^^^^^^^^^^^^^^^^^^^
File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/_private/worker.py", line 1026, in get_objects
raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(ObjectLostError): ray::MapWorker(MapBatches(vLLMEngineStageUDF)).submit() (pid=370, ip=192.168.6.4, actor_id=ca2ddefc2b91b390984d161002000000, repr=MapWorker(MapBatches(vLLMEngineStageUDF)))
At least one of the input arguments for this task could not be computed:
ray.exceptions.RayTaskError: ray::MapWorker(MapBatches(TokenizeUDF)).submit() (pid=255452, ip=192.168.105.197, actor_id=66ee436a0aedc57d7ee2d33d02000000, repr=MapWorker(MapBatches(TokenizeUDF)))
At least one of the input arguments for this task could not be computed:
ray.exceptions.RayTaskError: ray::MapWorker(MapBatches(ChatTemplateUDF)).submit() (pid=313288, ip=192.168.1.227, actor_id=4b90c5debffc67f074ade23402000000, repr=MapWorker(MapBatches(ChatTemplateUDF)))
At least one of the input arguments for this task could not be computed:
ray.exceptions.RayTaskError: ray::Map(_preprocess)() (pid=167973, ip=192.168.1.227)
At least one of the input arguments for this task could not be computed:
ray.exceptions.ObjectLostError: Failed to retrieve object 07cd5c469a796c96e2b5c0d2a97a2be95099ccfa0200000002000000. To see information about where this ObjectRef was created in Python, set the environment variable RAY_record_ref_creation_sites=1 during `ray start` and `ray.init()`.
All copies of 07cd5c469a796c96e2b5c0d2a97a2be95099ccfa0200000002000000 have been lost due to node failure. Check cluster logs (`/tmp/ray/session_latest/logs`) for more information about the failure.
Description
When using Ray Data LLM with the vLLM processor, there seems to be no actor-level resiliency during vLLM engine failures. When running at high concurrency (for example, 4 replicas of Qwen3-32B-FP8), if one of the vLLM engines dies, I would expect the job to continue on the 3 remaining replicas while the 4th tries to restart itself.
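For context, this is roughly the shape of the pipeline (a sketch following the Ray Data LLM docs; the model, engine settings, paths, and batch size here are illustrative, not my exact config):

```python
# Sketch of a Ray Data LLM batch pipeline with multiple vLLM engine replicas.
# All values below are illustrative placeholders, not the exact production config.
import ray
from ray.data.llm import vLLMEngineProcessorConfig, build_llm_processor

config = vLLMEngineProcessorConfig(
    model_source="Qwen/Qwen3-32B-FP8",
    engine_kwargs={"max_model_len": 16384},
    concurrency=4,   # 4 vLLM engine replicas; today, one engine death fails the whole job
    batch_size=64,
)
processor = build_llm_processor(
    config,
    preprocess=lambda row: dict(
        messages=[{"role": "user", "content": row["prompt"]}],
        sampling_params={"max_tokens": 512},
    ),
    postprocess=lambda row: dict(answer=row["generated_text"]),
)

ds = ray.data.read_parquet("s3://my-bucket/prompts/")  # ~10M long prompts
ds = processor(ds)
ds.write_parquet("s3://my-bucket/outputs/")
```

The expectation is that `concurrency=4` would behave like a pool where a dead member is replaced, rather than a single failure domain.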
As an example, I'm processing millions of prompts in a single RayJob via KubeRay; the RayJob requests 4 GPU instances (p5.48xlarge). Sometimes an engine dies randomly (see the logs above), outside of my control, and the whole job shuts down. It would be very helpful to add resiliency at the engine-failure level: restart the failed engine while the batch continues on the surviving vLLM engines.
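The "attempt 1/3" in the logs above comes from a coarse driver-level retry I wrap around the run (a hypothetical helper sketched below, not a Ray API). It re-runs the entire remaining batch rather than restarting just the dead replica, which is exactly the inefficiency I'd like to avoid:

```python
import time

def run_with_retries(run_batch, max_attempts=3, backoff_s=30):
    """Re-run the whole pipeline on failure. Coarse-grained: a single dead
    vLLM engine costs a full job restart instead of one replica restart."""
    for attempt in range(1, max_attempts + 1):
        try:
            return run_batch()
        except Exception as exc:
            print(f"Batch inference failed on attempt {attempt}/{max_attempts}: {exc}")
            if attempt == max_attempts:
                raise  # out of attempts; surface the last error
            time.sleep(backoff_s)  # give the cluster time to recover
```

With engine-level restarts inside Ray Data LLM, this outer loop would rarely, if ever, trigger.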
Another prevalent case, especially with spot instances: if a spot node goes down, we lose many objects from object store memory and the job fails as well (logs attached above). Is there a recommended way to handle object storage? I'm experimenting with a very powerful CPU head node and trying to allocate storage there, but objects sometimes end up spread across all the nodes in the cluster anyway. Also, why does losing those objects kill the whole job?
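The workaround I'm converging on for spot interruptions is to shard the input into chunks and persist each chunk's output immediately, so a node loss only costs the in-flight chunk. A minimal sketch (the `process_chunk` callable, paths, and chunk size are placeholders; in practice `process_chunk` would run the Ray Data LLM pipeline on one shard):

```python
import os

def run_resumable(prompts, process_chunk, out_dir, chunk_size=100_000):
    """Process prompts in fixed-size chunks, skipping chunks whose output
    already exists, so a restarted job resumes instead of starting over."""
    os.makedirs(out_dir, exist_ok=True)
    done = []
    for start in range(0, len(prompts), chunk_size):
        out_path = os.path.join(out_dir, f"chunk_{start:012d}.txt")
        if os.path.exists(out_path):  # finished in a previous run; skip
            done.append(out_path)
            continue
        results = process_chunk(prompts[start:start + chunk_size])
        tmp = out_path + ".tmp"
        with open(tmp, "w") as f:        # write to a temp file first, then
            f.write("\n".join(results))  # rename, so a partially written
        os.rename(tmp, out_path)         # chunk is never treated as done
        done.append(out_path)
    return done
```

This trades some throughput (one `ds.write_parquet` per chunk instead of one streaming job) for bounded blast radius on node failure, but it would be much nicer if the object store could reconstruct or spill lost blocks instead.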
Logs: the engine-failure trace and the object-lost trace are both included above.
Use case
No response