Hi team! I'd like to list a few of the things I'm seeing when scaling Ray Data LLM up to batches of 10 million+ prompts (each prompt itself is also quite long, oftentimes over 10k tokens). I thought this would serve as a helpful place for discussion on the resiliency of Ray Data LLM at large scale.
- Map(_postprocess)->Write: Tasks: 0; Actors: 0; Queued blocks: 0; Resources: 0.0 CPU, 0.0B object store: 100%|██████████| 252/252 [59:42<00:00, 14.2s/ row]
(MapWorker(MapBatches(vLLMEngineStageUDF)) pid=74562, ip=192.168.92.185) frame #4: c10d::ProcessGroupNCCL::Watchdog::runLoop() + 0x978 (0x7f6c58695f48 in /home/ray/anaconda3/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
(MapWorker(MapBatches(vLLMEngineStageUDF)) pid=74562, ip=192.168.92.185) frame #5: c10d::ProcessGroupNCCL::Watchdog::run() + 0xd2 (0x7f6c58697ec2 in /home/ray/anaconda3/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
(MapWorker(MapBatches(vLLMEngineStageUDF)) pid=74562, ip=192.168.92.185) frame #6: <unknown function> + 0xd8198 (0x7f6cb7589198 in /home/ray/anaconda3/bin/../lib/libstdc++.so.6)
(MapWorker(MapBatches(vLLMEngineStageUDF)) pid=74562, ip=192.168.92.185) frame #7: <unknown function> + 0x94ac3 (0x7f6cb99d2ac3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
(MapWorker(MapBatches(vLLMEngineStageUDF)) pid=74562, ip=192.168.92.185) frame #8: clone + 0x44 (0x7f6cb9a63a04 in /usr/lib/x86_64-linux-gnu/libc.so.6)
(MapWorker(MapBatches(vLLMEngineStageUDF)) pid=74562, ip=192.168.92.185)
(MapWorker(MapBatches(vLLMEngineStageUDF)) pid=74562, ip=192.168.92.185) Exception raised from run at /pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:2074 (most recent call first):
(MapWorker(MapBatches(vLLMEngineStageUDF)) pid=74562, ip=192.168.92.185) frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x80 (0x7f6cb517eeb0 in /home/ray/anaconda3/lib/python3.11/site-packages/torch/lib/libc10.so)
(MapWorker(MapBatches(vLLMEngineStageUDF)) pid=74562, ip=192.168.92.185) frame #1: <unknown function> + 0xe1c1a1 (0x7f6c5866f1a1 in /home/ray/anaconda3/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
(MapWorker(MapBatches(vLLMEngineStageUDF)) pid=74562, ip=192.168.92.185) frame #2: <unknown function> + 0x9468e6 (0x7f6c581998e6 in /home/ray/anaconda3/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
(MapWorker(MapBatches(vLLMEngineStageUDF)) pid=74562, ip=192.168.92.185) frame #3: <unknown function> + 0xd8198 (0x7f6cb7589198 in /home/ray/anaconda3/bin/../lib/libstdc++.so.6)
(MapWorker(MapBatches(vLLMEngineStageUDF)) pid=74562, ip=192.168.92.185) frame #4: <unknown function> + 0x94ac3 (0x7f6cb99d2ac3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
(MapWorker(MapBatches(vLLMEngineStageUDF)) pid=74562, ip=192.168.92.185) frame #5: clone + 0x44 (0x7f6cb9a63a04 in /usr/lib/x86_64-linux-gnu/libc.so.6)
(MapWorker(MapBatches(vLLMEngineStageUDF)) pid=74562, ip=192.168.92.185)
(MapWorker(MapBatches(vLLMEngineStageUDF)) pid=74562, ip=192.168.92.185) [rank4]:[E1215 14:21:16.386873484 ProcessGroupNCCL.cpp:2068] [PG ID 2 PG GUID 3 Rank 4] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
(MapWorker(MapBatches(vLLMEngineStageUDF)) pid=74562, ip=192.168.92.185) CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
(MapWorker(MapBatches(vLLMEngineStageUDF)) pid=74562, ip=192.168.92.185) For debugging consider passing CUDA_LAUNCH_BLOCKING=1
(MapWorker(MapBatches(vLLMEngineStageUDF)) pid=74562, ip=192.168.92.185) Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
(MapWorker(MapBatches(vLLMEngineStageUDF)) pid=74562, ip=192.168.92.185)
(MapWorker(MapBatches(vLLMEngineStageUDF)) pid=74562, ip=192.168.92.185) Exception raised from c10_cuda_check_implementation at /pytorch/c10/cuda/CUDAException.cpp:42 (most recent call first):
(MapWorker(MapBatches(vLLMEngineStageUDF)) pid=74562, ip=192.168.92.185) frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x80 (0x7fd66cad9eb0 in /home/ray/anaconda3/lib/python3.11/site-packages/torch/lib/libc10.so)
(MapWorker(MapBatches(vLLMEngineStageUDF)) pid=74562, ip=192.168.92.185) frame #1: <unknown function> + 0x111c7 (0x7fd66cb6c1c7 in /home/ray/anaconda3/lib/python3.11/site-packages/torch/lib/libc10_cuda.so)
(MapWorker(MapBatches(vLLMEngineStageUDF)) pid=74562, ip=192.168.92.185) frame #2: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x50 (0x7fd610083640 in /home/ray/anaconda3/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
(MapWorker(MapBatches(vLLMEngineStageUDF)) pid=74562, ip=192.168.92.185) frame #3: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x68 (0x7fd610092e28 in /home/ray/anaconda3/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
(MapWorker(MapBatches(vLLMEngineStageUDF)) pid=74562, ip=192.168.92.185) frame #4: c10d::ProcessGroupNCCL::Watchdog::runLoop() + 0x978 (0x7fd610095f48 in /home/ray/anaconda3/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
(MapWorker(MapBatches(vLLMEngineStageUDF)) pid=74562, ip=192.168.92.185) frame #5: c10d::ProcessGroupNCCL::Watchdog::run() + 0xd2 (0x7fd610097ec2 in /home/ray/anaconda3/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
(MapWorker(MapBatches(vLLMEngineStageUDF)) pid=74562, ip=192.168.92.185) frame #6: <unknown function> + 0xd8198 (0x7fd66ef3d198 in /home/ray/anaconda3/bin/../lib/libstdc++.so.6)
(MapWorker(MapBatches(vLLMEngineStageUDF)) pid=74562, ip=192.168.92.185) frame #7: <unknown function> + 0x94ac3 (0x7fd671386ac3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
(MapWorker(MapBatches(vLLMEngineStageUDF)) pid=74562, ip=192.168.92.185) frame #8: clone + 0x44 (0x7fd671417a04 in /usr/lib/x86_64-linux-gnu/libc.so.6)
(MapWorker(MapBatches(vLLMEngineStageUDF)) pid=74562, ip=192.168.92.185)
(MapWorker(MapBatches(vLLMEngineStageUDF)) pid=74562, ip=192.168.92.185) terminate called after throwing an instance of 'c10::DistBackendError'
(MapWorker(MapBatches(vLLMEngineStageUDF)) pid=74562, ip=192.168.92.185) what(): [PG ID 2 PG GUID 3 Rank 4] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
(MapWorker(MapBatches(vLLMEngineStageUDF)) pid=74562, ip=192.168.92.185) CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
(MapWorker(MapBatches(vLLMEngineStageUDF)) pid=74562, ip=192.168.92.185) For debugging consider passing CUDA_LAUNCH_BLOCKING=1
(MapWorker(MapBatches(vLLMEngineStageUDF)) pid=74562, ip=192.168.92.185) Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
(MapWorker(MapBatches(vLLMEngineStageUDF)) pid=74562, ip=192.168.92.185)
(MapWorker(MapBatches(vLLMEngineStageUDF)) pid=74562, ip=192.168.92.185) Exception raised from c10_cuda_check_implementation at /pytorch/c10/cuda/CUDAException.cpp:42 (most recent call first):
(MapWorker(MapBatches(vLLMEngineStageUDF)) pid=74562, ip=192.168.92.185) frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x80 (0x7fd66cad9eb0 in /home/ray/anaconda3/lib/python3.11/site-packages/torch/lib/libc10.so)
(MapWorker(MapBatches(vLLMEngineStageUDF)) pid=74562, ip=192.168.92.185) frame #1: <unknown function> + 0x111c7 (0x7fd66cb6c1c7 in /home/ray/anaconda3/lib/python3.11/site-packages/torch/lib/libc10_cuda.so)
(MapWorker(MapBatches(vLLMEngineStageUDF)) pid=74562, ip=192.168.92.185) frame #2: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x50 (0x7fd610083640 in /home/ray/anaconda3/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
(MapWorker(MapBatches(vLLMEngineStageUDF)) pid=74562, ip=192.168.92.185) frame #3: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x68 (0x7fd610092e28 in /home/ray/anaconda3/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
(MapWorker(MapBatches(vLLMEngineStageUDF)) pid=74562, ip=192.168.92.185) frame #4: c10d::ProcessGroupNCCL::Watchdog::runLoop() + 0x978 (0x7fd610095f48 in /home/ray/anaconda3/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
(MapWorker(MapBatches(vLLMEngineStageUDF)) pid=74562, ip=192.168.92.185) frame #5: c10d::ProcessGroupNCCL::Watchdog::run() + 0xd2 (0x7fd610097ec2 in /home/ray/anaconda3/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
(MapWorker(MapBatches(vLLMEngineStageUDF)) pid=74562, ip=192.168.92.185) frame #6: <unknown function> + 0xd8198 (0x7fd66ef3d198 in /home/ray/anaconda3/bin/../lib/libstdc++.so.6)
(MapWorker(MapBatches(vLLMEngineStageUDF)) pid=74562, ip=192.168.92.185) frame #7: <unknown function> + 0x94ac3 (0x7fd671386ac3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
(MapWorker(MapBatches(vLLMEngineStageUDF)) pid=74562, ip=192.168.92.185) frame #8: clone + 0x44 (0x7fd671417a04 in /usr/lib/x86_64-linux-gnu/libc.so.6)
(MapWorker(MapBatches(vLLMEngineStageUDF)) pid=74562, ip=192.168.92.185)
(MapWorker(MapBatches(vLLMEngineStageUDF)) pid=74562, ip=192.168.92.185) Exception raised from run at /pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:2074 (most recent call first):
(MapWorker(MapBatches(vLLMEngineStageUDF)) pid=74562, ip=192.168.92.185) frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x80 (0x7fd66cad9eb0 in /home/ray/anaconda3/lib/python3.11/site-packages/torch/lib/libc10.so)
(MapWorker(MapBatches(vLLMEngineStageUDF)) pid=74562, ip=192.168.92.185) frame #1: <unknown function> + 0xe1c1a1 (0x7fd61006f1a1 in /home/ray/anaconda3/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
(MapWorker(MapBatches(vLLMEngineStageUDF)) pid=74562, ip=192.168.92.185) frame #2: <unknown function> + 0x9468e6 (0x7fd60fb998e6 in /home/ray/anaconda3/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
(MapWorker(MapBatches(vLLMEngineStageUDF)) pid=74562, ip=192.168.92.185) frame #3: <unknown function> + 0xd8198 (0x7fd66ef3d198 in /home/ray/anaconda3/bin/../lib/libstdc++.so.6)
(MapWorker(MapBatches(vLLMEngineStageUDF)) pid=74562, ip=192.168.92.185) frame #4: <unknown function> + 0x94ac3 (0x7fd671386ac3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
(MapWorker(MapBatches(vLLMEngineStageUDF)) pid=74562, ip=192.168.92.185) frame #5: clone + 0x44 (0x7fd671417a04 in /usr/lib/x86_64-linux-gnu/libc.so.6)
(MapWorker(MapBatches(vLLMEngineStageUDF)) pid=74562, ip=192.168.92.185)
2025-12-15 14:21:17,179 ERROR exceptions.py:63 -- Exception occurred in user code, with the abbreviated stack trace below. By default, the Ray Data internal stack trace is omitted from stdout, and only written to the Ray Data log files at `/tmp/ray/session_2025-12-15_13-18-54_437845_1/logs/ray-data`. To output the full stack trace to stdout, set `DataContext.log_internal_stack_trace_to_stdout` to True.
Batch inference failed on attempt 1/3: ray::MapWorker(MapBatches(vLLMEngineStageUDF)).submit() (pid=74562, ip=192.168.92.185, actor_id=6f12dc41d214b60dc4fa2cae02000000, repr=MapWorker(MapBatches(vLLMEngineStageUDF)))
File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/llm/_internal/batch/stages/base.py", line 166, in __call__
async for output in self.udf(inputs):
File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/llm/_internal/batch/stages/vllm_engine_stage.py", line 569, in udf
request, output, time_taken_llm = await resp
^^^^^^^^^^
File "/home/ray/anaconda3/lib/python3.11/asyncio/tasks.py", line 615, in _wait_for_one
return f.result() # May raise f.exception().
^^^^^^^^^^
File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/llm/_internal/batch/stages/vllm_engine_stage.py", line 325, in generate_async
output = await self._generate_async(request)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/llm/_internal/batch/stages/vllm_engine_stage.py", line 409, in generate_async_v1
async for request_output in stream:
File "/home/ray/anaconda3/lib/python3.11/site-packages/vllm/v1/engine/async_llm.py", line 370, in generate
q = await self.add_request(
^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ray/anaconda3/lib/python3.11/site-packages/vllm/v1/engine/async_llm.py", line 276, in add_request
raise EngineDeadError()
vllm.v1.engine.exceptions.EngineDeadError: EngineCore encountered an issue. See stack trace (above) for the root cause.
The above exception was the direct cause of the following exception:
ray::MapWorker(MapBatches(vLLMEngineStageUDF)).submit() (pid=74562, ip=192.168.92.185, actor_id=6f12dc41d214b60dc4fa2cae02000000, repr=MapWorker(MapBatches(vLLMEngineStageUDF)))
File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/execution/operators/actor_pool_map_operator.py", line 552, in submit
yield from _map_task(
File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/execution/operators/map_operator.py", line 564, in _map_task
for b_out in map_transformer.apply_transform(iter(blocks), ctx):
File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/execution/operators/map_transformer.py", line 201, in _udf_timed_iter
output = next(input)
^^^^^^^^^^^
File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/execution/operators/map_transformer.py", line 102, in __call__
yield from self._post_process(results)
File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/execution/operators/map_transformer.py", line 84, in _shape_blocks
for result in results:
File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/execution/operators/map_transformer.py", line 361, in _apply_transform
yield from self._batch_fn(batches, ctx)
File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/planner/plan_udf_map_op.py", line 810, in _transform
raise items
File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/planner/plan_udf_map_op.py", line 731, in _reorder
output_queue.put(await next_task)
^^^^^^^^^^^^^^^
File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/planner/plan_udf_map_op.py", line 651, in _apply_udf
return [out async for out in gen]
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/planner/plan_udf_map_op.py", line 651, in <listcomp>
return [out async for out in gen]
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/planner/plan_udf_map_op.py", line 402, in _wrapped_udf_map_fn
_try_wrap_udf_exception(e, item)
File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/planner/plan_udf_map_op.py", line 443, in _try_wrap_udf_exception
raise UserCodeException("UDF failed to process a data block.") from e
ray.exceptions.UserCodeException: UDF failed to process a data block.
2025-12-12 12:16:09,975 ERROR exceptions.py:73 -- Exception occurred in Ray Data or Ray Core internal code. If you continue to see this error, please open an issue on the Ray project GitHub page with the full stack trace below: https://github.com/ray-project/ray/issues/new/choose
2025-12-12 12:16:09,975 ERROR exceptions.py:81 -- Full stack trace:
Traceback (most recent call last):
File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/exceptions.py", line 49, in handle_trace
return fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^
File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/plan.py", line 533, in execute
blocks = execute_to_legacy_block_list(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/execution/legacy_compat.py", line 127, in execute_to_legacy_block_list
block_list = _bundles_to_block_list(bundles)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/execution/legacy_compat.py", line 175, in _bundles_to_block_list
bundle_list = list(bundles)
^^^^^^^^^^^^^
File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/execution/interfaces/executor.py", line 34, in __next__
return self.get_next()
^^^^^^^^^^^^^^^
File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/execution/streaming_executor.py", line 680, in get_next
bundle = state.get_output_blocking(output_split_idx)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/execution/streaming_executor_state.py", line 375, in get_output_blocking
raise self._exception
File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/execution/streaming_executor.py", line 331, in run
continue_sched = self._scheduling_loop_step(self._topology)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/execution/streaming_executor.py", line 436, in _scheduling_loop_step
num_errored_blocks = process_completed_tasks(
^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/execution/streaming_executor_state.py", line 557, in process_completed_tasks
raise e from None
File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/execution/streaming_executor_state.py", line 524, in process_completed_tasks
bytes_read = task.on_data_ready(
^^^^^^^^^^^^^^^^^^^
File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/execution/interfaces/physical_operator.py", line 157, in on_data_ready
raise ex from None
File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/execution/interfaces/physical_operator.py", line 153, in on_data_ready
ray.get(block_ref)
File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/_private/auto_init_hook.py", line 22, in auto_init_wrapper
return fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^
File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/_private/client_mode_hook.py", line 104, in wrapper
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/_private/worker.py", line 2962, in get
values, debugger_breakpoint = worker.get_objects(
^^^^^^^^^^^^^^^^^^^
File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/_private/worker.py", line 1026, in get_objects
raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(ObjectLostError): ray::MapWorker(MapBatches(vLLMEngineStageUDF)).submit() (pid=370, ip=192.168.6.4, actor_id=ca2ddefc2b91b390984d161002000000, repr=MapWorker(MapBatches(vLLMEngineStageUDF)))
At least one of the input arguments for this task could not be computed:
ray.exceptions.RayTaskError: ray::MapWorker(MapBatches(TokenizeUDF)).submit() (pid=255452, ip=192.168.105.197, actor_id=66ee436a0aedc57d7ee2d33d02000000, repr=MapWorker(MapBatches(TokenizeUDF)))
At least one of the input arguments for this task could not be computed:
ray.exceptions.RayTaskError: ray::MapWorker(MapBatches(ChatTemplateUDF)).submit() (pid=313288, ip=192.168.1.227, actor_id=4b90c5debffc67f074ade23402000000, repr=MapWorker(MapBatches(ChatTemplateUDF)))
At least one of the input arguments for this task could not be computed:
ray.exceptions.RayTaskError: ray::Map(_preprocess)() (pid=167973, ip=192.168.1.227)
At least one of the input arguments for this task could not be computed:
ray.exceptions.ObjectLostError: Failed to retrieve object 07cd5c469a796c96e2b5c0d2a97a2be95099ccfa0200000002000000. To see information about where this ObjectRef was created in Python, set the environment variable RAY_record_ref_creation_sites=1 during `ray start` and `ray.init()`.
All copies of 07cd5c469a796c96e2b5c0d2a97a2be95099ccfa0200000002000000 have been lost due to node failure. Check cluster logs (`/tmp/ray/session_latest/logs`) for more information about the failure.
Description
When using Ray Data LLM with the vLLM processor, there seems to be no actor-level resiliency during vLLM engine failures. When running at high concurrency (for example, 4 replicas of Qwen3-32B-FP8), if one of the vLLM engines dies, I would expect the job to continue on the 3 remaining replicas while the 4th tries to restart itself.
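For context, this is roughly the shape of the pipeline (a sketch following the Ray Data LLM docs; the model, engine settings, paths, and batch size here are illustrative, not my exact config):

```python
# Sketch of a Ray Data LLM batch pipeline with multiple vLLM engine replicas.
# All values below are illustrative placeholders, not the exact production config.
import ray
from ray.data.llm import vLLMEngineProcessorConfig, build_llm_processor

config = vLLMEngineProcessorConfig(
    model_source="Qwen/Qwen3-32B-FP8",
    engine_kwargs={"max_model_len": 16384},
    concurrency=4,   # 4 vLLM engine replicas; today, one engine death fails the whole job
    batch_size=64,
)
processor = build_llm_processor(
    config,
    preprocess=lambda row: dict(
        messages=[{"role": "user", "content": row["prompt"]}],
        sampling_params={"max_tokens": 512},
    ),
    postprocess=lambda row: dict(answer=row["generated_text"]),
)

ds = ray.data.read_parquet("s3://my-bucket/prompts/")  # ~10M long prompts
ds = processor(ds)
ds.write_parquet("s3://my-bucket/outputs/")
```

The expectation is that `concurrency=4` would behave like a pool where a dead member is replaced, rather than a single failure domain.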
As an example, I'm processing millions of prompts in a single RayJob via KubeRay; the RayJob requests 4 GPU instances (p5.48xlarge). Sometimes an engine dies randomly (see the logs above), outside of my control, and the whole job shuts down. It would be very helpful to add resiliency at the engine-failure level: restart the failed engine while the batch continues on the surviving vLLM engines.
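The "attempt 1/3" in the logs above comes from a coarse driver-level retry I wrap around the run (a hypothetical helper sketched below, not a Ray API). It re-runs the entire remaining batch rather than restarting just the dead replica, which is exactly the inefficiency I'd like to avoid:

```python
import time

def run_with_retries(run_batch, max_attempts=3, backoff_s=30):
    """Re-run the whole pipeline on failure. Coarse-grained: a single dead
    vLLM engine costs a full job restart instead of one replica restart."""
    for attempt in range(1, max_attempts + 1):
        try:
            return run_batch()
        except Exception as exc:
            print(f"Batch inference failed on attempt {attempt}/{max_attempts}: {exc}")
            if attempt == max_attempts:
                raise  # out of attempts; surface the last error
            time.sleep(backoff_s)  # give the cluster time to recover
```

With engine-level restarts inside Ray Data LLM, this outer loop would rarely, if ever, trigger.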
Another prevalent case, especially with spot instances: if a spot node goes down, we lose many objects from object store memory and the job fails as well (logs attached above). Is there a recommended way to handle object storage? I'm experimenting with a very powerful CPU head node and trying to allocate storage there, but objects sometimes end up spread across all the nodes in the cluster anyway. Also, why does losing those objects kill the whole job?
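The workaround I'm converging on for spot interruptions is to shard the input into chunks and persist each chunk's output immediately, so a node loss only costs the in-flight chunk. A minimal sketch (the `process_chunk` callable, paths, and chunk size are placeholders; in practice `process_chunk` would run the Ray Data LLM pipeline on one shard):

```python
import os

def run_resumable(prompts, process_chunk, out_dir, chunk_size=100_000):
    """Process prompts in fixed-size chunks, skipping chunks whose output
    already exists, so a restarted job resumes instead of starting over."""
    os.makedirs(out_dir, exist_ok=True)
    done = []
    for start in range(0, len(prompts), chunk_size):
        out_path = os.path.join(out_dir, f"chunk_{start:012d}.txt")
        if os.path.exists(out_path):  # finished in a previous run; skip
            done.append(out_path)
            continue
        results = process_chunk(prompts[start:start + chunk_size])
        tmp = out_path + ".tmp"
        with open(tmp, "w") as f:        # write to a temp file first, then
            f.write("\n".join(results))  # rename, so a partially written
        os.rename(tmp, out_path)         # chunk is never treated as done
        done.append(out_path)
    return done
```

This trades some throughput (one `ds.write_parquet` per chunk instead of one streaming job) for bounded blast radius on node failure, but it would be much nicer if the object store could reconstruct or spill lost blocks instead.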
Logs: the engine-failure trace and the object-lost trace are both included above.
Use case
No response