When using Transformers’ .generate() with Tensor Parallel (TP), it seems the method is not TP-aware by default.

Hi, I’m trying to understand the workflow for generating and using KV cache with large models (70B) and LMCache. I have a few questions:

1. When using Transformers’ .generate() with Tensor Parallel (TP), the method does not seem to be TP-aware by default. So if I generate a KV cache with Transformers and then try to decode from it, I might run into vocab/GPU mismatch issues. Is that correct?

2. vLLM supports TP-aware prefill and generation, which solves the TP distribution problem. However, I noticed it currently does not support injecting external past_key_values. Does this mean I cannot directly load a KV cache generated by Transformers into vLLM for decoding?

3. For my workflow (generate KV cache → compress with CacheGen/LMCache → decode to test F1 score), what would be the recommended approach? Should I:

   a. stick with Transformers and implement TP-aware generate manually, or
   b. use vLLM for both prefill and decode, even if I cannot reuse an external KV cache?
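To make the last step of the workflow concrete, here is a minimal sketch of the F1 check. It assumes token-overlap F1 (SQuAD-style) between the decoded text and a reference answer; that choice of metric and the function name `token_f1` are assumptions for illustration, not something fixed by CacheGen or LMCache.

```python
from collections import Counter


def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 between a decoded answer and a reference answer."""
    # Simple whitespace tokenization; a real evaluation may also strip
    # punctuation and articles before comparing.
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()

    # If either side is empty, score 1.0 only when both are empty.
    if not pred_tokens or not ref_tokens:
        return float(pred_tokens == ref_tokens)

    # Multiset intersection counts each shared token at most as often
    # as it appears on both sides.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0

    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```

Comparing this score for decoding from the original KV cache against decoding from the compressed one would show how much quality the compression costs.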

withdrawchezingt The role of art in society
../aten/src/ATen/native/cuda/Indexing.cu:1289: indexSelectLargeIndex: block: [237,0,0], thread: [64,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1289: indexSelectLargeIndex: block: [237,0,0], thread: [65,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1289: indexSelectLargeIndex: block: [237,0,0], thread: [66,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1289: indexSelectLargeIndex: block: [237,0,0], thread: [67,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1289: indexSelectLargeIndex: block: [237,0,0], thread: [68,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1289: indexSelectLargeIndex: block: [237,0,0], thread: [69,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1289: indexSelectLargeIndex: block: [237,0,0], thread: [70,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1289: indexSelectLargeIndex: block: [237,0,0], thread: [71,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1289: indexSelectLargeIndex: block: [237,0,0], thread: [72,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1289: indexSelectLargeIndex: block: [237,0,0], thread: [73,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
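
The `srcIndex < srcSelectDimSize` assertion means an index passed to an index-select operation was outside the selected dimension. In a language model this is often an embedding lookup receiving a token ID that is not smaller than the number of embedding rows, though that is an assumption here and not something the log confirms. A minimal sketch of a CPU-side check that could be run before the forward pass (the function name and the example values are hypothetical):

```python
def find_out_of_range_ids(token_ids, num_embeddings):
    """Return (position, token_id) pairs that would fail an embedding lookup."""
    return [
        (pos, tid)
        for pos, tid in enumerate(token_ids)
        if not 0 <= tid < num_embeddings
    ]


# Hypothetical example: an embedding table with 32000 rows and one ID past its end.
bad = find_out_of_range_ids([1, 5, 32000, 7], num_embeddings=32000)
print(bad)  # [(2, 32000)]
```

In a real run, the IDs would come from the tokenizer output and `num_embeddings` from the row count of the embedding weight actually held on the device, which may differ from the full vocabulary size if the embedding is sharded.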
