When using Transformers’ .generate() with Tensor Parallel (TP), it seems the method is not TP-aware by default. #7

@nnbcccscdscdsc

Description

Hi, I’m trying to understand the workflow for generating and using KV cache with large models (70B) and LMCache. I have a few questions:

When using Transformers’ .generate() with Tensor Parallel (TP), it seems the method is not TP-aware by default. So if I generate a KV cache with Transformers and then try to decode from it, I might run into vocab/GPU mismatch issues. Is that correct?

vLLM supports TP-aware prefill and generation, which solves the TP distribution problem. However, I noticed it currently does not support external past_key_values injection. Does this mean I cannot directly load KV cache generated by Transformers into vLLM for decoding?

For my workflow (generate KV cache → compress with CacheGen/LMCache → decode to test F1 score), what would be the recommended approach? Should I:

Stick with Transformers and implement TP-aware generate manually, or
Use vLLM for both prefill and decode, even if I cannot reuse external KV cache?
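
To make the last step of the workflow concrete, here is a minimal sketch of a token-overlap F1 between a decoded answer and a reference answer. It assumes a SQuAD-style definition (whitespace tokens, lowercased); the exact F1 variant is not specified above, and the name `token_f1` is hypothetical.

```python
from collections import Counter


def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 between two strings (assumed SQuAD-style metric)."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()

    # If either side is empty, score 1.0 only when both are empty.
    if not pred_tokens or not ref_tokens:
        return float(pred_tokens == ref_tokens)

    # Multiset intersection counts each shared token at most as often
    # as it appears on both sides.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0

    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```

Comparing this score for decoding from the original KV cache against decoding from the compressed cache would show how much quality the compression costs, whichever engine runs the decode.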

withdrawchezingt The role of art in society
../aten/src/ATen/native/cuda/Indexing.cu:1289: indexSelectLargeIndex: block: [237,0,0], thread: [64,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:1289: indexSelectLargeIndex: block: [237,0,0], thread: [65,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:1289: indexSelectLargeIndex: block: [237,0,0], thread: [66,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:1289: indexSelectLargeIndex: block: [237,0,0], thread: [67,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:1289: indexSelectLargeIndex: block: [237,0,0], thread: [68,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:1289: indexSelectLargeIndex: block: [237,0,0], thread: [69,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:1289: indexSelectLargeIndex: block: [237,0,0], thread: [70,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:1289: indexSelectLargeIndex: block: [237,0,0], thread: [71,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:1289: indexSelectLargeIndex: block: [237,0,0], thread: [72,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:1289: indexSelectLargeIndex: block: [237,0,0], thread: [73,0,0] Assertion srcIndex < srcSelectDimSize failed.
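
The assertion text `srcIndex < srcSelectDimSize` suggests that an index lookup (for example an embedding lookup) received an index at or beyond the size of the indexed dimension. A minimal sketch of a pre-check that could be run on token IDs before they reach the model is below. The helper name `out_of_range_positions` and the sample IDs and size are hypothetical, and whether the failing lookup here is the input embedding is an assumption.

```python
def out_of_range_positions(token_ids, num_embeddings):
    """Return (position, token_id) pairs that a table of size
    num_embeddings cannot index."""
    return [
        (pos, tok)
        for pos, tok in enumerate(token_ids)
        if not 0 <= tok < num_embeddings
    ]


# Hypothetical example: one ID exceeds a 32000-row embedding table.
ids = [1, 15043, 32005, 29871]
bad = out_of_range_positions(ids, num_embeddings=32000)
print(bad)  # [(2, 32005)]
```

If the check finds nothing on the input IDs, the out-of-range index may be produced later, for instance by sampling from corrupted logits after a mismatched KV cache is loaded. Running with the environment variable `CUDA_LAUNCH_BLOCKING=1` usually makes the Python traceback point at the operation that actually failed.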
