Description
Hi, I’m trying to understand the workflow for generating and using KV cache with large models (70B) and LMCache. I have a few questions:
1. When using Transformers’ .generate() with Tensor Parallel (TP), it seems the method is not TP-aware by default. So if I generate KV cache with Transformers and then try to decode it, I might run into vocab/GPU mismatch issues. Is that correct?
2. vLLM supports TP-aware prefill and generation, which solves the TP distribution problem. However, I noticed it currently does not support external past_key_values injection. Does this mean I cannot directly load KV cache generated by Transformers into vLLM for decoding?
3. For my workflow (generate KV cache → compress with CacheGen/LMCache → decode to test F1 score), what would be the recommended approach? Should I:
   a. Stick with Transformers and implement TP-aware generate manually, or
   b. Use vLLM for both prefill and decode, even if I cannot reuse external KV cache?
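For the last step of the workflow (decode, then score), here is a minimal sketch of a token-overlap F1 that can be used to compare the text decoded from the compressed cache against a reference answer. The issue does not say which F1 variant is meant, so this assumes the SQuAD-style bag-of-tokens definition with simple lowercasing and whitespace splitting; `token_f1` is a hypothetical helper name, not part of LMCache, CacheGen, or vLLM.

```python
from collections import Counter


def token_f1(prediction: str, reference: str) -> float:
    """Bag-of-tokens F1 between a decoded string and a reference string.

    Assumption: lowercase + whitespace tokenization; real evaluation
    scripts often also strip punctuation and articles.
    """
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()

    # If either side is empty, score 1.0 only when both are empty.
    if not pred_tokens or not ref_tokens:
        return float(pred_tokens == ref_tokens)

    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0

    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```

Scoring the same prompts once with the uncompressed cache and once with the compressed cache gives a like-for-like comparison, independent of which engine did the prefill.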
withdrawchezingt The role of art in society
../aten/src/ATen/native/cuda/Indexing.cu:1289: indexSelectLargeIndex: block: [237,0,0], thread: [64,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:1289: indexSelectLargeIndex: block: [237,0,0], thread: [65,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:1289: indexSelectLargeIndex: block: [237,0,0], thread: [66,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:1289: indexSelectLargeIndex: block: [237,0,0], thread: [67,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:1289: indexSelectLargeIndex: block: [237,0,0], thread: [68,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:1289: indexSelectLargeIndex: block: [237,0,0], thread: [69,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:1289: indexSelectLargeIndex: block: [237,0,0], thread: [70,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:1289: indexSelectLargeIndex: block: [237,0,0], thread: [71,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:1289: indexSelectLargeIndex: block: [237,0,0], thread: [72,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:1289: indexSelectLargeIndex: block: [237,0,0], thread: [73,0,0] Assertion srcIndex < srcSelectDimSize failed.
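The `srcIndex < srcSelectDimSize` assertion is raised by an index-select kernel when an index falls outside the dimension being indexed. One common cause (an assumption here, not confirmed by the log) is a token id that is greater than or equal to the number of rows in the embedding table a given GPU actually holds, which would fit the vocab/GPU mismatch suspected above. The sketch below is a CPU-side check that can be run on the input ids before they reach the model. Both helper names are hypothetical, and the even vocabulary split in `vocab_shard_range` is an assumed sharding scheme, not necessarily what Transformers or vLLM does.

```python
from typing import List, Sequence, Tuple


def find_out_of_range_ids(
    token_ids: Sequence[int], num_embeddings: int
) -> List[Tuple[int, int]]:
    """Return (position, token_id) pairs that would index outside the table."""
    return [
        (pos, tid)
        for pos, tid in enumerate(token_ids)
        if tid < 0 or tid >= num_embeddings
    ]


def vocab_shard_range(rank: int, tp_size: int, vocab_size: int) -> Tuple[int, int]:
    """Half-open [start, end) range of global token ids owned by one TP rank.

    Assumption: the vocabulary is split into contiguous, equally sized
    chunks, with the last chunk possibly shorter.
    """
    per_rank = -(-vocab_size // tp_size)  # ceiling division
    start = min(rank * per_rank, vocab_size)
    end = min(start + per_rank, vocab_size)
    return start, end


if __name__ == "__main__":
    # Hypothetical numbers for illustration only.
    start, end = vocab_shard_range(rank=0, tp_size=4, vocab_size=128_000)
    bad = find_out_of_range_ids([1, 17, 90_000], num_embeddings=end - start)
    print(f"rank 0 owns ids [{start}, {end}); out-of-range inputs: {bad}")
```

If the check flags ids when run against the per-rank table size but not against the full vocabulary size, the embedding is being indexed with global ids on a sharded table. Running the failing script with `CUDA_LAUNCH_BLOCKING=1` can also help locate the call that triggers the assertion.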