Semantic KV cache reuse for LLM inference engines.
SemBlend extends exact-prefix KV caching (LMCache, vLLM prefix cache, SGLang RadixAttention) with semantic donor discovery. When a new prompt is semantically similar to a prior one but lexically different — different instruction wording, sentence ordering, or template fields — SemBlend finds and reuses the cached KV tensors, eliminating redundant prefill computation.
- Without SemBlend: vLLM + LMCache → 0% hit → full prefill on every request
- With SemBlend: vLLM + LMCache → 30–88% hit → reuse KV from similar past requests
| System | Hit Condition | Semantically Similar Prompts |
|---|---|---|
| vLLM prefix cache | Exact token-level prefix match | Full prefill (0% hit) |
| LMCache alone | Exact 256-token chunk match | Full prefill (0% hit) |
| LMCache + SemBlend | Semantic similarity ≥ 0.60 | Reuse donor KV (30–88% hit) |
SemBlend runs ~8 ms of in-process MiniLM embedding plus cosine search on every request. On a miss this adds at most a few percent of overhead; on a hit it replaces a 2–17 second prefill with sub-second KV retrieval.
Empirical results on A10G GPU, Qwen2.5-7B-AWQ + vLLM 0.14.1 + LMCache.
| Context Length | Cold TTFT | SemBlend Miss (+overhead) | SemBlend Hit | Hit Speedup |
|---|---|---|---|---|
| 4K tokens | 1,859 ms | 1,864 ms (+0.3%) | 801 ms | 2.3x |
| 8K tokens | 3,193 ms | 3,315 ms (+3.8%) | 817 ms | 3.9x |
| 16K tokens | 5,852 ms | 6,064 ms (+3.6%) | 871 ms | 6.7x |
| 32K tokens | 15,418 ms | — | 1,288 ms | 12.0x |
Hit TTFT is consistently ~800ms regardless of context length — it's bounded by KV retrieval, not prefill.
Miss overhead grows with context length (5–212 ms with ~20 donors in the store). Break-even at 8K: SemBlend is net-positive whenever the hit rate exceeds ~5%.
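The break-even point follows directly from the 8K row of the table above: a hit saves roughly 3,193 − 817 ≈ 2,376 ms of prefill, while a miss costs ~122 ms of extra lookup overhead. A quick expected-TTFT calculation using those numbers:

```python
# Break-even hit rate for semantic lookup at 8K tokens,
# using TTFT numbers from the benchmark table above.
cold_ttft = 3193.0   # ms, full prefill (no SemBlend)
miss_ttft = 3315.0   # ms, SemBlend miss (prefill + lookup overhead)
hit_ttft = 817.0     # ms, SemBlend hit (KV retrieval)

overhead = miss_ttft - cold_ttft   # cost paid on every miss
saving = cold_ttft - hit_ttft      # benefit realized on every hit

# SemBlend wins when expected TTFT beats the cold baseline:
#   p*hit + (1-p)*miss < cold  =>  p > overhead / (overhead + saving)
break_even = overhead / (overhead + saving)
print(f"break-even hit rate: {break_even:.1%}")  # ≈ 4.9%
```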
| Workload | Requests | Hit Rate | Hit-only TTFT speedup |
|---|---|---|---|
| Short prompts (≥4K chars) | 250 | 29.2% | 1.63x |
| Long prompts (≥8K chars) | 150 | 30.0% | 1.88x |
Hit rate increases with cosine similarity: 17% at 0.50–0.60 → 60% at 0.90–1.00.
| Dataset | Hit Rate | Overall Speedup | Hit-only Speedup |
|---|---|---|---|
| CNN/DailyMail (summarization) | 50% | 2.29x | 2.39x |
| MultiNews (multi-doc summary) | 75% | 2.23x | 2.23x |
| SAMSum (dialogue summary) | 88% | 2.37x | 2.37x |
Speedups measured over cold vLLM+LMCache baseline (0% hit on semantically different requests).
| Turn | Hit Rate | TTFT Speedup |
|---|---|---|
| Turn 1 (cold) | — | 1.0x (baseline: 4019 ms) |
| Turn 2 | 99.5% | 5.1x (791 ms) |
| Turn 3 | 99.5% | 5.1x (787 ms) |
Multi-turn conversations naturally reuse the same prefix → near-perfect hit rates without any workload tuning.
| Engine | Hit Rate | p50 TTFT | Speedup |
|---|---|---|---|
| SGLang (RadixAttention, no SemBlend) | 0% | 2,265 ms | 0.98x |
| vLLM + LMCache (no SemBlend) | 0% | 3,311 ms | 1.0x |
| vLLM + LMCache + SemBlend | 100% | 1,026 ms | 3.26x |
| SGLang + SemBlend | 100% | 624 ms | 3.71x |
Each sample pair uses the same document and question but different instruction phrasings, shifting all chunk boundaries so exact-prefix caching gets 0% hits.
| Dataset | N | vLLM+SemBlend Hit% | Baseline Hit% | Hit-Only Speedup |
|---|---|---|---|---|
| LongEval (4K synthetic) | 999 | 82.6% | 22.7% | 1.48x |
| WikiText-103 | 230 | 75.7% | 0.8% | 1.55x |
| NarrativeQA | 365 | 29.6% | 18.6% | 2.09x |
| TriviaQA (Wikipedia QA) | 1,000 | 24.8% | 0.1% | 3.73x |
| SCBench (KV cache tasks) | 527 | 17.6% | 3.4% | 5.73x |
| Overall | 3,121 | 46.4% | 11.5% | — |
Baseline = SGLang vanilla (FP16, RadixCache only). SemBlend achieves 4x higher overall hit rate.
SemBlend injects donor KV with RoPE position correction. Output quality impact on high-hit workloads:
| Dataset | PPL ratio (SemBlend / cold) |
|---|---|
| CNN/DailyMail | 1.006 |
| WikiHow | 1.012 |
| XSum | 1.025 |
| MultiNews (hit-only runs) | 1.007 |
PPL ratio ≤ 1.025 on most datasets; elevated ratios (≥1.1) on low-hit runs are due to miss-run variance, not KV injection.
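The RoPE position correction mentioned above exploits the fact that rotary embeddings rotate each key by an angle proportional to its position, so a donor key cached at position `p_src` can be relocated to `p_dst` by applying the rotation for the delta alone. A minimal numpy sketch (illustrative only, single head, one common interleaved pair layout; not SemBlend's actual kernel):

```python
import numpy as np

def rope_shift(keys, delta, base=10000.0):
    """Re-rotate RoPE'd keys by `delta` positions (keys: [seq, head_dim])."""
    seq, dim = keys.shape
    # Per-pair rotation frequencies, as in standard RoPE.
    inv_freq = 1.0 / (base ** (np.arange(0, dim, 2) / dim))
    angle = delta * inv_freq            # same position shift for every key
    cos, sin = np.cos(angle), np.sin(angle)
    x1, x2 = keys[:, 0::2], keys[:, 1::2]
    out = np.empty_like(keys)
    out[:, 0::2] = x1 * cos - x2 * sin  # 2-D rotation of each dim pair
    out[:, 1::2] = x1 * sin + x2 * cos
    return out
```

Because the correction is a pure rotation, it preserves key norms, and shifting by `delta` then `-delta` recovers the original tensor, which is why injection leaves perplexity essentially unchanged on hit-heavy workloads.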
```bash
# Core (CPU-only: numpy + rapidfuzz)
pip install semblend

# With vLLM integration
pip install semblend[vllm]

# With SGLang integration
pip install semblend[sglang]

# With sentence-transformers embedder
pip install semblend[embedder]
```

vLLM integrates via LMCache's `KVConnectorBase_V1` — a first-class public API. No patching required.
```bash
pip install semblend[vllm] vllm lmcache

vllm serve Qwen/Qwen2.5-7B-Instruct-AWQ \
  --kv-transfer-config '{
    "kv_connector": "SemBlendConnectorV1",
    "kv_connector_module_path": "semblend.integration.vllm.connector_v1",
    "kv_role": "kv_both"
  }'
```

Configure via environment variables:
| Variable | Default | Description |
|---|---|---|
| `SEMBLEND_ENABLED` | `1` | Enable semantic donor search |
| `SEMBLEND_MIN_SIMILARITY` | `0.60` | Cosine similarity threshold |
| `SEMBLEND_EMBEDDER` | `minilm` | Embedder type (`minilm`, `jaccard`, `onnx_gpu`) |
| `SEMBLEND_FUZZY_CHUNKS` | `0` | Enable fuzzy chunk matching |
SGLang integrates via a RadixCache patch applied at startup. This is necessary because SGLang's RadixCache.match_prefix does not currently have a hook for semantic fallback lookup — a PR to SGLang is in progress to add a first-class SemanticPrefixProvider interface (analogous to LMCache PR #2803).
```bash
pip install semblend[sglang] sglang

# Option 1: CLI launcher — applies the RadixCache patch automatically
semblend-sglang --model-path Qwen/Qwen2.5-7B-Instruct \
    --host 0.0.0.0 --port 8000
```

```python
# Option 2: Programmatic — call before SGLang initializes
from semblend.integration.sglang.radix_patcher import patch_radix_cache
patch_radix_cache()

import sglang as sgl
# ... start server ...
```

```
Request → MiniLM Embed (5ms) → Cosine Search (1ms) → Align (1ms) → Inject KV
               ↓                       ↓                    ↓
         384-dim vector        Find similar donor    Match chunk boundaries
                               in donor store        via MD5 hash alignment
```
- Embed: Compute a 384-dim MiniLM-L6-v2 embedding (sliding-window for long docs)
- Search: Brute-force cosine similarity against the donor store (<1ms at 1K donors)
- Align: MD5 chunk hashing at 256-token boundaries finds reusable KV chunks
- Inject: Replace target token IDs with donor token IDs — LMCache/RadixCache finds cached KV
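The search and align steps above can be sketched as follows. This is an illustrative reimplementation with numpy and hashlib, not SemBlend's actual internals; the 0.60 threshold and 256-token chunk size come from the configuration defaults above.

```python
import hashlib
import numpy as np

CHUNK_TOKENS = 256     # LMCache chunk size
MIN_SIMILARITY = 0.60  # SEMBLEND_MIN_SIMILARITY default

def chunk_hashes(token_ids):
    """MD5 hash per 256-token chunk, used to align reusable KV chunks."""
    return [
        hashlib.md5(str(token_ids[i:i + CHUNK_TOKENS]).encode()).hexdigest()
        for i in range(0, len(token_ids), CHUNK_TOKENS)
    ]

def find_donor(query_emb, donor_embs):
    """Brute-force cosine search over the donor store.

    query_emb: (dim,) embedding of the incoming prompt.
    donor_embs: (n_donors, dim) matrix of stored prompt embeddings.
    Returns (best_index, similarity), or (None, similarity) below threshold.
    """
    q = query_emb / np.linalg.norm(query_emb)
    d = donor_embs / np.linalg.norm(donor_embs, axis=1, keepdims=True)
    sims = d @ q                      # cosine similarity per donor
    best = int(np.argmax(sims))
    if sims[best] < MIN_SIMILARITY:
        return None, float(sims[best])
    return best, float(sims[best])
```

At ~1K donors this brute-force matrix-vector product stays well under a millisecond, which is why the store needs no approximate-nearest-neighbor index at these sizes.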
SemBlend is most effective for workloads where prompts share a large common context:
- Document Q&A / RAG: Same retrieved documents, different questions
- Summarization: Same article, different instruction phrasing
- Multi-turn dialogue: Conversation history grows across turns
- Code completion: Shared repository context across requests
SemBlend has minimal overhead on dissimilar workloads (e.g. code generation from scratch: measured 0% hit, 0.96x — 4% overhead from embedding + search).
Business Source License 1.1 (BSL-1.1). Free for non-production use including testing, development, evaluation, and academic research. Production use requires a commercial license from WorldFlow AI. Converts to Apache License 2.0 on 2030-03-16.
Contact: research@worldflowai.com