SemBlend

Semantic KV cache reuse for LLM inference engines.

SemBlend extends exact-prefix KV caching (LMCache, vLLM prefix cache, SGLang RadixAttention) with semantic donor discovery. When a new prompt is semantically similar to a prior one but lexically different — different instruction wording, sentence ordering, or template fields — SemBlend finds and reuses the cached KV tensors, eliminating redundant prefill computation.

Without SemBlend:  vLLM + LMCache  →  0% hit  →  full prefill every request
With SemBlend:     vLLM + LMCache  →  30–88% hit  →  reuse KV from similar past requests

What It Does

System	Hit Condition	Semantically Similar Prompts
vLLM prefix cache	Exact token-level prefix match	Full prefill (0% hit)
LMCache alone	Exact 256-token chunk match	Full prefill (0% hit)
LMCache + SemBlend	Semantic similarity ≥ 0.60	Reuse donor KV (30–88% hit)

SemBlend runs about 8 ms of in-process MiniLM embedding and cosine search on every request. On a miss the added TTFT is small (+0.3% to +3.8% in the benchmarks below). On a hit it replaces a 2–17 second prefill with KV retrieval that takes roughly 0.8–1.3 seconds.
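The per-request lookup can be pictured as a thresholded nearest-neighbour search. The sketch below is a minimal illustration, not SemBlend's implementation: `find_donor` and the toy 4-dimensional vectors are hypothetical (real embeddings come from MiniLM), and only the 0.60 threshold is taken from the table above.

```python
import numpy as np

MIN_SIMILARITY = 0.60  # default threshold from the table above


def find_donor(query, donors, min_similarity=MIN_SIMILARITY):
    """Return (index, similarity) of the best donor, or None on a miss.

    `query` is a 1-D embedding and `donors` is a 2-D array with one stored
    embedding per row. Both are L2-normalised here so that the dot product
    equals cosine similarity.
    """
    if len(donors) == 0:
        return None
    q = query / np.linalg.norm(query)
    d = donors / np.linalg.norm(donors, axis=1, keepdims=True)
    sims = d @ q                      # brute-force cosine similarity
    best = int(np.argmax(sims))
    if sims[best] < min_similarity:
        return None                   # miss: fall back to normal prefill
    return best, float(sims[best])


# Toy vectors standing in for real prompt embeddings (hypothetical data).
donors = np.array([
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
])
hit = find_donor(np.array([0.9, 0.1, 0.0, 0.0]), donors)
miss = find_donor(np.array([0.0, 0.0, 1.0, 0.0]), donors)
```

Here `hit` points at the first donor, while `miss` is `None` because no stored vector clears the threshold.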

Performance

Measured on an A10G GPU with Qwen2.5-7B-AWQ, vLLM 0.14.1, and LMCache.

TTFT speedup on hits vs. cold prefill (diverse real content)

Context Length	Cold TTFT	SemBlend Miss (+overhead)	SemBlend Hit	Hit Speedup
4K tokens	1,859 ms	1,864 ms (+0.3%)	801 ms	2.3x
8K tokens	3,193 ms	3,315 ms (+3.8%)	817 ms	3.9x
16K tokens	5,852 ms	6,064 ms (+3.6%)	871 ms	6.7x
32K tokens	15,418 ms	—	1,288 ms	12.0x

Hit TTFT stays near 800 ms from 4K to 16K tokens and rises to about 1.3 s at 32K. It is bounded by KV retrieval, not prefill.

Miss overhead grows with context length (5–212 ms with about 20 donors in the store). At 8K tokens, SemBlend is net-positive once the hit probability exceeds roughly 5%.
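The break-even figure follows from the 8K row of the table. As a sketch (the function name is illustrative): expected TTFT with SemBlend is `p * hit + (1 - p) * miss`, which beats cold prefill once `p` exceeds `(miss - cold) / (miss - hit)`.

```python
def break_even_hit_rate(cold_ms, miss_ms, hit_ms):
    """Smallest hit probability at which expected TTFT matches cold prefill.

    Expected TTFT with SemBlend: p * hit_ms + (1 - p) * miss_ms.
    Setting this equal to cold_ms and solving for p gives the ratio below.
    """
    return (miss_ms - cold_ms) / (miss_ms - hit_ms)


# 8K-token row from the table above.
p_8k = break_even_hit_rate(cold_ms=3193, miss_ms=3315, hit_ms=817)
print(f"break-even hit rate at 8K: {p_8k:.1%}")
```

With the 8K numbers this comes to 122 / 2498, just under 5%.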

Real-world hit rates (WildChat-1M user conversations)

Workload	Requests	Hit Rate	Hit-only TTFT speedup
Short prompts (≥4K chars)	250	29.2%	1.63x
Long prompts (≥8K chars)	150	30.0%	1.88x

Hit rate increases with cosine similarity: 17% at 0.50–0.60 → 60% at 0.90–1.00.

Cross-dataset hit rates and speedups (8K tokens, Qwen2.5-7B)

Dataset	Hit Rate	Overall Speedup	Hit-only Speedup
CNN/DailyMail (summarization)	50%	2.29x	2.39x
MultiNews (multi-doc summary)	75%	2.23x	2.23x
SAMSum (dialogue summary)	88%	2.37x	2.37x

Speedups measured over cold vLLM+LMCache baseline (0% hit on semantically different requests).

Multi-turn dialogue (4K context, 3 turns)

Turn	Hit Rate	TTFT Speedup
Turn 1 (cold)	—	1.0x (baseline: 4019 ms)
Turn 2	99.5%	5.1x (791 ms)
Turn 3	99.5%	5.1x (787 ms)

Multi-turn conversations naturally reuse the same prefix, which gives near-perfect hit rates without any workload tuning.

SGLang comparison (8K, cross-instruction RAG workload)

Engine	Hit Rate	p50 TTFT	Speedup
SGLang (RadixAttention, no SemBlend)	0%	2,265 ms	0.98x
vLLM + LMCache (no SemBlend)	0%	3,311 ms	1.0x
vLLM + LMCache + SemBlend	100%	1,026 ms	3.26x
SGLang + SemBlend	100%	624 ms	3.71x

Extended benchmark suite (cross-instruction pairing, 6,905 samples)

Each sample pair uses the same document and question but different instruction phrasings. This shifts all chunk boundaries, so exact-prefix caching gets few or no hits (see the Baseline Hit% column).

Dataset	N	vLLM+SemBlend Hit%	Baseline Hit%	Hit-Only Speedup
LongEval (4K synthetic)	999	82.6%	22.7%	1.48x
WikiText-103	230	75.7%	0.8%	1.55x
NarrativeQA	365	29.6%	18.6%	2.09x
TriviaQA (Wikipedia QA)	1,000	24.8%	0.1%	3.73x
SCBench (KV cache tasks)	527	17.6%	3.4%	5.73x
Overall	3,121	46.4%	11.5%	—

Baseline = SGLang vanilla (FP16, RadixCache only). SemBlend achieves a roughly 4x higher overall hit rate (46.4% vs. 11.5%).

Quality

SemBlend injects donor KV with RoPE position correction. Output quality impact on high-hit workloads:

Dataset	PPL ratio (SemBlend / cold)
CNN/DailyMail	1.006
WikiHow	1.012
XSum	1.025
MultiNews (hit-only runs)	1.007

PPL ratio ≤ 1.025 on most datasets; elevated ratios (≥1.1) on low-hit runs are due to miss-run variance, not KV injection.
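The idea behind position correction can be illustrated with plain RoPE arithmetic. RoPE rotates each pair of key dimensions by an angle proportional to the token position, and rotations compose, so a key cached at position `p_old` can be moved to `p_new` by rotating it a further `p_new - p_old` steps. The sketch below assumes the interleaved-pair RoPE layout and base 10000; the `rope` helper, positions, and vector size are illustrative, and this is not SemBlend's kernel.

```python
import numpy as np


def rope(x, position, base=10000.0):
    """Apply rotary position embedding to a 1-D vector of even length.

    Dimension pairs (2i, 2i+1) are rotated by position * base**(-2i/d).
    """
    d = x.shape[-1]
    freqs = base ** (-np.arange(0, d, 2) / d)   # one frequency per pair
    angles = position * freqs
    cos, sin = np.cos(angles), np.sin(angles)
    x_even, x_odd = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x_even * cos - x_odd * sin
    out[1::2] = x_even * sin + x_odd * cos
    return out


rng = np.random.default_rng(0)
key = rng.standard_normal(8)

p_old, p_new = 300, 812                       # donor vs. target position
cached = rope(key, p_old)                     # what the donor stored
corrected = rope(cached, p_new - p_old)       # re-rotate by the offset
direct = rope(key, p_new)                     # what a fresh prefill gives

assert np.allclose(corrected, direct)
```

The final assertion checks that re-rotating the cached key gives the same vector as computing it directly at the new position.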

Installation

# Core (CPU-only: numpy + rapidfuzz)
pip install semblend

# With vLLM integration
pip install "semblend[vllm]"

# With SGLang integration
pip install "semblend[sglang]"

# With sentence-transformers embedder
pip install "semblend[embedder]"

Quick Start: vLLM + LMCache

vLLM integration goes through the KVConnectorBase_V1 connector interface used by LMCache, a first-class public API. No patching required.

pip install "semblend[vllm]" vllm lmcache

vllm serve Qwen/Qwen2.5-7B-Instruct-AWQ \
  --kv-transfer-config '{
    "kv_connector": "SemBlendConnectorV1",
    "kv_connector_module_path": "semblend.integration.vllm.connector_v1",
    "kv_role": "kv_both"
  }'
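If shell quoting of the JSON argument is awkward, the same command can be assembled programmatically. This is a minimal sketch using only the values shown above; it builds the argument list and leaves launching to the caller.

```python
import json

# Same values as the command above.
kv_transfer_config = {
    "kv_connector": "SemBlendConnectorV1",
    "kv_connector_module_path": "semblend.integration.vllm.connector_v1",
    "kv_role": "kv_both",
}

cmd = [
    "vllm", "serve", "Qwen/Qwen2.5-7B-Instruct-AWQ",
    "--kv-transfer-config", json.dumps(kv_transfer_config),
]

# To launch: import subprocess; subprocess.run(cmd, check=True)
print(" ".join(cmd[:3]), "...")
```

Passing the arguments as a list avoids a second layer of shell escaping around the JSON string.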

Configure via environment variables:

Variable	Default	Description
`SEMBLEND_ENABLED`	`1`	Enable semantic donor search
`SEMBLEND_MIN_SIMILARITY`	`0.60`	Cosine similarity threshold
`SEMBLEND_EMBEDDER`	`minilm`	Embedder type (`minilm`, `jaccard`, `onnx_gpu`)
`SEMBLEND_FUZZY_CHUNKS`	`0`	Enable fuzzy chunk matching
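For illustration, the table's variables and defaults can be mirrored in a small settings object. `SemBlendSettings` and `load_settings` are hypothetical names for this sketch and are not part of the package. How SemBlend itself parses boolean values is also an assumption here (`1` is treated as on, anything else as off).

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class SemBlendSettings:
    enabled: bool
    min_similarity: float
    embedder: str
    fuzzy_chunks: bool


def load_settings(env=os.environ):
    """Read the documented variables, falling back to the documented defaults."""
    embedder = env.get("SEMBLEND_EMBEDDER", "minilm")
    if embedder not in {"minilm", "jaccard", "onnx_gpu"}:
        raise ValueError(f"unknown SEMBLEND_EMBEDDER: {embedder!r}")
    return SemBlendSettings(
        enabled=env.get("SEMBLEND_ENABLED", "1") == "1",
        min_similarity=float(env.get("SEMBLEND_MIN_SIMILARITY", "0.60")),
        embedder=embedder,
        fuzzy_chunks=env.get("SEMBLEND_FUZZY_CHUNKS", "0") == "1",
    )


defaults = load_settings(env={})
stricter = load_settings(env={"SEMBLEND_MIN_SIMILARITY": "0.75"})
```

Raising the threshold, as in `stricter`, trades hit rate for closer donor matches.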

Quick Start: SGLang

SGLang integrates via a RadixCache patch applied at startup. The patch is needed because SGLang's RadixCache.match_prefix does not currently expose a hook for semantic fallback lookup. A PR to SGLang is in progress to add a first-class SemanticPrefixProvider interface (analogous to LMCache PR #2803).

pip install "semblend[sglang]" sglang

# Option 1: CLI launcher — applies the RadixCache patch automatically
semblend-sglang --model-path Qwen/Qwen2.5-7B-Instruct \
  --host 0.0.0.0 --port 8000

# Option 2: Programmatic — call before SGLang initializes
from semblend.integration.sglang.radix_patcher import patch_radix_cache
patch_radix_cache()
import sglang as sgl
# ... start server ...

How It Works

Request → MiniLM Embed (5ms) → Cosine Search (1ms) → Align (1ms) → Inject KV
              ↓                       ↓                    ↓
         384-dim vector        Find similar donor    Match chunk boundaries
                                in donor store       via MD5 hash alignment

Embed: Compute a 384-dim MiniLM-L6-v2 embedding (sliding-window for long docs)
Search: Brute-force cosine similarity against the donor store (<1ms at 1K donors)
Align: MD5 chunk hashing at 256-token boundaries finds reusable KV chunks
Inject: Replace target token IDs with donor token IDs — LMCache/RadixCache finds cached KV
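The Align step can be sketched as hashing fixed-size token chunks and intersecting the two hash sets. This is an illustration under assumptions, not the library's code: `chunk_hashes` and `reusable_chunks` are hypothetical helpers, token IDs are serialised as 4-byte little-endian integers before hashing, and a trailing partial chunk is ignored.

```python
import hashlib
import struct

CHUNK_TOKENS = 256  # chunk size named in the Align step


def chunk_hashes(token_ids, chunk_tokens=CHUNK_TOKENS):
    """MD5 digest of each full chunk of `token_ids`, in order."""
    hashes = []
    for start in range(0, len(token_ids) - chunk_tokens + 1, chunk_tokens):
        chunk = token_ids[start:start + chunk_tokens]
        payload = struct.pack(f"<{len(chunk)}I", *chunk)
        hashes.append(hashlib.md5(payload).hexdigest())
    return hashes


def reusable_chunks(donor_ids, target_ids, chunk_tokens=CHUNK_TOKENS):
    """Map target chunk index -> donor chunk index for identical chunks."""
    donor_index = {}
    for i, h in enumerate(chunk_hashes(donor_ids, chunk_tokens)):
        donor_index.setdefault(h, i)   # keep the first occurrence
    return {
        j: donor_index[h]
        for j, h in enumerate(chunk_hashes(target_ids, chunk_tokens))
        if h in donor_index
    }


# Toy example with 4-token chunks: the middle chunk differs.
donor = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
target = [1, 2, 3, 4, 50, 60, 70, 80, 9, 10, 11, 12]
matches = reusable_chunks(donor, target, chunk_tokens=4)
```

In the toy example the first and last chunks match, so only the middle chunk of the target would need fresh prefill.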

When SemBlend Helps

SemBlend is most effective for workloads where prompts share a large common context:

Document Q&A / RAG: Same retrieved documents, different questions
Summarization: Same article, different instruction phrasing
Multi-turn dialogue: Conversation history grows across turns
Code completion: Shared repository context across requests

On dissimilar workloads the cost is small: code generation from scratch measured a 0% hit rate and a 0.96x speedup, i.e. about 4% overhead from embedding and search.

License

Business Source License 1.1 (BSL-1.1). Free for non-production use including testing, development, evaluation, and academic research. Production use requires a commercial license from WorldFlow AI. Converts to Apache License 2.0 on 2030-03-16.

Contact: research@worldflowai.com
