fix(gemma2): Add quant_config to embedding layer for GGUF support #30424

Draft

kitaekatt wants to merge 1 commit into vllm-project:main from kitaekatt:fix/gemma2-embedding-quant-config

Conversation

@kitaekatt
Contributor

Summary

  • Adds quant_config parameter to VocabParallelEmbedding in Gemma2 model
  • Enables proper GGUF quantized embedding handling

Problem

Gemma2 GGUF models produce garbage output despite loading successfully:

Input: "Say hello in 5 words"
Output: " GHFW側から ThinkmariKeywords!");\rJahre Iliad幺个人..."

Root Cause

Without quant_config, VocabParallelEmbedding falls back to UnquantizedEmbeddingMethod, which calls F.embedding() directly on the quantized bytes, interpreting them as float values.

This is the exact same bug that was fixed for DeepSeek in commit aa375dca9 ("#12836 - Missing quant_config in deepseek embedding layer").
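For intuition, here is a minimal standalone sketch (illustrative only, not vLLM code) of what goes wrong: raw quantized bytes get reinterpreted as floats and fed straight into F.embedding, so every looked-up row is noise.

```python
# Illustrative sketch only -- not vLLM code. Shows why F.embedding on
# raw quantized bytes yields garbage: the bytes are reinterpreted as
# float32 values instead of being dequantized first.
import torch
import torch.nn.functional as F

vocab_size, hidden_size = 8, 4

# Stand-in for a GGUF-quantized weight buffer: raw uint8 bytes.
raw_bytes = torch.randint(0, 256, (vocab_size, hidden_size * 4), dtype=torch.uint8)

# What UnquantizedEmbeddingMethod effectively does without quant_config:
# treat the byte buffer as if it were a float32 embedding matrix.
misread_weight = raw_bytes.view(torch.float32)  # (8, 4) of nonsense floats

token_ids = torch.tensor([0, 3, 5])
garbage_embeddings = F.embedding(token_ids, misread_weight)
print(garbage_embeddings)  # arbitrary values -> incoherent generations
```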

The Fix

```diff
 self.embed_tokens = VocabParallelEmbedding(
     config.vocab_size,
     config.hidden_size,
+    quant_config=quant_config,
+    prefix=f"{prefix}.embed_tokens",
 )
```
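
For context on why this one-argument change flips behavior: VocabParallelEmbedding picks its weight-handling method from the quant config. The stub below is a simplified, self-contained paraphrase of that dispatch (class names mirror the methods named in this PR, but the bodies here are hypothetical):

```python
# Simplified paraphrase of the dispatch -- stubs, not vLLM's actual code.
class UnquantizedEmbeddingMethod:
    """Treats the weight buffer as plain floats -- wrong for GGUF data."""

class GGUFEmbeddingMethod:
    """Handles GGUF-quantized rows correctly before the gather."""

class GGUFConfig:
    def get_quant_method(self, layer, prefix):
        # A GGUF quant config hands embedding layers a GGUF-aware method.
        return GGUFEmbeddingMethod()

def select_method(quant_config, layer=None, prefix=""):
    if quant_config is None:
        return UnquantizedEmbeddingMethod()  # pre-fix Gemma2 path
    return quant_config.get_quant_method(layer, prefix=prefix)

print(type(select_method(None)).__name__)          # UnquantizedEmbeddingMethod
print(type(select_method(GGUFConfig())).__name__)  # GGUFEmbeddingMethod
```

The prefix argument matters as well: per the commit message, GGUF loading maps quantized tensors by fully qualified weight name, so passing f"{prefix}.embed_tokens" keeps the weight mapping consistent with the checkpoint.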

Comparison

| Model | quant_config passed? | GGUF works? |
| --- | --- | --- |
| Gemma2 | ❌ No (before fix) | ❌ Garbage |
| Gemma3 | ✅ Yes | ✅ Works |
| Llama | ✅ Yes | ✅ Works |
| DeepSeek | ✅ Yes (after #12836) | ✅ Works |

Affected Models

All Gemma2 GGUF variants:

  • gemma-2-2b-it-GGUF
  • gemma-2-9b-GGUF
  • gemma-2-27b-GGUF

Test Plan

  • Gemma2 GGUF model loads without errors
  • Inference produces coherent output (not garbage); see the smoke-test sketch after this list
  • Compare output quality with Gemma3 GGUF or safetensors Gemma2
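
A minimal smoke test along these lines could exercise the fix (a sketch: the GGUF path is a placeholder, and pairing an external tokenizer with a GGUF file follows vLLM's usual GGUF usage):

```python
# Smoke-test sketch for the fix. The model path is a placeholder; any
# Gemma2 GGUF file should do. The tokenizer is loaded separately, as
# GGUF files are typically paired with the original HF tokenizer in vLLM.
from vllm import LLM, SamplingParams

llm = LLM(
    model="./gemma-2-2b-it-Q4_K_M.gguf",   # placeholder local GGUF path
    tokenizer="google/gemma-2-2b-it",
)
outputs = llm.generate(
    ["Say hello in 5 words"],
    SamplingParams(temperature=0.0, max_tokens=16),
)
print(outputs[0].outputs[0].text)  # expect coherent English, not garbage
```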

🤖 Generated with Claude Code

@gemini-code-assist (bot) left a comment

Code Review

This pull request addresses a critical bug where Gemma2 GGUF models would produce garbage output. The root cause was correctly identified: the VocabParallelEmbedding layer was missing the quant_config parameter, causing quantized weights to be misinterpreted. The provided fix, which passes both quant_config and the layer prefix to the embedding layer, is correct and aligns with the implementation in other models.

While reviewing, I noticed that vllm/model_executor/models/gemma.py seems to have the same vulnerability. Applying a similar fix there would likely resolve the issue for Gemma1 GGUF models as well, ensuring consistent behavior across the Gemma model family. This is a great fix, well done!
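
If gemma.py does share the gap, the analogous change would presumably mirror this PR's diff; a sketch against the Gemma1 embedding construction (unverified against the current vllm source):

```diff
 self.embed_tokens = VocabParallelEmbedding(
     config.vocab_size,
     config.hidden_size,
+    quant_config=quant_config,
+    prefix=f"{prefix}.embed_tokens",
 )
```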

@kitaekatt force-pushed the fix/gemma2-embedding-quant-config branch from d41b71f to 5837f8e on December 15, 2025 at 20:30
@kitaekatt marked this pull request as ready for review on December 15, 2025 at 20:30
@chatgpt-codex-connector

Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits.

@kitaekatt marked this pull request as draft on December 15, 2025 at 20:37
@kitaekatt force-pushed the fix/gemma2-embedding-quant-config branch from 5837f8e to 472b770 on December 29, 2025 at 20:43
@kitaekatt force-pushed the fix/gemma2-embedding-quant-config branch from 472b770 to 096b878 on January 19, 2026 at 17:27
The Gemma2 model was missing the quant_config parameter in the
VocabParallelEmbedding initialization, causing GGUF quantized
embeddings to be misinterpreted as float values.

Without quant_config, GGUF models use UnquantizedEmbeddingMethod
which calls F.embedding() directly on quantized bytes, resulting
in garbage output during inference.

This is the same bug that was fixed for DeepSeek in commit aa375dc
("Missing quant_config in deepseek embedding layer (vllm-project#12836)").

The fix adds:
- quant_config parameter to enable GGUFEmbeddingMethod selection
- prefix parameter for proper weight mapping

Fixes Gemma2 GGUF models (gemma-2-2b-it-GGUF, etc.) producing garbage
output like: " GHFW側から ThinkmariKeywords!")...

Signed-off-by: Christina <truffle@gmail.com>
@kitaekatt force-pushed the fix/gemma2-embedding-quant-config branch from 096b878 to 530a043 on February 5, 2026 at 00:18