
Hi, I'm Christina Norman

I'm a 20+ year games industry veteran (BioWare: Mass Effect 1-3; Riot Games: League of Legends, Wild Rift) and an entrepreneur (founder of Elodie Games). My strengths include game engineering, game design, and studio leadership.

Current Focus

  • Consultant focused on increasing developer productivity by creating AI-enhanced workflows (Amazon, Spryfox, Roboto Games).
  • Improving the performance of locally hosted LLM infrastructure (vLLM): GGUF model support, memory optimization, and inference performance.

Articles

Articles I've written on Claude Code and developer productivity.

Claude and Open Source Contributions

Claude Code

As an active contributor to Anthropic's Claude Code project, I've filed feature requests and submitted bug reports that have resulted in tangible improvements to the developer experience.

2 feature requests implemented
  1. #19541 — Per-terminal session affinity for --continue. --continue resumed the most recent session globally, breaking multi-terminal workflows — restarting in one terminal would pick up a different terminal's session. Filed a proposal with a terminal identifier priority table covering iTerm, Kitty, Windows Terminal, tmux, and others. Sessions now display a resume command with session ID on exit (e.g. claude --resume <session-id>), giving users explicit control over which session to continue. (A terminal-detection sketch follows these lists.)
  2. #13412 — "Shell cwd was reset" message noise. Users working across multiple repositories from a central config repo saw this message after every Bash command run outside the project root, making output hard to read. Filed a request to make it suppressible. Fixed by @ltawfik.
2 bugs reported and fixed
  1. #20409 — Silent plugin skill registration failure. Unknown fields in plugin.json caused skills to silently fail to register — the plugin appeared loaded but skills weren't discoverable, with no error surfaced. Filed a report with a disclosure principles framework and proposal for warning badges and /doctor integration. Fixed by @blois.
  2. #12031 — PreToolUse hooks stripped AskUserQuestion answers. Any active PreToolUse hook caused the user's selection to be silently dropped. Filed a detailed report with a testing matrix isolating the bug to PreToolUse specifically (PostToolUse and SessionStart were unaffected). Fixed in v2.0.76.
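
To give a feel for the terminal-affinity idea behind #19541: most terminals expose an environment variable that identifies the current window or session, so a CLI can scope its "most recent session" lookup per terminal. The sketch below is my illustration of that proposal, not Claude Code's implementation; the environment variable names are real, but the priority order and function are assumptions.

```python
import os

# Ordered probe of terminal-identity environment variables, loosely following
# the priority-table idea from the #19541 proposal (illustrative only).
_TERMINAL_ID_VARS = [
    "ITERM_SESSION_ID",   # iTerm2
    "KITTY_WINDOW_ID",    # Kitty
    "WT_SESSION",         # Windows Terminal
    "TMUX_PANE",          # tmux
    "TERM_SESSION_ID",    # macOS Terminal.app and some others
]

def terminal_identity() -> str | None:
    """Return a stable identifier for the current terminal, if one exists."""
    for var in _TERMINAL_ID_VARS:
        value = os.environ.get(var)
        if value:
            return f"{var}={value}"
    return None  # unknown terminal: fall back to the global most-recent session

# A --continue implementation could key its "most recent session" lookup on
# terminal_identity(), so each terminal resumes its own session.
```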

vLLM (High-throughput LLM Inference Engine)

While working to get Gemma2, Gemma3, and other quantized models loading and running correctly in vLLM — particularly on Blackwell hardware (RTX 5090) — I traced and fixed a series of bugs in the GGUF backend: multi-process hangs, config mapping gaps, dtype conflicts, weight loading errors, and missing architecture support.

3 PRs merged
  1. #30209 — Skip generation config fallback for GGUF to prevent multi-process hang. Loading GGUF models in multi-process mode (V1 engine) caused an indefinite hang — both the EngineCore and APIServer processes tried to memory-map the same GGUF file. Fix skips the fallback entirely since GGUF files embed config in the file header.

  2. #30407 — Add memory barriers for cross-process shared memory visibility. Shared memory broadcast caused data races across process boundaries in multi-process inference. Added ordering guarantees to ensure correct visibility of shared state.

  3. #30408 — Disable bfloat16 for GGUF on Blackwell. GGUF models on Blackwell GPUs (RTX 5090, SM 120+) produced incorrect output because bfloat16 causes precision issues with quantized weights on this architecture. Fix defaults GGUF to float16 on Blackwell with a warning when bfloat16 is explicitly requested.
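
As a rough illustration of the #30408 behavior (not the actual vLLM code), the fallback amounts to checking the GPU's compute capability and refusing bfloat16 for GGUF weights there; the SM 120+ threshold check below is my assumption about how Blackwell detection could look.

```python
import warnings
import torch

def resolve_gguf_dtype(requested: torch.dtype) -> torch.dtype:
    """Pick a dtype for GGUF weights, avoiding bfloat16 on Blackwell.

    Illustrative sketch of the #30408 behavior, not the vLLM implementation.
    """
    major, _minor = torch.cuda.get_device_capability()
    is_blackwell = major >= 12  # SM 120+, e.g. RTX 5090
    if is_blackwell and requested == torch.bfloat16:
        warnings.warn(
            "bfloat16 with GGUF weights loses precision on Blackwell; "
            "falling back to float16."
        )
        return torch.float16
    return requested
```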

4 PRs open
  1. #30409 — Lazy tokenizer init to prevent GGUF semaphore leak. Repeated GGUF model loading/unloading exhausted system semaphores due to eager tokenizer initialization in StructuredOutputManager. Fix defers tokenizer init until first use.

  2. #30410 — Auto-select compatible dtype for GGUF on Blackwell. Gemma2/Gemma3 GGUF models on Blackwell hit a dtype deadlock: float16 causes numerical instability in Gemma, bfloat16 causes precision issues with GGUF on Blackwell. Fix adds _resolve_dtype_conflict() to auto-select float32 when both are disallowed (see the sketch after this list).

  3. #30412 — Skip lm_head mapping for models with tied word embeddings. GGUF loading failed with RuntimeError: Failed to map GGUF parameters: ['lm_head.weight'] for models like Gemma2 that share weights between input embeddings and output projection. Fix adds lm_head.weight to sideload params when tie_word_embeddings=True.

  4. #30434 — Use EOS token ID from GGUF metadata instead of HF tokenizer. Gemma 3 GGUF models never stopped generating — the model emitted <end_of_turn> (token 106) but vLLM waited for the HF tokenizer's EOS (token 1), resulting in repeated EOS tokens until max_tokens. Fix reads the correct EOS from GGUF metadata.
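
To make the dtype-deadlock logic in #30410 concrete: when the requested dtype is ruled out, walk a preference order and fall back to float32 if both 16-bit types are disallowed. The function name comes from the PR description; the signature and body below are a minimal sketch of the idea, not the submitted code.

```python
import torch

def resolve_dtype_conflict(
    requested: torch.dtype,
    disallowed: set[torch.dtype],
) -> torch.dtype:
    """Sketch of the _resolve_dtype_conflict() idea from #30410."""
    if requested not in disallowed:
        return requested
    for candidate in (torch.bfloat16, torch.float16, torch.float32):
        if candidate not in disallowed:
            return candidate
    return torch.float32  # always representable, if slower

# Gemma2/Gemma3 GGUF on Blackwell: float16 and bfloat16 are both disallowed.
assert resolve_dtype_conflict(
    torch.bfloat16, {torch.float16, torch.bfloat16}
) == torch.float32
```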

10 PRs in progress
  1. #30411 — Ensure Gemma2 configs have hidden_act for backward compatibility. Gemma2 GGUF loading hit AttributeError: 'Gemma2Config' has no attribute 'hidden_act' because Transformers uses hidden_activation while vLLM accesses hidden_act directly. Fix copies the value across.

  2. #30413 — Add missing rotary positional embeddings to Nemotron-H attention layers. Nemotron-H models loaded successfully but generated corrupted output because the attention class had no RoPE initialization. Without positional information, attention scores were meaningless. Fix adds full rotary embedding support.

  3. #30421 — Skip missing parameters during GGUF Gemma2 weight loading. GGUF loader yielded qweight_type metadata for all quantized tensors including embeddings, but VocabParallelEmbedding doesn't have those parameters, causing a KeyError. Fix adds a safety check matching the existing pattern in llama.py.

  4. #30423 — Make GGUFMoEMethod.apply() parameters optional. GGUF MoE models (e.g., Qwen3-30B) failed because GGUFMoEMethod.apply() required top_k and renormalize arguments that were never passed by the caller and not used in the method body.

  5. #30424 — Add quant_config to Gemma2 embedding layer for GGUF support. Gemma2 GGUF models loaded successfully but produced garbage output because the embedding layer lacked quant_config, causing F.embedding() to interpret quantized bytes as float values. Same bug previously fixed for DeepSeek in #12836.

  6. #30427 — Extract attn_logit_softcapping from GGUF metadata. Gemma2 GGUF models produced garbage because FlashAttention used softcap=0 (disabled) without this parameter, causing numerical instability. The attention backends already supported softcap — this was a config mapping gap.

  7. #30500 — Extract HF config from GGUF metadata for repos without config.json. GGUF repos like bartowski's caused vLLM to fail at model loading. Fix adds a GGUF config parser that constructs HuggingFace-compatible config from GGUF metadata fields.

  8. #30699 — Skip missing parameters during GGUF Gemma2 weight loading. Targeted resubmission of #30421 fix — adds safety check in Gemma2Model.load_weights() to skip parameters not in params_dict.

  9. #30702 — Handle missing config.json in speculator probe for GGUF models. The speculator probe tried to load config.json before GGUF handling ran, failing at engine init. More targeted fix than #30500, following reviewer feedback that Transformers already handles GGUF config extraction.

  10. #31464 — Apply RMSNorm weight correction for Gemma2 GGUF models. Gemma2 GGUF models produced gibberish because llama.cpp adds 1 to RMSNorm weights during GGUF conversion, but vLLM expects original values. Fix subtracts 1 during loading, matching the correction already applied for Gemma3 in #26189. Tested on RTX 5090: coherent output, 40% MMLU accuracy, 344 tok/s.
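
The #31464 correction boils down to a one-line adjustment at weight-load time. The sketch below is my illustration of the idea (the parameter-name check is simplified), not the actual patch.

```python
import torch

def correct_gemma2_gguf_norm(name: str, weight: torch.Tensor) -> torch.Tensor:
    """Undo llama.cpp's (w + 1) RMSNorm storage for Gemma2 GGUF weights.

    llama.cpp folds the "+1" into the stored norm weights during GGUF
    conversion, while vLLM's Gemma RMSNorm adds the 1 at runtime, so the
    GGUF values must have it removed when loading. Sketch of #31464 only.
    """
    # Coarse name check for illustration; the real fix targets the exact
    # Gemma2 norm parameters.
    if "norm" in name and name.endswith(".weight"):
        return weight - 1.0
    return weight
```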

Hugging Face Transformers: While debugging Gemma2/Gemma3 GGUF output quality in vLLM, I traced the root cause upstream — Transformers' GGUF loader wasn't mapping attn_logit_softcapping from GGUF metadata into the HuggingFace config, causing models to silently use the wrong default. #42881 adds the config mappings for both architectures; once merged, it replaces the workaround in vLLM #30427.
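
For context on why that mapping matters: Gemma2 applies tanh softcapping to its attention logits, and when the parameter never reaches the config the cap is effectively disabled, so large logits pass through unchanged. A minimal sketch of the formula (my illustration; 50.0 is Gemma2's documented attn_logit_softcapping default):

```python
import torch

def soft_cap(logits: torch.Tensor, cap: float) -> torch.Tensor:
    """Gemma2-style logit softcapping: cap * tanh(logits / cap).

    Keeps logits bounded in (-cap, cap); cap == 0 means disabled.
    """
    if not cap:
        return logits
    return cap * torch.tanh(logits / cap)

# With the GGUF metadata unmapped, the model silently runs with cap == 0.
scores = torch.tensor([120.0, -3.0, 7.5])
print(soft_cap(scores, 50.0))  # bounded to (-50, 50)
print(soft_cap(scores, 0.0))   # unbounded, the buggy behavior
```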

Tech Stack


CUDA · Claude · vLLM · Transformers


Connect


Claude · AI/ML · LLM Inference · League of Legends · Mass Effect · Wild Rift · University of Waterloo
