I am a Computer Science PhD student working on efficient large language model inference systems. My recent work sits at the intersection of KV Cache optimization, hierarchical memory management, and practical serving-system design.
I am especially interested in building methods that are not only effective on paper, but also honest under real system constraints such as memory fragmentation, bandwidth pressure, and latency-throughput tradeoffs.
- 🧠 KV Cache pruning, compression, and offloading
- 📏 Long-context inference optimization
- 🚀 vLLM-style serving systems and performance tuning
- 📱 Edge-side deployment for resource-constrained devices
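To make the KV-cache interest concrete, here is a minimal toy sketch (my own illustrative example, not any specific paper's or vLLM's method): score each cached token by accumulated attention mass and evict the lowest-scoring entries to fit a fixed cache budget. The function name `evict_kv` and the scoring rule are assumptions for illustration.

```python
# Toy KV-cache eviction sketch: keep the `budget` tokens with the
# highest accumulated attention mass, drop the rest.
# (Illustrative only; real systems also handle paging, batching, heads.)
import numpy as np

def evict_kv(keys, values, attn_scores, budget):
    """Keep the `budget` tokens with the highest accumulated attention."""
    if keys.shape[0] <= budget:
        return keys, values
    keep = np.argsort(attn_scores)[-budget:]  # indices of the top-`budget` tokens
    keep.sort()                               # preserve original token order
    return keys[keep], values[keep]

# Toy cache: 6 tokens, head dim 4
rng = np.random.default_rng(0)
K = rng.standard_normal((6, 4))
V = rng.standard_normal((6, 4))
scores = np.array([0.30, 0.05, 0.20, 0.02, 0.25, 0.18])

K2, V2 = evict_kv(K, V, scores, budget=4)
print(K2.shape)  # (4, 4): tokens 1 and 3 (lowest scores) were evicted
```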
I like research that feels a bit like system detective work:
- 🔍 Find where the real bottleneck is hiding.
- 🧠 Figure out whether the cost comes from memory, movement, or scheduling.
- ⚙️ Turn that pain point into something measurable and optimizable.
- 📊 Test whether the idea still holds under realistic workloads.
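A toy version of the middle steps, assuming a NumPy workload purely for illustration: separate "movement" cost from "compute" cost by timing each in isolation. Real attribution would use tools like `perf`, `nsys`, or `torch.profiler`; this just shows the measure-before-optimize habit.

```python
# Sketch: attribute cost to data movement vs. compute by timing each alone.
import time
import numpy as np

def timed(fn, *args, repeats=5):
    """Best-of-N wall-clock time for a callable (coarse, illustrative)."""
    best = float("inf")
    for _ in range(repeats):
        t0 = time.perf_counter()
        fn(*args)
        best = min(best, time.perf_counter() - t0)
    return best

x = np.ones((2000, 2000), dtype=np.float32)

copy_t = timed(np.copy, x)         # pure data movement
matmul_t = timed(np.matmul, x, x)  # compute-heavy

print(f"copy: {copy_t * 1e3:.2f} ms, matmul: {matmul_t * 1e3:.2f} ms")
```

Once the dominant term is known, the optimization target (and the metric to hold it honest) falls out naturally.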
Python for fast prototyping, C++ for systems work, Linux for getting close to the machine, and agent tools for making research workflows a little less manual.
- ✍️ I enjoy explaining system ideas as much as building them
- 🚴 I spend time on badminton, cycling, photography, and sci-fi / mystery reading
- ✨ I like projects that feel rigorous, useful, and a little elegant
Optimize what matters. Keep the system honest.
