ML Engineer | Training → Inference → Production
8+ years building ML systems end-to-end: from training custom models to optimizing inference at the GPU kernel level.
- Inference Optimization: custom Triton kernels (Flash Attention, LayerNorm, GELU) with a 1.85x speedup over the HuggingFace baseline
- LLM Applications: production RAG/agents with LangGraph, 60-70% cost reduction
- Model Training: fine-tuning transformers (BERT, T5, Llama) with LoRA/QLoRA (see the sketch after this list)
- Production: GCP/AWS deployments serving millions of requests
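
For the LoRA/QLoRA work, the setup typically looks like the minimal sketch below, using Hugging Face PEFT; the checkpoint, rank, and target modules are illustrative assumptions, not a specific project config.

```python
# Minimal LoRA fine-tuning setup with Hugging Face PEFT (illustrative values).
import torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig, TaskType, get_peft_model

base = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",           # placeholder: any causal-LM checkpoint
    torch_dtype=torch.bfloat16,
)
config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16,                                 # low-rank adapter dimension
    lora_alpha=32,                        # adapter scaling factor
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt
)
model = get_peft_model(base, config)
model.print_trainable_parameters()        # typically well under 1% trainable
```

The wrapped model drops into a standard training loop unchanged; only the adapter weights receive gradients, which is what makes fine-tuning 7B+ models tractable on a single GPU.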
Triton-GPT2 | GPU Kernel Development
GPT-2 inference engine with custom Triton kernels: 275 tokens/s vs. 149 tokens/s for the HuggingFace baseline (1.85x speedup)
- Fused Flash Attention matching PyTorch SDPA
- Custom LayerNorm, GELU, and Softmax kernels (LayerNorm sketched below)
- KV-cache for autoregressive decoding
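
To give a flavor of these kernels, here is a minimal sketch of a fused LayerNorm forward pass in Triton; the launch setup and block-size handling are illustrative, not the project's actual kernel.

```python
# Fused LayerNorm forward in Triton: one program instance per row of (M, N).
import torch
import triton
import triton.language as tl

@triton.jit
def layernorm_fwd(X, Y, W, B, stride, N, eps, BLOCK_SIZE: tl.constexpr):
    row = tl.program_id(0)
    cols = tl.arange(0, BLOCK_SIZE)
    mask = cols < N
    x = tl.load(X + row * stride + cols, mask=mask, other=0.0).to(tl.float32)
    # Mean and variance computed in registers, in fp32 for stability.
    mean = tl.sum(x, axis=0) / N
    diff = tl.where(mask, x - mean, 0.0)
    var = tl.sum(diff * diff, axis=0) / N
    rstd = 1.0 / tl.sqrt(var + eps)
    # Normalize and apply the learned scale/shift in the same kernel.
    w = tl.load(W + cols, mask=mask, other=0.0)
    b = tl.load(B + cols, mask=mask, other=0.0)
    y = (x - mean) * rstd * w + b
    tl.store(Y + row * stride + cols, y, mask=mask)

def layernorm(x, weight, bias, eps=1e-5):
    M, N = x.shape                        # assumes a contiguous 2D CUDA tensor
    y = torch.empty_like(x)
    BLOCK_SIZE = triton.next_power_of_2(N)
    layernorm_fwd[(M,)](x, y, weight, bias, x.stride(0), N, eps, BLOCK_SIZE=BLOCK_SIZE)
    return y
```

Fusing mean, variance, normalization, and the affine transform into one kernel avoids the extra global-memory round trips a naive composition of PyTorch ops would incur.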
Meditations RAG | Production LLM Application
Agentic RAG with LangGraph: Controller → Retriever → Generator → Evaluator loop (sketched below)
Live Demo | Sub-500ms latency | 500+ RPS
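
A minimal sketch of that loop with LangGraph follows; the state fields, the grading stubs (`search_index`, `llm_answer`, `grounded`), and the retry cap are illustrative assumptions, not the production graph.

```python
# Controller → Retriever → Generator → Evaluator loop in LangGraph.
from typing import List, TypedDict
from langgraph.graph import END, StateGraph

class RAGState(TypedDict):
    question: str
    docs: List[str]
    answer: str
    verdict: str
    retries: int

def search_index(q: str) -> List[str]:              # stand-in for the vector store
    return [f"passage about {q}"]

def llm_answer(q: str, docs: List[str]) -> str:     # stand-in for the LLM call
    return f"answer to {q!r} from {len(docs)} passages"

def grounded(answer: str, docs: List[str]) -> bool: # stand-in for the grader
    return bool(docs)

def controller(state: RAGState) -> dict:
    return {"retries": state.get("retries", 0)}

def retriever(state: RAGState) -> dict:
    return {"docs": search_index(state["question"])}

def generator(state: RAGState) -> dict:
    return {"answer": llm_answer(state["question"], state["docs"])}

def evaluator(state: RAGState) -> dict:
    ok = grounded(state["answer"], state["docs"])
    return {"verdict": "done" if ok else "retry", "retries": state["retries"] + 1}

def route(state: RAGState) -> str:
    # Loop back to retrieval until the answer is grounded or retries run out.
    return "done" if state["verdict"] == "done" or state["retries"] >= 2 else "retry"

graph = StateGraph(RAGState)
graph.add_node("controller", controller)
graph.add_node("retriever", retriever)
graph.add_node("generator", generator)
graph.add_node("evaluator", evaluator)
graph.set_entry_point("controller")
graph.add_edge("controller", "retriever")
graph.add_edge("retriever", "generator")
graph.add_edge("generator", "evaluator")
graph.add_conditional_edges("evaluator", route, {"retry": "retriever", "done": END})
app = graph.compile()

print(app.invoke({"question": "What does Marcus Aurelius say about anger?"}))
```

The evaluator-gated edge is what makes the graph agentic: a weak answer routes back through retrieval instead of being returned to the user.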
GPU/Inference: Triton, CUDA, vLLM, TGI, Quantization (GPTQ, AWQ)
LLM Apps: LangGraph, LangChain, LlamaIndex, RAG
Training: PyTorch, LoRA/QLoRA, Mixed Precision, DeepSpeed
Production: GCP, Docker, Kubernetes, FastAPI, Redis
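
As a quick taste of the inference side of this stack, a minimal vLLM offline-generation sketch (the checkpoint and sampling settings are placeholders):

```python
# Minimal offline batch inference with vLLM (placeholder model and settings).
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-2-7b-hf")   # any HF causal-LM checkpoint
params = SamplingParams(temperature=0.7, max_tokens=128)
outputs = llm.generate(["Explain KV caching in one paragraph."], params)
print(outputs[0].outputs[0].text)
```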
Open to remote opportunities (contract or full-time) | Flexible across US/EU time zones

