  • University of Bristol
  • Bristol
dextermayhewjd/README.md

Hi there, I'm Dexter Ding (Jiahua Ding) 👋

AI Infra Engineer focused on Large Language Model (LLM) pre- and post-training and inference optimization. Currently developing expertise across the full training tech stack and in GPU kernel optimization (CUDA/Triton). I combine solid ML theory, PyTorch engineering, and system-level optimization to build scalable, high-efficiency AI solutions.

"Building a full-stack understanding, from model math to the GPU execution pipeline, and from training to inference"

Languages and Tools:

| Domain | Skills / Tools |
| --- | --- |
| LLM Training | PyTorch, LLaMA-Factory, verl, Megatron, DeepSpeed |
| Inference Optimization | vLLM, CUDA, Tensor Cores, Nsight Systems |
| GPU & Systems | CUDA C++, PTX profiling, memory hierarchy tuning, CUTLASS, cuBLAS, cuDNN |
| Efficient Computing | Pruning, Quantization, Knowledge Distillation, Kernel Fusion |
| Math Foundations | Linear Algebra, Probability, Information Theory (KL, CE), Optimization |


⚡ GitHub Stats

Pinned

  1. AI-Systems-C-LLM-Infra Public

    AI Systems & C++ / LLM Infra study curriculum (long-term plan). This repository is for systematically studying C++ / Systems / Parallel Computing / LLM Systems / AI Infra. The goal is to be able to independently implement LLM inference and training infrastructure.

    Python

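  2. nano_vllm Public

    Building nano vllm from scratch with C++/CUDA + CUTLASS/NCCL and compiler optimizations, covering performance bottlenecks across the full inference pipeline, with a performance ceiling benchmarked against industrial-grade frameworks.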

    Python

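  3. assignment1-basics Public

    Forked from stanford-cs336/assignment1-basics

    A student's from-scratch implementation of Stanford CS336 a1 (Language Modeling From Scratch), with study notes. Covers everything from the BPE tokenizer to each module of the transformer. Uses MapReduce along the way to reduce how often OOM is triggered, alongside a self-written single-machine training job system and OmegaConf for ablation experiments.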

    Python

  4. learn_cuda-triton Public

    Covers study notes from PMPP, single-machine matrix multiplication speed tests, and an exploration of Tensor Cores.

    C++

  5. advanced-hpc-lbm Public

    Forked from UoB-HPC/advanced-hpc-lbm

    COMS30006 - Advanced High Performance Computing - Lattice Boltzmann

    C 1

  6. hitfox Public

    Year 3 game development project using C# and Unity

    C# 6
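
The assignment1-basics description above mentions using MapReduce to reduce OOM while building the BPE tokenizer. The repository's actual code is not shown here, so the following is only a minimal sketch of the general idea, assuming the memory-heavy step is counting pre-tokens over a large corpus. All function names (`iter_chunks`, `map_count`, `reduce_counts`, `count_pretokens`) and the whitespace pre-tokenization are hypothetical choices for illustration.

```python
from collections import Counter
from functools import reduce
from typing import Iterable, Iterator


def iter_chunks(lines: Iterable[str], lines_per_chunk: int) -> Iterator[list[str]]:
    """Yield small batches of lines so the whole corpus never sits in memory."""
    chunk: list[str] = []
    for line in lines:
        chunk.append(line)
        if len(chunk) == lines_per_chunk:
            yield chunk
            chunk = []
    if chunk:  # emit the final, possibly shorter, batch
        yield chunk


def map_count(chunk: list[str]) -> Counter:
    """Map step: count whitespace-separated pre-tokens in one chunk."""
    counts: Counter = Counter()
    for line in chunk:
        counts.update(line.split())
    return counts


def reduce_counts(total: Counter, partial: Counter) -> Counter:
    """Reduce step: fold one partial count into the running total."""
    total.update(partial)
    return total


def count_pretokens(lines: Iterable[str], lines_per_chunk: int = 2) -> Counter:
    # map() is lazy, so only one chunk and its partial count are alive at a time.
    partials = map(map_count, iter_chunks(lines, lines_per_chunk))
    return reduce(reduce_counts, partials, Counter())


# Toy corpus standing in for a file read line by line (hypothetical data).
corpus = ["low lower lowest", "low low", "newer lower"]
totals = count_pretokens(corpus)
```

In a real run, `lines` would typically be an open file handle, and the map step could be spread across worker processes; only the small per-chunk counters need to be merged.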