Software Engineer / AI Engineer (5+ years) focused on CUDA/C++, ML systems / distributed training, and LLM infrastructure.
I like building systems that connect low-level performance with real-world ML products, from GPU kernels to scalable AI pipelines.
NanoTorch, a minimalist deep learning framework built from scratch:
- Dynamic autograd + tensor ops (CPU/CUDA)
- Core layers + optimizers
- Serialization + checkpointing
- ONNX export + benchmarks vs PyTorch
https://github.com//nanotorch
- CUDA kernel optimization (matmul/elementwise/fused ops)
- Profiling + memory efficiency
- Distributed training experiments + benchmarking
- LangGraph/LangChain multi-agent workflows
- Hybrid retrieval (pgvector + keyword)
- Enterprise ingestion (Microsoft Graph / Google Drive)
- Evaluation + grounding for reliability
- Built production systems: microservices, Kubernetes, PostgreSQL, AWS
- AI systems: agents, RAG pipelines, retrieval + evaluation
- Strong GPU/C++ focus (CUDA, performance engineering)
- Ontario Graduate Certificate (16 months), Wireless Information Networking
GPA: 3.33/4.0 (4 semesters)
C++ | CUDA | Python | PostgreSQL | Docker | Kubernetes | AWS | FastAPI | gRPC | LangGraph/LangChain | pgvector | Prometheus/Grafana
- NanoTorch: C++/CUDA DL framework
  https://github.com//nanotorch
- RAG / Multi-Agent Pipelines
  https://github.com//rag-agents
- CUDA Kernels / Experiments
  https://github.com//cuda-kernels
Email: deepanshut041@gmail.com
LinkedIn:
GitHub: https://github.com/deepanshut041



