SwayamInSync (Swayam ❤️ Open Source)

Organizations

@numpy @microsoft @Azure @conda-forge @The-Asynchronous-Lab

Stars

Optimized Kernels

14 repositories

Fine-tuning & Reinforcement Learning for LLMs. 🦥 Train OpenAI gpt-oss, DeepSeek, Qwen, Llama, Gemma, TTS 2x faster with 70% less VRAM.

Python 50,398 4,163 Updated Jan 6, 2026

A high-throughput and memory-efficient inference and serving engine for LLMs

Python 66,931 12,412 Updated Jan 6, 2026

Efficient Triton Kernels for LLM Training

Python 6,012 458 Updated Jan 5, 2026

Dynamic Memory Management for Serving LLMs without PagedAttention

C 454 34 Updated May 30, 2025

FlashMLA: Efficient Multi-head Latent Attention Kernels

C++ 11,957 925 Updated Dec 15, 2025

OpenBLAS is an optimized BLAS library based on GotoBLAS2 1.13 BSD version.

C 7,197 1,632 Updated Jan 6, 2026

Load compute kernels from the Hub

Python 357 29 Updated Dec 17, 2025

Custom PTX Instruction Benchmark

Cuda 137 10 Updated Feb 27, 2025

📚LeetCUDA: Modern CUDA Learn Notes with PyTorch for Beginners🐑, 200+ CUDA Kernels, Tensor Cores, HGEMM, FA-2 MMA.🎉

Cuda 9,231 909 Updated Jan 4, 2026

A repository to unravel the language of GPUs, making their kernel conversations easy to understand

Python 195 7 Updated Jun 1, 2025

A curated collection of resources, tutorials, and best practices for learning and mastering NVIDIA CUTLASS

247 12 Updated May 6, 2025

Domain-specific language designed to streamline the development of high-performance GPU/CPU/Accelerators kernels

Python 4,428 378 Updated Jan 6, 2026

CUDA Library Samples

C++ 2,270 439 Updated Jan 5, 2026

ring-attention experiments

Python 161 14 Updated Oct 17, 2024