Burn is a next-generation tensor library and deep learning framework that doesn't compromise on flexibility, efficiency, or portability.
Updated Jan 16, 2026 - Rust
AMD RAD's multi-GPU Triton-based framework for seamless multi-GPU programming
An efficient concurrent graph processing system
GoPTX: Fine-grained GPU Kernel Fusion by PTX-level Instruction Flow Weaving
LAMB go brrr
MLX + Metal implementation of mHC: Manifold-Constrained Hyper-Connections by DeepSeek-AI.
Compile-time kernel fusion and expression trees as an Alpaka backend for boost.odeint. This is my team project, developed in collaboration with and under the supervision of HZDR.
Assignment 3 for the "Parallel & Distributed Systems" course (ECE, AUTh)
High-performance CUDA implementation of LayerNorm for PyTorch achieving 1.46x speedup through kernel fusion. Optimized for large language models (4K-8K hidden dims) with vectorized memory access, warp-level primitives, and mixed precision support. Drop-in replacement for nn.LayerNorm with 25% memory reduction.