kernel-fusion

Compile-time kernel fusion and expression trees as an Alpaka backend for boost.odeint. This is my team project, developed in collaboration with and under the supervision of HZDR.

cuda accelerators kernel-fusion alpaka

Updated Feb 20, 2024
C++

ParCoreLab / gpu-fusion

GPU fusion code and algorithm

gpu cuda kernel-fusion

Updated May 24, 2024
Cuda

fraidakis / PDS_BitonicSortCUDA

Assignment 3 for the "Parallel & Distributed Systems" course (ECE, AUTh)

cuda shared-memory radix-sort bitonic-sort nvidia-gpu kernel-fusion

Updated Mar 16, 2025
Cuda

JonSnow1807 / Fused-LayerNorm-CUDA-Operator

High-performance CUDA implementation of LayerNorm for PyTorch achieving 1.46x speedup through kernel fusion. Optimized for large language models (4K-8K hidden dims) with vectorized memory access, warp-level primitives, and mixed precision support. Drop-in replacement for nn.LayerNorm with 25% memory reduction.

deep-learning cuda pytorch gpu-optimization kernel-fusion layernorm

Updated Aug 17, 2025
Python

kernel-fusion

Here are 10 public repositories matching this topic...

tracel-ai / burn

ROCm / iris

chhzh123 / Krill

wu-kan / GoPTX

nopperl / pytorch-fused-lamb

svdrecbd / mhc-mlx

ShkalikovOleh / alpaka_expr_trees

ParCoreLab / gpu-fusion

fraidakis / PDS_BitonicSortCUDA

JonSnow1807 / Fused-LayerNorm-CUDA-Operator
