TurboFNO is a high-performance GPU implementation of the Fourier Neural Operator (FNO), designed for solving PDEs with deep learning. Unlike standard FNO implementations that execute FFT, filtering, GEMM, and iFFT as separate kernels—causing redundant memory traffic—TurboFNO introduces the first fully fused FFT–GEMM–iFFT GPU kernel with built-in optimizations.
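For reference, the four stages that standard implementations launch as separate kernels can be written out in plain Python. This is only a sketch of the math, not TurboFNO's kernel. The names `dft` and `spectral_layer` are hypothetical, the transform is a naive O(n²) DFT, and the sketch assumes a single weight matrix shared by all retained modes so that the channel mixing is one GEMM.

```python
import cmath


def dft(x, inverse=False):
    """Naive O(n^2) discrete Fourier transform of a list of complex numbers."""
    n = len(x)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        acc = 0j
        for t in range(n):
            acc += x[t] * cmath.exp(sign * 2j * cmath.pi * k * t / n)
        # The inverse transform carries the 1/n normalization.
        out.append(acc / n if inverse else acc)
    return out


def spectral_layer(x, weight, modes):
    """Apply FFT -> truncation -> GEMM -> iFFT to one multi-channel 1D signal.

    x:      list of c_in channels, each a list of n complex samples
    weight: c_in x c_out complex matrix (assumed shared across modes)
    modes:  number of leading frequency bins kept by the filter
    """
    n = len(x[0])
    c_in, c_out = len(weight), len(weight[0])

    # Stage 1: forward FFT of every input channel.
    spec = [dft(channel) for channel in x]

    # Stage 2: frequency truncation, keeping only the first `modes` bins.
    spec = [channel[:modes] for channel in spec]

    # Stage 3: GEMM over the channel (hidden) dimension for each kept bin.
    mixed = [
        [sum(spec[i][k] * weight[i][o] for i in range(c_in)) for k in range(modes)]
        for o in range(c_out)
    ]

    # Stage 4: zero-pad back to length n, then inverse FFT per output channel.
    return [dft(channel + [0j] * (n - modes), inverse=True) for channel in mixed]
```

In an unfused implementation, `spec` and `mixed` are intermediate buffers that each stage writes out and the next stage reads back. That round trip is the redundant memory traffic a fused kernel is meant to avoid.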
TurboFNO includes:
- Custom high-performance FFT and GEMM kernels, matching or exceeding cuFFT and cuBLAS performance;
- Built-in zero-padding, frequency truncation, and channel pruning to eliminate unnecessary data movement;
- A novel kernel fusion strategy where each thread block traverses the hidden dimension, aligning FFT output with GEMM computation;
- Shared memory swizzling techniques to ensure 100% memory bank utilization between FFT, GEMM, and iFFT stages.

On NVIDIA A100, TurboFNO achieves up to 1.5× speedup over PyTorch and NVIDIA’s closed-source libraries.
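To see why swizzling matters for bank utilization, a small model of shared memory is enough. The sketch below assumes 32 banks of 4-byte words and a 32×32 row-major tile, and it uses a generic XOR swizzle. The function names are hypothetical, and the pattern TurboFNO actually applies may differ.

```python
NUM_BANKS = 32  # assumption: 32 banks, each serving one 4-byte word per cycle


def bank(row, col, swizzle):
    """Bank hit by element (row, col) of a row-major 32x32 tile of 4-byte words."""
    if swizzle:
        # XOR swizzle: permute the column index within each row.
        col ^= row
    return (row * NUM_BANKS + col) % NUM_BANKS


def banks_hit_by_column_read(col, swizzle):
    """Distinct banks touched when thread i reads element (i, col)."""
    return len({bank(row, col, swizzle) for row in range(NUM_BANKS)})


def banks_hit_by_row_read(row, swizzle):
    """Distinct banks touched when thread i reads element (row, i)."""
    return len({bank(row, col, swizzle) for col in range(NUM_BANKS)})


if __name__ == "__main__":
    for swizzle in (False, True):
        print(
            f"swizzle={swizzle}: "
            f"column read hits {banks_hit_by_column_read(5, swizzle)} banks, "
            f"row read hits {banks_hit_by_row_read(5, swizzle)} banks"
        )
```

Without the swizzle, a column read sends all 32 threads to the same bank, while a row read is conflict-free. With the XOR permutation both access patterns spread over all 32 banks, which is the property needed when one stage writes a tile in one orientation and the next stage reads it in the other.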

    TurboFNO/
    ├── fusion_variants/     # All kernel fusion variants (stepwise E→A→B→C→D for 1D/2D)
    ├── benchmark_config/    # Input problem sizes for 1D and 2D
    ├── TurboFFT/            # Git submodule (TurboFNO_dev branch)
    ├── utils/               # Shared code and support modules
    ├── install.sh           # Batch compile and PATH setup script
    └── README.md

- Clone the repository:

      git clone https://github.com/shixun404/TurboFNO.git
      cd TurboFNO

- Initialize the TurboFFT submodule (required branch: `TurboFNO_dev`):

      git submodule update --init --recursive

- Set the project root environment variable (used by all CMake builds):

      export PROJECT_ROOT=$(pwd)

- Build all kernel fusion variants (1D and 2D):

      bash install.sh

- Temporarily add all compiled binaries to your PATH:

      source $PROJECT_ROOT/envpath.sh

- [Optional] Clean all builds:

      bash install.sh clean

💡 `envpath.sh` only modifies your `PATH` for the current terminal session. It is auto-generated by `install.sh`.

Currently, only complex-to-complex (C2C) FFTs are supported, and the frequency domain is truncated to size 64 after applying the high-frequency filter.

| Variant | Executable Name | Fusion Strategy | Description |
|---|---|---|---|
| E | TurboFNO_1D_E / TurboFNO_2D_E | No fusion | Baseline |
| A | TurboFNO_1D_A / TurboFNO_2D_A | FFT + GEMM + iFFT (separate kernels) | Initial kernel sequence |
| B | TurboFNO_1D_B / TurboFNO_2D_B | Fused FFT + GEMM | First-stage fusion |
| C | TurboFNO_1D_C / TurboFNO_2D_C | FFT + Fused GEMM + iFFT | Mid-stage fusion |
| D | TurboFNO_1D_D / TurboFNO_2D_D | Fully fused FFT + GEMM + iFFT | Final optimized implementation |

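The executable names in the table follow one pattern, so a short script can confirm that all ten binaries are reachable after sourcing `envpath.sh`. This is a convenience sketch rather than part of the repository. The helper names are hypothetical, and it only checks for presence on `PATH` without running anything.

```python
import shutil

VARIANTS = ["E", "A", "B", "C", "D"]  # stepwise optimization order
DIMS = ["1D", "2D"]


def variant_executables():
    """Executable names as listed in the variant table."""
    return [f"TurboFNO_{dim}_{variant}" for dim in DIMS for variant in VARIANTS]


def missing_from_path(names):
    """Return the names that cannot be resolved on the current PATH."""
    return [name for name in names if shutil.which(name) is None]


if __name__ == "__main__":
    missing = missing_from_path(variant_executables())
    if missing:
        print("not on PATH:", ", ".join(missing))
    else:
        print("all variants found")
```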
Sample output from running `TurboFNO_1D_A`:

    1D_A, bs=1 , dimX=1 , DY=128 , N=64 , K=8 , TIME= 0.026ms
    1D_A, bs=1 , dimX=1 , DY=128 , N=64 , K=16 , TIME= 0.028ms
    1D_A, bs=1 , dimX=1 , DY=128 , N=64 , K=24 , TIME= 0.031ms
    1D_A, bs=1 , dimX=1 , DY=128 , N=64 , K=32 , TIME= 0.034ms
    1D_A, bs=1 , dimX=1 , DY=128 , N=64 , K=40 , TIME= 0.036ms
    1D_A, bs=1 , dimX=1 , DY=128 , N=64 , K=48 , TIME= 0.039ms

Each variant builds a TurboFNO binary that accepts problem-size configurations through a runtime `.txt` config file, so no recompilation is needed.

    bs_list = 1 2 4 8 16 32 64
    dimX_list = 1
    DY_list = 128 256
    N_list = 64 128
    K_list = 8 16 24 32

⚠️ If no config path is provided, the binary uses a default path compiled in via CMake.
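Before launching a long sweep it can help to see how many problem sizes a config file describes. The sketch below reads the `key = v1 v2 ...` format shown above and enumerates every combination. It assumes the binaries sweep the full Cartesian product of the five lists, which this README does not state explicitly, and the function names are hypothetical.

```python
from itertools import product

# Keys in the order they appear in the timing output.
KEYS = ("bs_list", "dimX_list", "DY_list", "N_list", "K_list")


def parse_config(text):
    """Parse lines of the form 'name = v1 v2 ...' into {name: [ints]}."""
    config = {}
    for line in text.splitlines():
        if "=" not in line:
            continue  # skip blank or unrelated lines
        key, _, values = line.partition("=")
        config[key.strip()] = [int(v) for v in values.split()]
    return config


def problem_sizes(config):
    """All (bs, dimX, DY, N, K) tuples, assuming a full Cartesian sweep."""
    return list(product(*(config[key] for key in KEYS)))


if __name__ == "__main__":
    example = """
    bs_list = 1 2 4 8 16 32 64
    dimX_list = 1
    DY_list = 128 256
    N_list = 64 128
    K_list = 8 16 24 32
    """
    sizes = problem_sizes(parse_config(example))
    print(len(sizes), "problem sizes; first:", sizes[0])
```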
We progressively optimize kernel execution from the unfused baseline to the fully fused implementation. Below are the benchmark visualizations for the 1D and 2D cases.
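To compare variants on your own hardware, the timing lines printed by each binary can be collected and parsed. This sketch assumes every result line follows the sample output format (`1D_A, bs=1 , dimX=1 , DY=128 , N=64 , K=8 , TIME= 0.026ms`). The function names `parse_timings` and `speedups` are hypothetical.

```python
import re

# Matches one result line such as:
#   1D_A, bs=1 , dimX=1 , DY=128 , N=64 , K=8 , TIME= 0.026ms
LINE_RE = re.compile(
    r"(?P<variant>\w+),\s*bs=(?P<bs>\d+)\s*,\s*dimX=(?P<dimX>\d+)\s*,"
    r"\s*DY=(?P<DY>\d+)\s*,\s*N=(?P<N>\d+)\s*,\s*K=(?P<K>\d+)\s*,"
    r"\s*TIME=\s*(?P<ms>[\d.]+)ms"
)

SIZE_KEYS = ("bs", "dimX", "DY", "N", "K")


def parse_timings(text):
    """Turn captured stdout into a list of dicts, skipping non-matching lines."""
    rows = []
    for line in text.splitlines():
        match = LINE_RE.search(line)
        if match is None:
            continue
        row = {key: int(match[key]) for key in SIZE_KEYS}
        row["variant"] = match["variant"]
        row["ms"] = float(match["ms"])
        rows.append(row)
    return rows


def speedups(baseline_rows, candidate_rows):
    """Map each shared problem size to baseline_ms / candidate_ms."""
    baseline = {tuple(r[k] for k in SIZE_KEYS): r["ms"] for r in baseline_rows}
    result = {}
    for row in candidate_rows:
        size = tuple(row[k] for k in SIZE_KEYS)
        if size in baseline:
            result[size] = baseline[size] / row["ms"]
    return result
```

Feeding the output of a baseline variant and a fused variant, run with the same config file, into `speedups` gives one ratio per problem size.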
Known limitation: only C2C FFTs are currently supported, with the frequency domain truncated to size 64 after the high-frequency filter.
FFT kernels are auto-generated. You can customize templates in `TurboFFT/TurboFFT/include/code_gen/generated/...`. Changes require a rebuild of the corresponding variant.

Tuning parameters (e.g., tile sizes, threads per block) are set in `utils/TurboFNO.h`. These control shared memory layout, tiling, and warp fusion strategies.

If you use TurboFNO in your work, please cite:
@article{wu2025turbofno,
title={TurboFNO: High-Performance Fourier Neural Operator with Fused FFT-GEMM-iFFT on GPU},
author={Wu, Shixun and Zhai, Yujia and Dai, Huangliang and Zhao, Hairui and Zhu, Yue and Hu, Haiyang and Chen, Zizhong},
journal={arXiv preprint arXiv:2504.11681},
year={2025}
}







