Variations in nucleotide sequences often lead to significant changes in fitness. Nucleotide Foundation Models (NFMs) have emerged as a new paradigm in fitness prediction, enabling increasingly accurate estimation of fitness directly from sequence. However, assessing the relative merits of these models is difficult: published evaluations rely on diverse, assay-specific experimental datasets, and model performance often varies markedly across nucleic acid families, complicating fair comparison.
To address this challenge, we introduce NABench, a large-scale, systematic benchmark specifically designed for nucleic acid fitness prediction. NABench integrates 2.6 million mutant sequences from 162 high-throughput assays, covering a wide range of DNA and RNA families. Within a standardized and unified evaluation framework, we rigorously assess 29 representative nucleotide foundation models.
NABench's evaluation covers a variety of complementary scenarios: zero-shot prediction, few-shot adaptation, supervised training, and transfer learning. Our experimental results quantify the heterogeneity in model performance across different tasks and nucleic acid families, revealing the strengths and weaknesses of each model. This curated benchmark lays the groundwork for the development of next-generation nucleotide foundation models, poised to drive impactful applications in cellular biology and nucleic acid drug discovery.
Figure 1: The NABench Benchmark Framework.
Our comprehensive evaluation reveals a nuanced performance landscape in which no single model or architectural family dominates across all settings. The most striking finding is a clear performance dichotomy between architectural families in the zero-shot versus supervised settings.
- In the zero-shot setting, autoregressive models (e.g., GPT-like) and state-space models (e.g., Hyena/Evo series) show a clear advantage.
- When labeled data is introduced, in supervised and few-shot scenarios, many BERT-like models demonstrate a remarkable ability to learn, often outperforming the generative models.
This suggests fundamental differences in the nature of the representations learned by these architectures. Detailed performance files and more in-depth analyses (e.g., breakdowns by nucleic acid type, mutational depth) can be found in the benchmarks folder.
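For the generative (autoregressive and state-space) models, the standard zero-shot fitness score is the log-likelihood ratio between a mutant and its wild-type sequence. A minimal sketch of that scoring rule, using a toy stand-in for the model's log-probability (`toy_logp` is purely illustrative, not an actual NFM):

```python
import math

def zero_shot_fitness(logp_fn, wildtype: str, mutant: str) -> float:
    """Score a mutant as log P(mutant) - log P(wildtype), where logp_fn
    returns a sequence's total log-probability under a generative model."""
    return logp_fn(mutant) - logp_fn(wildtype)

def toy_logp(seq: str) -> float:
    """Toy stand-in model: independent per-base probabilities with a
    mild GC preference. A real NFM would return the model's sequence
    log-likelihood here."""
    probs = {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2}
    return sum(math.log(probs[base]) for base in seq)

# A1G substitution: swapping A for G raises the toy likelihood,
# so the mutant receives a positive fitness score.
score = zero_shot_fitness(toy_logp, "ACGT", "GCGT")
```

A positive score means the model considers the mutant more probable (and, under the zero-shot assumption, fitter) than the wild type.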
Our benchmark evaluates a total of 29 nucleotide foundation models, which are categorized into four main architectural classes: BERT-like, GPT-like, Hyena, and LLaMA-based.
| Model | Params | Max Length | Tokenization | Architecture |
|---|---|---|---|---|
| LucaVirus | 1.8B | 1280 | Single | BERT |
| Evo2-7B-base | 7B | 8192 | Single | Hyena |
| Evo2-7B | 7B | 131072 | Single | Hyena |
| Evo-1-8k | 6.45B | 8192 | Single | Hyena |
| Evo-1-8k-base | 6.45B | 131072 | Single | Hyena |
| GENA-LM | 336M | 512 | k-mer | BERT |
| N.T.v2 | 500M | 2048 | k-mer | BERT |
| N.T.v2 | 50M | 2048 | k-mer | BERT |
| CRAFTS | 161M | 1024 | Single | GPT |
| LucaOne | 1.8B | 1280 | Single | BERT |
| AIDO.RNA | 1.6B | 1024 | Single | BERT |
| BiRNA-BERT | 117M | dynamic | BPE | BERT |
| Evo-1.5 | 6.45B | 131072 | Single | Hyena |
| GenSLM | 2.5B | 2048 | Codon | BERT |
| HyenaDNA | 54.6M | up to 1M | Single | Hyena |
| N.T. | 500M | 1000 | k-mer | BERT |
| RFAMLlama | 88M | 2048 | Single | GPT |
| RNA-FM | 99.52M | 1024 | Single | BERT |
| RNAErnie | 105M | 1024 | Single | BERT |
| GenerRNA | 350M | dynamic | BPE | GPT |
| DNABERT | 117M | dynamic | k-mer | BERT |
| RINALMo | 650M | 1022 | Single | BERT |
| Enformer | 251M | 196608 | Single | BERT |
| SPACE | 588M | 131072 | Single | BERT |
| GENERator | 3B | 16384 | 6-mer | GPT |
| RESM | 150M | dynamic | Single | BERT |
| RESM | 650M | dynamic | Single | BERT |
| structRFM | 86M | 512 | Single | BERT |
The DMS assay data used in the paper are available directly in the data directory. For the SELEX data used in the manuscript, we are still finalizing the organization and cleaning of the processed results and plan to release them publicly within the next few months.
If you would like to suggest a new fitness dataset to be included in NABench, please open an issue with the `new_assay` label. We typically consider the following criteria for inclusion:
- The corresponding raw dataset must be publicly available.
- The assay must be related to nucleic acids (DNA/RNA).
- The dataset needs to have a sufficient number of variant measurements.
- The assay should have a sufficiently high dynamic range.
- The assay must be relevant to fitness prediction.
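As a rough illustration of the quantitative criteria above, a hypothetical pre-screen might check the variant count and dynamic range of an assay's fitness measurements. The threshold values below are placeholders for illustration, not the cutoffs we actually apply:

```python
def meets_inclusion_criteria(fitness_values,
                             min_variants=100,
                             min_dynamic_range=1.0):
    """Hypothetical pre-screen mirroring two of the criteria above:
    enough variant measurements, and a sufficiently wide spread between
    the lowest- and highest-fitness variants. Thresholds are placeholders."""
    if len(fitness_values) < min_variants:
        return False
    dynamic_range = max(fitness_values) - min(fitness_values)
    return dynamic_range >= min_dynamic_range
```

Assays failing such checks tend to yield noisy rank correlations, which is why variant count and dynamic range matter for benchmarking.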
If you would like to include a new baseline model in NABench, please follow these steps:
- Submit a Pull Request containing:
  - A new subfolder under `scripts/` named after your model. This folder should contain a scoring script `seq_emb.py` and a run script `seq_emb.sh`, similar to other models in the repository.
  - All code dependencies required for the scoring script to run properly.
- Open an issue with the `new_model` label, providing instructions on how to download relevant model checkpoints and reporting your model's performance on the relevant benchmark using our performance scripts.
Currently, we are only considering models that meet the following conditions:
- The model is able to score all mutants in the relevant benchmark.
- The corresponding model is open-source to allow for reproducibility.
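For orientation, a `seq_emb.py` scoring script typically reads a CSV of mutant sequences and writes one embedding per sequence. The skeleton below is an illustrative sketch only: the toy composition embedding and pickle output stand in for a real model forward pass and a `.pt` file:

```python
import csv
import pickle

def embed_sequence(seq: str) -> list:
    """Placeholder embedding: a real seq_emb.py would tokenize the
    sequence, run the model's forward pass, and pool the hidden states.
    Here we return a toy 4-dim base-composition vector for illustration."""
    n = max(len(seq), 1)
    return [seq.count(base) / n for base in "ACGT"]

def run(in_csv: str, out_path: str) -> None:
    """Read a CSV with a 'sequence' column and save {sequence: embedding}.
    Real scripts in scripts/<model>/ save torch tensors to a .pt file."""
    embeddings = {}
    with open(in_csv, newline="") as fh:
        for row in csv.DictReader(fh):
            embeddings[row["sequence"]] = embed_sequence(row["sequence"])
    with open(out_path, "wb") as fh:
        pickle.dump(embeddings, fh)
```

The column name `sequence` and the output format are assumptions; match whatever convention the existing model subfolders use.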
**Environment Setup**

We recommend using Conda to create and manage your Python environment:

```shell
# (Recommended) Create environment with conda
conda create -n nabench python=3.9
conda activate nabench
# Install dependencies with pip
pip install -r requirements.txt
```

**Download Data**

Download the necessary data from the Resources section above and unzip it into your project's root directory or a specified path.
**Generate Sequence Embeddings**

Our `scripts` directory provides a standardized embedding extraction pipeline for each model. To generate embeddings for a specific model, run:

```shell
# Example for DNABERT
bash scripts/dnabert/seq_emb.sh path/to/input/data.csv path/to/output/embeddings.pt
```

Please refer to the README or script comments in each model's directory for detailed parameters.
**Evaluate Model Performance**

After generating embeddings/scores for all models, you can use our evaluation scripts to compute performance metrics:

```shell
# Example command (specific script to be provided by you)
python evaluate.py --scores_dir path/to/scores --output_dir benchmarks/
```

This script will generate detailed performance reports, including metrics aggregated by different dimensions (e.g., nucleic acid type, evaluation setting).
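The headline metric in most fitness-prediction benchmarks is the Spearman rank correlation between predicted scores and measured fitness; `evaluate.py` is assumed to report something similar. A dependency-free sketch of the metric (assuming no tied ranks):

```python
def spearman_rho(xs, ys):
    """Spearman rank correlation via the classic 1 - 6*sum(d^2)/(n(n^2-1))
    formula. Valid when there are no ties; with ties, rank-average the
    values first or use scipy.stats.spearmanr."""
    def ranks(values):
        order = sorted(range(len(values)), key=lambda i: values[i])
        r = [0] * len(values)
        for rank, i in enumerate(order):
            r[i] = rank
        return r

    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    d_squared = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))
```

Because the metric depends only on ranks, it is insensitive to monotone rescaling of model scores, which is why it suits comparing heterogeneous models on the same assay.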
If you find this codebase useful for your research, please consider citing our paper.
@article{nabench,
  title={{NABench}: Large-Scale Benchmarks of Nucleotide Foundation Models for Fitness Prediction},
  author={Zhongmin Li and Runze Ma and Jiahao Tan and Chengzi Tan and Shuangjia Zheng},
  journal={arXiv preprint arXiv:2511.02888},
  year={2025}
}
We thank all the researchers and experimentalists who developed the original assays and foundation models that made this benchmark possible. We also acknowledge the invaluable contributions of the communities behind ProteinGym and RNAGym, which heavily inspired this work.
Please consider citing the corresponding papers of the models and datasets you use from this benchmark.
