The NMD-Scanner is a Python-based variant effect annotation tool that predicts the likelihood of transcript degradation through nonsense-mediated decay (NMD). It reconstructs reference and alternative coding sequences as well as transcript sequences in some cases, identifies premature termination codons (PTCs), and evaluates canonical and non-canonical NMD escape rules. It can handle single-nucleotide variants, multiple base substitutions, long and short deletions and duplications as well as frameshift variants.
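As a rough illustration of the PTC identification step mentioned above, the sketch below scans a coding sequence codon by codon for the first in-frame stop codon. It is a minimal example under stated assumptions, not NMD-Scanner's implementation: the function names `first_in_frame_stop` and `classify_stop` and the returned labels are hypothetical, and it only looks at the CDS string, so it cannot follow a frameshifted reading frame into the 3' UTR.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def first_in_frame_stop(cds):
    """Return the 0-based index of the first in-frame stop codon, or None."""
    cds = cds.upper()
    # Step through complete codons only, starting at the first base of the CDS.
    for i in range(0, len(cds) - 2, 3):
        if cds[i:i + 3] in STOP_CODONS:
            return i
    return None


def classify_stop(alt_cds):
    """Label the first in-frame stop of an alternative CDS (hypothetical labels)."""
    stop = first_in_frame_stop(alt_cds)
    if stop is None:
        # No stop inside the CDS: stop-loss, or a frameshift reading past the CDS end.
        return ("no_stop_in_cds", None)
    if stop + 3 < len(alt_cds):
        # The stop codon ends before the last base of the CDS, so it is premature.
        return ("ptc", stop)
    return ("normal_stop", stop)


ref_cds = "ATGAAACGATGGTAA"   # ATG AAA CGA TGG TAA
snv_cds = "ATGAAATGATGGTAA"   # C>T at index 6 creates TGA
del_cds = "ATGAACGATGGTAA"    # 1-nt deletion shifts the frame

print(classify_stop(ref_cds))  # ('normal_stop', 12)
print(classify_stop(snv_cds))  # ('ptc', 6)
print(classify_stop(del_cds))  # ('no_stop_in_cds', None)
```

The last case shows why transcript sequence is sometimes needed in addition to the CDS: after a frameshift, the next stop codon may lie beyond the annotated CDS end.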
- Reconstructs reference and alternative CDS, reference transcript sequence and (in some cases) the alternative transcript sequences with metadata
- Detects start-loss and stop-loss variants and premature termination codons (PTCs), reporting each PTC's exact position in the CDS and the exon that contains it
- Computes different NMD-related features:
  - Total, upstream and downstream exon count
  - Distance of PTC to original stop codon
  - Distance of PTC to start codon
  - Transcript length
  - 3' and 5' UTR lengths
- Evaluates five canonical NMD escape rules:
  - Last exon rule
  - 50nt penultimate rule
  - Long exon rule
  - Start-proximal rule
  - Single-exon rule
- Outputs all annotations as a structured DataFrame (CSV)
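The five escape rules listed above can be sketched in a few lines of plain Python. This is an illustrative sketch only, not the code NMD-Scanner runs: the function name `nmd_escape_rules`, the dictionary keys, and the coordinate convention (0-based positions in the spliced transcript) are assumptions made for the example. The default thresholds (50 nt, 407 nt, 150 nt) are values commonly cited in the NMD literature; the cutoffs used by NMD-Scanner may differ.

```python
from itertools import accumulate


def nmd_escape_rules(exon_lengths, ptc_pos, cds_start,
                     penultimate_nt=50, long_exon_nt=407, start_proximal_nt=150):
    """Evaluate five NMD escape rules for one PTC (hypothetical helper).

    exon_lengths: exon lengths in transcript order (5' to 3')
    ptc_pos:      0-based transcript position of the PTC's first base
    cds_start:    0-based transcript position of the start codon's first base
    """
    if not exon_lengths or not 0 <= ptc_pos < sum(exon_lengths):
        raise ValueError("ptc_pos must lie inside the transcript")

    ends = list(accumulate(exon_lengths))  # exclusive end of each exon
    idx = next(i for i, end in enumerate(ends) if ptc_pos < end)  # exon holding the PTC
    n = len(exon_lengths)

    rules = {
        "single_exon_rule": n == 1,
        # Reported separately from the single-exon case (a choice made for this sketch).
        "last_exon_rule": n > 1 and idx == n - 1,
        # PTC close to the last exon-exon junction, inside the penultimate exon.
        "penultimate_50nt_rule": n > 1 and idx == n - 2
                                 and ends[idx] - ptc_pos <= penultimate_nt,
        "long_exon_rule": exon_lengths[idx] > long_exon_nt,
        "start_proximal_rule": ptc_pos - cds_start < start_proximal_nt,
    }
    rules["nmd_escape"] = any(rules.values())
    return rules


# Three exons of 200, 120 and 300 nt; start codon at transcript position 50.
near_junction = nmd_escape_rules([200, 120, 300], ptc_pos=300, cds_start=50)
print(near_junction["penultimate_50nt_rule"], near_junction["nmd_escape"])  # True True

mid_exon = nmd_escape_rules([200, 120, 300], ptc_pos=250, cds_start=50)
print(mid_exon["nmd_escape"])  # False
```

Keeping the thresholds as keyword arguments makes it easy to compare alternative cutoffs without touching the rule logic.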
git clone https://github.com/gagneurlab/NMD-Scanner.git
cd NMD-Scanner
pip install .
# if running the script directly
python -m nmd_scanner.cli --vcf input.vcf --gtf annotation.gtf --fasta reference.fa --output results/
# option: fix exon numbering (recommended for hg19)
python -m nmd_scanner.cli --vcf input.vcf --gtf annotation.gtf --fasta reference.fa --output results/ --reassign_exons
Arguments:
- --vcf: Path to input VCF (SNVs / Indels supported; frameshifts handled)
- --gtf: Path to gene annotation (GTF)
- --fasta: Path to reference genome FASTA
- --output: Path to an existing directory (or a file path whose parent exists)
- --reassign_exons: (flag) Recompute exon numbers (useful for hg19)
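When the scanner is called from another Python script, the command line above can be assembled programmatically. The following is a minimal sketch that uses only the flags documented here; the helper names `build_nmd_scanner_command` and `run_nmd_scanner` are hypothetical and not part of the package, and the example assumes `--output` is given as a directory.

```python
import subprocess
import sys
from pathlib import Path


def build_nmd_scanner_command(vcf, gtf, fasta, output, reassign_exons=False):
    """Assemble the argument list for `python -m nmd_scanner.cli` (hypothetical helper)."""
    cmd = [
        sys.executable, "-m", "nmd_scanner.cli",
        "--vcf", str(vcf),
        "--gtf", str(gtf),
        "--fasta", str(fasta),
        "--output", str(output),
    ]
    if reassign_exons:
        cmd.append("--reassign_exons")  # recompute exon numbers, e.g. for hg19
    return cmd


def run_nmd_scanner(vcf, gtf, fasta, output, reassign_exons=False):
    """Check the inputs, make sure the output directory exists, then run the CLI."""
    for path in (vcf, gtf, fasta):
        if not Path(path).is_file():
            raise FileNotFoundError(path)
    # Assumption: `output` is a directory; it must exist before the CLI is started.
    Path(output).mkdir(parents=True, exist_ok=True)
    subprocess.run(
        build_nmd_scanner_command(vcf, gtf, fasta, output, reassign_exons),
        check=True,  # raise CalledProcessError if the scanner exits with an error
    )


if __name__ == "__main__":
    print(build_nmd_scanner_command(
        "input.vcf", "annotation.gtf", "reference.fa", "results/", reassign_exons=True))
```

Passing the arguments as a list rather than a single shell string avoids quoting problems with paths that contain spaces.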
Output:
- A CSV named <vcf_basename>_final_nmd_results.csv saved to --output, containing:
  - reconstructed reference / alternative CDS and transcript sequences (+ metadata)
  - PTC detection and start / stop-loss flags
  - NMD escape rules
  - extra features such as UTR lengths, exon counts, distances, etc.
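To pick up the results in a downstream script, the output directory can be searched for files that follow the naming pattern described above. The sketch below relies only on the documented `_final_nmd_results.csv` suffix; the helper names `find_result_csvs` and `read_header` are hypothetical, the example assumes `--output` was a directory, and no column names are assumed because they are not listed here.

```python
import csv
from pathlib import Path


def find_result_csvs(output_dir):
    """Return all NMD-Scanner result files in output_dir, sorted by name."""
    return sorted(Path(output_dir).glob("*_final_nmd_results.csv"))


def read_header(csv_path):
    """Return the column names of a result CSV without loading the whole file."""
    with open(csv_path, newline="") as handle:
        return next(csv.reader(handle), [])


if __name__ == "__main__":
    for result_file in find_result_csvs("results/"):
        print(result_file.name, read_header(result_file))
```

Inspecting the header first is a cheap way to see which annotation columns a given run produced before loading the full table, for example with `pandas.read_csv`.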
Instead of running the entire pipeline, you can import NMD-Scanner in Python and call only specific components. This is useful if you want to:
- only reconstruct transcript / CDS sequences
- only compute NMD escape rules
- integrate NMD-Scanner into a larger workflow
- build custom features
For reconstructing reference and alternative coding and transcript sequences, PTC detection and start / stop-loss information:
import pandas as pd
import pyranges as pr
from pyfaidx import Fasta
import nmd_scanner
vcf = nmd_scanner.read_vcf("input.vcf")
gtf_pr = nmd_scanner.read_gtf("annotation.gtf")
fasta = Fasta("reference.fa")
# Optional: fix exon numbering (recommended for hg19)
gtf_pr = nmd_scanner.compute_exon_numbers(gtf_pr)
gtf_df = gtf_pr.df
cds_df = gtf_df[gtf_df["Feature"] == "CDS"]
exons_df = gtf_df[gtf_df["Feature"] == "exon"].copy()
exons_df["exon_length"] = exons_df["End"] - exons_df["Start"]
results = nmd_scanner.extract_ptc(cds_df, vcf, fasta, exons_df, output="tmp/")
Add NMD escape rules (last exon rule, 50 nt penultimate rule, long exon rule, start-proximal rule, single-exon rule, NMD escape) to the results computed above:
nmd_results = results.apply(nmd_scanner.evaluate_nmd_escape_rules, axis=1, result_type='expand')
results = pd.concat([results, nmd_results], axis=1)
Add extra NMD-related features (UTR lengths, exon counts, PTC-related features) to the results computed above:
extra_features = results.apply(nmd_scanner.add_nmd_features, axis=1, result_type='expand')
results = pd.concat([results, extra_features], axis=1)
All source code in this repository is licensed under the MIT License.
Schröder, C.H. (2025). Enhanced Aberrant Gene Expression Prediction across Human Tissues. Master's Thesis, Technical University of Munich / Ludwig-Maximilians-Universität München.