SVPG (Structural Variant detection based on Pangenome Graph) is a computational tool designed for structural variation (SV) detection and efficient pangenome graph augmentation. With the growing availability of long-read sequencing data and pangenome references, SVPG fills a critical gap by enabling accurate SV discovery and scalable integration of new genomes into existing pangenome graphs.
- Dual SV detection modes:
  - Pangenome-guided mode: Extracts SV-supporting reads from BAM files and realigns them to a pangenome reference graph, then analyzes the topological and path transition features of the graph alignment to detect germline SVs with high precision.
  - Graph-based mode: Directly resolves read-to-graph alignments to discover de novo SVs within haplotype paths of the pangenome graph. This is ideal for reference-bias-free discovery of low-frequency/somatic SVs without relying on prior SV databases or annotations.
- Sensitive and accurate SV detection: Demonstrates superior performance in benchmarks against state-of-the-art SV callers across both population-wide germline and individual-specific SVs.
- Rapid graph augmentation: Designed to work seamlessly with the graph-call mode, it accelerates pangenome augmentation by nearly an order of magnitude compared to traditional de novo assembly methods on cohorts of dozens of samples, enabling fast and scalable integration of new samples.

$ pip install svpg
or
$ conda install svpg
or
$ git clone https://github.com/coopsor/SVPG.git && cd SVPG/ && pip install .

- Python >= 3.10 (tested on v3.10.4)
- pysam >= 0.22 for BAM file processing
- numpy >= 1.26.4 for numerical computing
- scipy >= 1.13.1 for scientific computing
- pyabpoa >= 1.5.4 for consensus sequence generation
The following tools must be available in your system path (we recommend installing them via conda):
- minigraph >= 0.21 for pangenome graph alignment in pangenome-guided mode
- mappy >= 2.28 for consensus sequence realignment in pangenome-guided mode
- bcftools >= 1.20 for VCF processing in augmentation mode
- truvari >= 3.1.0 for VCF merging in augmentation mode
- Pangenome-guided mode requires read-to-reference alignments in a coordinate-sorted and indexed BAM file. If you start with sequencing reads (e.g., FASTA/FASTQ files), you need to map them to a linear reference genome first.
- SVPG supports parallel processing and uses 16 threads by default. This value can be changed with, e.g., the option `-t 4`.
- SVPG was evaluated on the first and second releases of the HPRC pangenome graphs (v3.1 and v4.1). Benchmark results indicate that SVPG achieves nearly identical performance on both versions.
- By default, SVPG outputs all SVs supported by more than one read. In pangenome-guided mode, users can filter for genotype-assigned variants using `FILTER=PASS` to obtain a higher-confidence SV set. In addition, users may manually adjust the minimum read support threshold with the `--min_support`/`-s` parameter based on sequencing depth, using the following table for reference. This is particularly useful for ultra-low-coverage datasets (<10×) to preserve recall, as well as for graph-based mode, where genotyping is not available.

| Depth (×) | ONT | HiFi |
|---|---|---|
| <10 | 2 | 1 |
| [10, 20) | 3 | 2 |
| [20, 50) | 4 | 3 |
| ≥50 | 10 | 4 |

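
The depth table above can be encoded as a small lookup when scripting runs over many samples. The following is only an illustrative sketch: `recommended_min_support` is a hypothetical helper that is not part of SVPG, and it assumes the depth bins are half-open exactly as written in the table.

```python
def recommended_min_support(depth: float, read_type: str) -> int:
    """Suggest a --min_support value from sequencing depth and read type.

    Hypothetical helper (not shipped with SVPG); values are copied from
    the depth table in this README.
    """
    if read_type not in ("ont", "hifi"):
        raise ValueError("read_type must be 'ont' or 'hifi'")
    if depth < 0:
        raise ValueError("depth must be non-negative")

    # (exclusive upper bound of the depth bin, ONT value, HiFi value)
    bins = [
        (10, 2, 1),
        (20, 3, 2),
        (50, 4, 3),
        (float("inf"), 10, 4),
    ]
    for upper, ont, hifi in bins:
        if depth < upper:
            return ont if read_type == "ont" else hifi
    raise AssertionError("unreachable: the last bin has no upper bound")
```

The returned value can then be passed to SVPG via `-s`, for example `-s 4` for a 30× ONT sample.
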
svpg call --working_dir svpg_out/ --bam sample.bam --ref hg38.fa --gfa pangenome.gfa --read ont

The called variants are saved as variants.vcf in the specified working directory. The `-o` option can be used to specify the output file name.
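
To keep only the genotype-assigned, higher-confidence calls, the output VCF can be filtered on its FILTER column. The sketch below is a minimal, dependency-free illustration; `keep_pass_records` is a hypothetical helper, and it assumes a standard tab-separated VCF in which FILTER is the seventh column.

```python
from typing import Iterable, Iterator


def keep_pass_records(vcf_lines: Iterable[str]) -> Iterator[str]:
    """Yield all header lines plus the records whose FILTER field is PASS."""
    for line in vcf_lines:
        if line.startswith("#"):
            # Header and column-name lines are passed through unchanged.
            yield line
            continue
        fields = line.rstrip("\n").split("\t")
        # VCF columns: CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO, ...
        if len(fields) > 6 and fields[6] == "PASS":
            yield line
```

A typical use would be to open `svpg_out/variants.vcf`, pass the file object to `keep_pass_records`, and write the yielded lines to a new file. Since bcftools is already a dependency, `bcftools view -f PASS` should give an equivalent result.
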
- Graph-based mode requires read-to-graph alignments in GAF format. If you start with sequencing reads (e.g., FASTA/FASTQ files), you need to map them to a pangenome graph first. We recommend producing the alignments with minigraph.
- Since minigraph by default outputs stable coordinates in rGFA format, SVPG requires the `--vc` option to be enabled during alignment to support more general GFA formats (e.g., GraphAligner alignment results).
minigraph -cx lr --vc -t 64 pangenome.gfa sample.fasta > sample.gaf
svpg graph-call --working_dir svpg_out/ --ref hg38.fa --gfa pangenome.gfa --gaf sample.gaf --read ont -s 3

- SVPG leverages a pangenome as a panel for filtering germline and population-level SVs, and therefore outputs tumor-only SVs by default. For tumor/normal paired analysis, we recommend running the two samples separately and then integrating the results with our script to achieve optimal performance.
svpg graph-call --working_dir tumor_out/ --ref hg38.fa --gfa pangenome.gfa --gaf tumor.gaf --read hifi -s 3
svpg graph-call --working_dir normal_out/ --ref hg38.fa --gfa pangenome.gfa --gaf normal.gaf --read hifi -s 1
python scripts/vcf_specific.py tumor_out/variants.vcf normal_out/variants.vcf tumor_specific.vcf

This procedure selects SVs that are present only in the tumor sample and absent from the matched normal.
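
To make the idea of tumor-specific selection concrete, here is a minimal sketch of one way such a comparison could work. It is not the logic of `scripts/vcf_specific.py`: the matching rule (same chromosome, same `SVTYPE`, start positions within `max_dist` bp) and the default of 500 bp are assumptions chosen purely for illustration.

```python
from typing import List, Tuple


def parse_sv(line: str) -> Tuple[str, int, str]:
    """Extract (chromosome, position, SVTYPE) from one VCF record line."""
    fields = line.rstrip("\n").split("\t")
    info = {}
    for item in fields[7].split(";"):
        key, _, value = item.partition("=")
        info[key] = value
    return fields[0], int(fields[1]), info.get("SVTYPE", "")


def tumor_specific(tumor_lines: List[str], normal_lines: List[str],
                   max_dist: int = 500) -> List[str]:
    """Return tumor records with no matching record in the normal sample.

    Two records match when chromosome and SVTYPE agree and their positions
    differ by at most max_dist (an assumed, illustrative tolerance).
    """
    normal = [parse_sv(l) for l in normal_lines if not l.startswith("#")]
    kept = []
    for line in tumor_lines:
        if line.startswith("#"):
            continue
        chrom, pos, svtype = parse_sv(line)
        matched = any(
            c == chrom and t == svtype and abs(p - pos) <= max_dist
            for c, p, t in normal
        )
        if not matched:
            kept.append(line)
    return kept
```

A production comparison would usually also consider SV length and sequence similarity; the sketch only shows the set-difference structure of the step.
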
SVPG provides a streamlined pipeline to rapidly embed de novo SVs detected from graph-based alignment back into the pangenome graph.
To use this feature, users should place a directory containing the raw sequencing data (e.g., FASTA/FASTQ files) of new samples under the specified working_dir path. For example:
working_dir/
├── sample_1/
│ └── sample_1.fasta
├── sample_2/
│ └── sample_2.fasta

SVPG will automatically detect SVs in graph-based mode and process the resulting VCFs for graph augmentation. The output file augment.gfa is placed in the given working directory.

svpg augment --working_dir svpg_out/ --ref hg38.fa --gfa pangenome.gfa --read hifi

Alternatively, you may provide a .tsv file listing the paths to the FASTA files of the new samples. For example, the sample.tsv file may look like this (the sample names must differ from each other):

/path/to/sample_1.fasta
/path/to/sample_2.fasta

Then run:

svpg augment --working_dir svpg_out/ --sample_list sample.tsv --ref hg38.fa --gfa pangenome.gfa --read hifi
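
When many samples are involved, the sample list can be generated programmatically. The sketch below builds the one-path-per-line content and rejects duplicate sample names. `build_sample_list` is a hypothetical helper, and deriving the sample name from the file name up to the first dot is an assumption; the requirement stated above is only that sample names differ.

```python
from pathlib import PurePosixPath
from typing import Iterable


def build_sample_list(fasta_paths: Iterable[str]) -> str:
    """Return sample-list text with one FASTA path per line.

    Raises ValueError if two files would map to the same sample name
    (assumed here to be the file name up to the first dot).
    """
    seen = {}
    lines = []
    for raw in fasta_paths:
        path = PurePosixPath(raw)
        name = path.name.split(".")[0]
        if name in seen:
            raise ValueError(
                f"duplicate sample name {name!r}: {seen[name]} and {path}"
            )
        seen[name] = path
        lines.append(str(path))
    return "\n".join(lines) + "\n"
```

Writing the returned string to `sample.tsv` produces a file in the layout shown above, which can then be passed to `--sample_list`.
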
| Parameter | Description | Default |
|---|---|---|
| `--working_dir` | Working directory in which output files are stored. | Required |
| `--bam` | Coordinate-sorted and indexed BAM file with aligned long reads. | Required for call mode |
| `--gaf` | GAF file with long reads aligned to the pangenome graph (.gaf). | Required for graph-call mode |
| `--ref` | Reference genome used for pangenome construction (.fa); it also serves as the coordinate system for SVPG's SV call output. | Required |
| `--gfa` | Pangenome reference file that the long reads were aligned to (.gfa). | Required |
| `--read` | Type of sequencing reads: `ont` for Oxford Nanopore, `hifi` for PacBio HiFi. | hifi |
| `--min_support`/`-s` | Minimum read support threshold for SV calling. Adjust based on sequencing depth. | 2 |
| `--num_threads`/`-t` | Number of threads to use for parallel processing. | 16 |
| `--min_mapq` | Minimum mapping quality for reads to be considered in SV detection. | 20 |
| `--min_sv_size` | Minimum size of SVs to be detected. | 50 |
| `--max_sv_size` | Maximum size of SVs to be detected. Set to -1 for unlimited size (recommended for somatic SVs in graph-call mode). | 1,000,00 |
| `--max_merge_threshold` | Maximum distance between SV signals to be merged. | 50 for hifi reads and 500 for ont reads |
| `--ultra_split_size` | Ignore extremely large BNDs from split alignments unless they are supported by enough reads, which may be regarded as false-negative intra-chromosomal translocation. | 1000000 |
| `--alt_consensus` | Generate alternative allele consensus sequences for insertions using pyabpoa. | Disabled |
| `--noseq` | Disable sequence extraction for SVs. Useful for ultra-large SVs to save time and disk space. | Disabled |
| `--types` | Types of SVs to call: DEL, INS, DUP, INV, BND. Separate multiple types with commas. | DEL,INS,DUP,INV,BND |
| `--contigs` | List of chromosomes on which to call SVs (e.g., `--contigs chr1 chr2 chrX`). | All chromosomes |
| `--skip_genotype` | Skip the genotyping step to speed up call mode. | Disabled |
| `--realign` | Realign noisy reads to the reference for more accurate SV sequence inference in call mode. | Disabled |
| `--sample_list` | Path to a TSV file listing the paths to FASTA files of new samples for augment mode. | Optional; if not provided, all FASTA files under working_dir will be processed. |
| `--skip_call` | Skip the SV calling step and proceed directly to graph augmentation using existing VCF files in the working directory. | Disabled |
| `--out`/`-o` | Output file name. | variants.vcf for call and graph-call modes, augment.gfa for augment mode |
| `--version`/`-v` | Show the version of SVPG. | N/A |
| `--help`/`-h` | Show help message and exit. | N/A |
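
For pipelines that launch SVPG from Python, the options above can be assembled into an argument list. The following sketch covers only a few of the documented flags for `svpg call`; `build_call_command` is a hypothetical wrapper, not an SVPG API, and it merely builds the command without running it.

```python
from typing import List, Optional, Sequence


def build_call_command(working_dir: str, bam: str, ref: str, gfa: str,
                       read: str = "hifi",
                       min_support: Optional[int] = None,
                       threads: Optional[int] = None,
                       types: Optional[Sequence[str]] = None) -> List[str]:
    """Assemble an `svpg call` argument list from documented options."""
    if read not in ("ont", "hifi"):
        raise ValueError("read must be 'ont' or 'hifi'")
    cmd = [
        "svpg", "call",
        "--working_dir", working_dir,
        "--bam", bam,
        "--ref", ref,
        "--gfa", gfa,
        "--read", read,
    ]
    # Optional flags are added only when set, so SVPG's defaults apply otherwise.
    if min_support is not None:
        cmd += ["--min_support", str(min_support)]
    if threads is not None:
        cmd += ["--num_threads", str(threads)]
    if types:
        cmd += ["--types", ",".join(types)]
    return cmd
```

The resulting list can be handed to `subprocess.run(cmd, check=True)`, which avoids shell quoting issues with file paths.
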
- SVPG's pangenome-guided mode relies on minigraph to realign SV signature reads to the pangenome graph. Although this step introduces some overhead, this process is relatively fast: in our tests on the HG002 sample, realignment took approximately 10 minutes for ONT (50×) data and 4 minutes for HiFi (48×) data.
- The `--realign` module provides more accurate breakpoint resolution in regions that are hard to align to the graph (for example, LCRs). On the latest HG002-Q100 benchmark, this module yields measurable performance improvements. However, it relies on pyabpoa and mappy to perform local realignment, which introduces additional computational overhead (e.g., ~1 hour extra for 48× HG002 HiFi data). As this feature is still experimental, we recommend enabling it only in analyses that require base-pair-level breakpoint accuracy.
- The graph-based mode currently does not support genotyping. Users should manually adjust the minimum read support threshold using the `--min_support`/`-s` parameter based on sequencing depth to balance sensitivity and precision.

Refer to our paper for further details and citation:
Hu, H. et al. SVPG: A pangenome-based structural variant detection approach and rapid augmentation of pangenome graphs with new samples. bioRxiv, 2025.2007.2011.664486 (2025).
For questions or support, please open an issue on GitHub or contact the authors at hhengwork@gmail.com.
