
Evaluation of BaseNumber, a GPU-accelerated variant calling tool on whole genome sequencing datasets


WCH-IRD/BaseNumber


Benchmarking germline variant calling performance of a GPU-accelerated tool on whole genome sequencing datasets

We comprehensively evaluated germline variant calling by BaseNumber, a GPU-accelerated tool, using whole genome sequencing (WGS) datasets from several sources: reference standards from the Genome in a Bottle (GIAB) and Golden Standard of China Genome (GSCG) projects, resequenced GSCG samples, and 100 in-house samples from the Genome Sequencing of Rare Diseases (GSRD) project. The variant calls were compared with the reference standards and with the results of the Burrows-Wheeler Aligner (BWA) and Genome Analysis Toolkit (GATK) pipeline. Against the standard reference, BaseNumber achieved high precision (99.32%) and recall (99.86%). The BaseNumber and GATK pipelines produced nearly identical output, with a mean F1 score of 99.69%. BaseNumber took 23 minutes on average to analyze a 48X WGS sample, 215.33 times faster than the GATK workflow.
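
As a rough illustration of how these figures relate, the sketch below computes an F1 score (the harmonic mean of precision and recall) from the precision and recall reported against the standard reference, and the baseline runtime implied by the 23-minute average and the 215.33-fold difference. The function names are hypothetical. The F1 value derived here is not the 99.69% quoted above, which measures agreement between BaseNumber and GATK, and the implied runtime is only an estimate that assumes the fold difference applies directly to the average runtime.

```shell
#!/usr/bin/env bash
# F1 is the harmonic mean of precision and recall.
f1_score() {
  awk -v p="$1" -v r="$2" 'BEGIN { printf "%.2f\n", 2 * p * r / (p + r) }'
}

# Implied baseline runtime in hours = accelerated minutes * fold difference / 60.
implied_hours() {
  awk -v m="$1" -v s="$2" 'BEGIN { printf "%.1f\n", m * s / 60 }'
}

f1_score 99.32 99.86      # precision and recall (percent) vs. the standard reference; prints 99.59
implied_hours 23 215.33   # figures taken from the text; prints 82.5
```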

Evaluation workflow

Figure 1. Workflow of the evaluation study

Accuracy

Figure 2. Precision and recall of BaseNumber calling results compared with the GIAB and GSCG standards

Efficiency and consistency

Figure 3. Comparison of speed and consistency between BaseNumber and GATK

Data availability

The GIAB and GSCG reference data used in this study are available at the following URLs:

GIAB standard FASTQ files: https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/data/

GIAB standard VCF files (v3.3.2): https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/release/NA12878_HG001/NISTv3.3.2/GRCh38/

GSCG standard FASTQ files: http://chinese-quartet.org/#/data/download/quartet-genomics

GSCG standard VCF files (v1.0): http://chinese-quartet.org/#/data/download/quartet-genomics

The VCF files of GIAB and GSCG samples called in this study are available through the following URL: https://vcf-for-benchmark-paper.tos-cn-beijing.volces.com/BaseNumber_VCFs.zip

Availability of BaseNumber Demo

BaseNumber is available for trial on AWS in the US East (N. Virginia) region via https://us-east-1.console.aws.amazon.com/. To obtain the AMI image, please contact bufengxiao@wchscu.cn.

To use the BaseNumber image, go to the "EC2" service.

Choose "Images"/"AMI Catalog" in the left navigation panel and select "My AMIs" in the main panel. An AMI image named "germline" should be listed.

Choose "Images"/"AMIs" in the left navigation panel, select the "germline" image at the top, and launch an instance from the AMI.

The "p3.8xlarge" instance type is recommended for launching the instance.

Configure the instance details, including the number of instances you want to launch, networking options, and storage options. Ensure that the "io2" volume type is selected for the storage (64,000 IOPS and more than 500 GB are recommended).

If you encounter any account limits, you may need to request an increase from AWS Support; these limits vary with account type and usage. Configure any additional settings you require, such as security groups, IAM roles, or user data. Review the instance details and click "Launch" to start the instance. When prompted, select an existing key pair or create a new one for secure access to your instance, then click "Launch Instances."
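
For readers who prefer the command line, the console steps above could be approximated with the AWS CLI. The sketch below only builds and prints an `aws ec2 run-instances` command for review; it does not launch anything. The AMI ID, key pair name, and root device name are hypothetical placeholders, the 512 GiB size is simply one value above the recommended 500 GB, and applying the io2 settings to the root volume is an assumption rather than something the original instructions specify.

```shell
#!/usr/bin/env bash
# Hypothetical placeholders: use the AMI ID shared with your account and your own key pair.
AMI_ID="ami-xxxxxxxxxxxxxxxxx"
KEY_NAME="my-key-pair"
ROOT_DEVICE="/dev/sda1"   # assumption: check the AMI's actual root device name

# Storage settings following the recommendations above (io2, 64000 IOPS, more than 500 GB).
BLOCK_DEVICES="[{\"DeviceName\":\"${ROOT_DEVICE}\",\"Ebs\":{\"VolumeType\":\"io2\",\"Iops\":64000,\"VolumeSize\":512}}]"

# Assemble the launch command as text so it can be inspected before use.
build_launch_cmd() {
  printf 'aws ec2 run-instances --region us-east-1 --image-id %s --instance-type p3.8xlarge --count 1 --key-name %s --block-device-mappings %s\n' \
    "$AMI_ID" "$KEY_NAME" "'$BLOCK_DEVICES'"
}

# Print the command for review; it is not executed here.
build_launch_cmd
```

Running the printed command requires an installed and configured AWS CLI, and security groups or subnets would still need to be supplied to match your account.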

Next, connect to the instance.

The trial data is in the ~/data/ folder. To run alignment, use the following script:

cd  /data
./sla -R '@RG\tID:test\tSM:test' \
-B ./bam.split/ \
./hs37d5/hs37d5.fa \
./SRR8454589_1.fastq.gz \
./SRR8454589_2.fastq.gz 
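
Before running the alignment script above, it can help to confirm that the expected inputs are in place. The following is a minimal sketch, assuming the file layout shown in the script; the `check_inputs` helper is hypothetical and not part of BaseNumber.

```shell
#!/usr/bin/env bash
# Report any alignment input that is missing under the given directory.
check_inputs() {
  local dir="$1" missing=0 f
  for f in sla hs37d5/hs37d5.fa SRR8454589_1.fastq.gz SRR8454589_2.fastq.gz; do
    if [ ! -e "$dir/$f" ]; then
      echo "missing: $dir/$f" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Check the directory given as the first argument, or the current directory by default.
if check_inputs "${1:-.}"; then
  echo "all alignment inputs found"
else
  echo "some alignment inputs are missing"
fi
```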


To run variant calling, use the following script:

./slc -P ./bam.split \
-R ./hs37d5/hs37d5.fa \
--keep-split \
-o ./tmp.vcf \
-b ./tmp.bam
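
Once variant calling finishes, a quick sanity check is to count the variant records in the output VCF, that is, every line that is not a `#` header line. The sketch below assumes an uncompressed VCF such as ./tmp.vcf; the `count_variants` helper is hypothetical, and the demo file is synthetic rather than BaseNumber output.

```shell
#!/usr/bin/env bash
# Count variant records in an uncompressed VCF: non-empty lines that do not start with '#'.
count_variants() {
  awk '!/^#/ && NF { n++ } END { print n + 0 }' "$1"
}

# Tiny synthetic VCF for demonstration only (not BaseNumber output).
demo_vcf="$(mktemp)"
printf '##fileformat=VCFv4.2\n#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n1\t10177\t.\tA\tAC\t50\tPASS\t.\n1\t10352\t.\tT\tTA\t50\tPASS\t.\n' > "$demo_vcf"
count_variants "$demo_vcf"   # prints 2
rm -f "$demo_vcf"
```

On the trial instance, the equivalent call would be `count_variants ./tmp.vcf`.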

More details can be found in the help output of "sla" and "slc".
