Illumina SARS-CoV-2 data processing using the SIGNAL pipeline
This repository provides a WDL wrapper for running the SIGNAL pipeline to process Illumina paired-end SARS-CoV-2 sequencing data.
An input template file with some defaults pre-defined can be found here.
| Input | Description |
|---|---|
| `accession` | Sample ID |
| `fastq_R1s`, `fastq_R2s` | Array of paired FASTQ file locations; paired files should be at the same index in each array |
| `scheme_bed` | The BED-format primer scheme used to prepare the library |
| `viral_reference_genome` | The SARS-CoV-2 reference genome |
| `viral_reference_feature_coords` | Feature coordinates for the SARS-CoV-2 reference genome |
| `viral_reference_contig_name` | [MN908947.3] |
| `primer_pairs_tsv` | Primer pair TSV file; used for iVar's amplicon filter. This file is a headerless TSV containing one row per primer pair, with the LEFT primer names in column 1 and the RIGHT primer names in column 2. |
| `amplicon_bed` | BED-formatted amplicon locations |
| `container_registry` | Registry that hosts workflow containers. All containers are hosted in DNAstack's Docker Hub [dnastack] |
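
The `primer_pairs_tsv` layout described in the table (headerless, tab-separated, LEFT name in column 1, RIGHT name in column 2) can be sanity-checked before a run. The following is a minimal sketch, not part of the workflow itself: the function name `check_primer_pairs` is hypothetical, and the amplicon check assumes ARTIC-style primer names of the form `<scheme>_<number>_LEFT` / `<scheme>_<number>_RIGHT`, which may not hold for every scheme.

```shell
# check_primer_pairs FILE  (hypothetical helper, for illustration only)
# Prints a message for every malformed row and returns non-zero if any row:
#   - does not have exactly two tab-separated columns,
#   - does not have a LEFT name in column 1 and a RIGHT name in column 2,
#   - pairs primers whose names differ before the _LEFT/_RIGHT part
#     (assumes ARTIC-style names such as nCoV-2019_1_LEFT).
check_primer_pairs() {
    awk -F '\t' '
        NF != 2 {
            printf "line %d: expected 2 tab-separated columns, found %d\n", NR, NF
            bad = 1
            next
        }
        $1 !~ /_LEFT/ || $2 !~ /_RIGHT/ {
            printf "line %d: expected LEFT then RIGHT, found %s and %s\n", NR, $1, $2
            bad = 1
            next
        }
        {
            left = $1; right = $2
            sub(/_LEFT.*$/, "", left)     # strip _LEFT and any suffix such as _alt0
            sub(/_RIGHT.*$/, "", right)   # strip _RIGHT and any suffix
            if (left != right) {
                printf "line %d: %s and %s belong to different amplicons\n", NR, $1, $2
                bad = 1
            }
        }
        END { exit bad + 0 }
    ' "$1"
}
```

Calling `check_primer_pairs nCoV-2019.primer_pairs.tsv` prints nothing and returns 0 when every row is well formed; any reported line is worth inspecting by hand, since a shifted row would pair the wrong primers.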
Primer schemes differ based on the protocol used by the sequencing lab. Some common schemes can be downloaded from the official artic-network GitHub. Additional schemes can be found in the SIGNAL repository. Primers from these locations map to the inputs as follows:
- `scheme_bed`: ends in `.primer.bed`
- `amplicon_bed`: ends in `.scheme.bed`
- `primer_pairs_tsv`: this file is not provided directly, but can be generated from the `amplicon_bed` file
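
The suffix-to-input mapping above can be sketched as a small shell helper that scans a downloaded primer scheme directory and reports which file belongs to which input. This is only an illustration: `find_scheme_inputs` is a hypothetical name, and it assumes the scheme files sit directly in a single directory.

```shell
# find_scheme_inputs DIR  (hypothetical helper, for illustration only)
# Prints one "input_name<TAB>path" line for each primer scheme file in DIR,
# matched by the file-name suffixes listed above.
find_scheme_inputs() {
    scheme_dir=$1
    for f in "$scheme_dir"/*.primer.bed; do
        # The glob stays unexpanded when nothing matches, so test for existence
        if [ -e "$f" ]; then
            printf 'scheme_bed\t%s\n' "$f"
        fi
    done
    for f in "$scheme_dir"/*.scheme.bed; do
        if [ -e "$f" ]; then
            printf 'amplicon_bed\t%s\n' "$f"
        fi
    done
    return 0
}
```

Running `find_scheme_inputs path/to/scheme_dir` lists the candidate `scheme_bed` and `amplicon_bed` files; `primer_pairs_tsv` is not listed because it has to be generated separately.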
Example command to generate primer_pairs_tsv using the ARTIC V3 scheme bed:
paste \
<(cut -f 4 nCoV-2019.scheme.bed | sort -t _ -k 2 -g | grep LEFT) \
<(cut -f 4 nCoV-2019.scheme.bed | sort -t _ -k 2 -g | grep RIGHT) \
> nCoV-2019.primer_pairs.tsv

| Output | Description |
|---|---|
| `ivar_vcf`, `ivar_vcf_index` | Variants and index output by iVar |
| `ivar_assembly` | Genome assembly generated by iVar |
| `freebayes_vcf`, `freebayes_vcf_index` | Variants and index output by Freebayes |
| `freebayes_assembly` | Genome assembly generated by Freebayes |
| `summary` | Pipeline metrics |
| `lineage_metadata` | Pangolin lineage assignment metadata |
| `bam` | Reads aligned to the SARS-CoV-2 reference genome |
Docker image definitions can be found in our bioinformatics-public-docker-images repo.
All containers are publicly hosted in DNAstack's container registry.
Note that the SIGNAL Docker container is ~10 GB. Reference data is bundled into the image so that the pipeline can be used at scale in AWS, where EBS auto-scaling sometimes cannot expand rapidly enough to accommodate hundreds of samples running in parallel. Including the reference data in the Docker container seems to solve this issue, but does make the image somewhat unwieldy.