Pangenome-Informed Genome Assembly (PIGA)

Introduction

Pangenome-Informed Genome Assembly (PIGA) is a workflow for population-scale diploid genome assembly. It employs a pangenome graph as a unified framework to integrate sequence information across individuals and to perform joint diploid genome assembly.

Compared with current assembly methods, the PIGA workflow fully utilizes multiple sources of information (long reads, short reads, internal haplotypes, and an external assembly panel) and is well suited to low- and modest-coverage data.

Installation & Setup

Step 1: Clone the Repository

First, clone the PIGA repository from GitHub:

git clone https://github.com/JianYang-Lab/PIGA.git
cd PIGA

Step 2: Create and Activate Conda Environment

PIGA uses Snakemake and requires several software dependencies. The easiest way to install them is by creating a dedicated conda environment from the provided YAML file.

# Create the conda environment named 'piga'
# Tip: substituting 'mamba' for 'conda' here speeds up environment solving
conda env create -f environment.yaml -n piga

# Activate the environment to use PIGA
conda activate piga

Step 3: Build and Download Additional Tools

Some tools are not available via conda and need to be built from source or downloaded separately.

# Build and download additional tools
bash build.sh

Test Dataset and Configuration

Download the test dataset

# Download the test dataset
cd test_data
bash download.sh

Configure the workflow (config/config.yaml)

The configuration file should contain the following keys (an illustrative sketch follows the list):

  • samples: Path to a tab-delimited text file listing all samples, with:
    • Column 1: Sample name
    • Column 2: Sample sex
  • sr_fastqs: Paths to the paired-end short-read FASTQ files. Use {sample} as a wildcard.
  • lr_hifi_fastqs: Paths to PacBio HiFi long-read FASTQ files. Use {sample} as a wildcard.
  • lr_zmw_fastqs: Paths to PacBio ZMW FASTQ files (one representative read selected per ZMW). Use {sample} as a wildcard.
  • lr_subreads_bam: Paths to PacBio subreads BAM files. Use {sample} as a wildcard.
  • reference:
    • CHM13: Path to the CHM13 reference genome in FASTA format.
    • GRCh38: Path to the GRCh38 reference genome in FASTA format.
  • GATK_Resource: Paths to GATK resource datasets used for variant quality control.
  • external_pangenome: Path to the external pangenome graph in GBZ format.
  • par_region: Path to the pseudoautosomal region (PAR) BED file for CHM13.
  • external_assembly_list: Path to a file listing external genome assemblies used to construct the PIGA pangenome.
  • train_sample_list: Path to a tab-delimited file listing matched assemblies from training samples used for model training, with:
    • Column 1: Names of PIGA draft assemblies from training samples
    • Column 2: Names of matched high-quality assemblies from training samples, used as the truth
  • prefix: Prefix used for naming output files.
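
For illustration, a configuration might look like the sketch below. Every path, filename pattern, and the exact nesting of keys are hypothetical placeholders, not the authoritative schema; consult the template config/config.yaml shipped with the repository for the exact layout.

# Hypothetical sketch of config/config.yaml; all paths below are placeholders
samples: "config/samples.tsv"            # tab-delimited: <sample_name><TAB><sex>
sr_fastqs: "data/{sample}/{sample}.sr_R{read}.fastq.gz"   # assumed pattern; the real schema may list R1/R2 separately
lr_hifi_fastqs: "data/{sample}/{sample}.hifi.fastq.gz"
lr_zmw_fastqs: "data/{sample}/{sample}.zmw.fastq.gz"
lr_subreads_bam: "data/{sample}/{sample}.subreads.bam"
reference:
  CHM13: "ref/chm13.fasta"
  GRCh38: "ref/grch38.fasta"
GATK_Resource: "resources/gatk/"         # placeholder; the real key may expect per-dataset paths
external_pangenome: "ref/external_pangenome.gbz"
par_region: "ref/chm13_par.bed"
external_assembly_list: "config/external_assemblies.txt"
train_sample_list: "config/train_samples.tsv"   # draft_assembly_name<TAB>truth_assembly_name
prefix: "piga_cohort"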

Running the Workflow

After setting up the configuration file (config/config.yaml) with your sample data, you can run the entire PIGA workflow. There are two main ways to execute it:

Option 1: Local Execution

This method is suitable for running PIGA on a single, powerful machine. It will use the specified number of cores on the local machine.

# Dry run test
snakemake --dry-run -s Snakefile --cores 32 --configfile config/config.yaml --profile ./profile/config_local/
# Run PIGA locally.
# Recommended RAM: >300 GB (with 32 cores)
# Memory requirements scale with the number of cores utilized, allowing flexible resource allocation.
snakemake -s Snakefile --cores 32 --configfile config/config.yaml --profile ./profile/config_local/
# An interrupted run can be resumed by first unlocking the working directory (--unlock) and then rerunning with --rerun-incomplete.
snakemake --unlock -s Snakefile --cores 32 --configfile config/config.yaml --profile ./profile/config_local/
snakemake --rerun-incomplete -s Snakefile --cores 32 --configfile config/config.yaml --profile ./profile/config_local/
# Warnings about TensorRT and parse_sam_aux_fields from DeepVariant can be safely ignored.

Option 2: Cluster Execution (with Profile)

This method is designed for high-performance computing (HPC) environments and uses a job submission system (e.g., SLURM) to distribute the workload across a cluster.

# Run PIGA using a workflow profile to submit jobs to a cluster
snakemake -s Snakefile --cores 32 --jobs 32 --configfile config/config.yaml --profile ./profile/config_slurm/
# An interrupted run can be resumed by rerunning with --rerun-incomplete.
snakemake --rerun-incomplete -s Snakefile --cores 32 --jobs 32 --configfile config/config.yaml --profile ./profile/config_slurm/

Note: The provided profile is configured for the SLURM job scheduler. You can customize the cluster settings (e.g., switch to a different scheduler or change resource allocations) by editing the profile's configuration file, as sketched below.
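
In Snakemake, a profile directory holds a config.yaml whose keys mirror command-line options. The sketch below is a hypothetical example of the kind of settings you might adjust for SLURM (partition, memory, job count); all values are placeholder assumptions, and the profile/config_slurm/config.yaml shipped with the repository is the authoritative version.

# Hypothetical sketch of a Snakemake SLURM profile (profile/config_slurm/config.yaml)
jobs: 32                        # maximum number of concurrently submitted jobs
cluster: >-
  sbatch
  --partition={resources.partition}
  --cpus-per-task={threads}
  --mem={resources.mem_mb}
default-resources:
  - partition="compute"         # replace with a partition that exists on your cluster
  - mem_mb=8000
rerun-incomplete: true
keep-going: true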

Output

The main output files include:

  • c1_call_sr_snv/merged_vcf/{config['prefix']}.gatk.variant_recalibrated.filter.vcf.gz: Short-read SNV callset

  • c2_call_lr_snv/merged_vcf/{config['prefix']}.deepvariant.whatshap.beagle.vcf.gz: Long-read SNV callset

  • c3_merge_snv/merged_vcf/{config['prefix']}.consensus.merfin.vcf.gz: SNV callset combining short-read and long-read callsets

  • c4_phase_snv/sample_vcf/{sample}/{sample}.shapeit.vcf.gz: Phased SNVs of each sample

  • c5_personal_ref/sample_reference/{sample}/{sample}.personal_ref.fasta: Personalized reference of each sample

  • c6_draft_assembly/sample_assembly/{sample}/assembly/{sample}.{hap1,hap2}.fasta: Draft diploid assembly of each sample

  • c7_graph_construction/subgraph/subgraph_{id}/{config['prefix']}_subgraph_{id}.seqwish.smoothxg.gfaffix.gfa: Constructed pangenome of each subgraph

  • c7_graph_construction/subgraph/subgraph_{id}/{config['prefix']}_subgraph_{id}.seqwish.smoothxg.gfaffix.ml_filter.variant_project.gfaffix.gfa: Simplified pangenome of each subgraph

  • c7_graph_construction/graph_merge/{config['prefix']}.merge.assembly.gbz: Merged pangenome across all subgraph pangenomes

  • c8_diploid_path_infer/sample_assembly/{sample}/{sample}.{hap1,hap2}.complete_assembly.polish.clip.fasta: Final diploid assembly of each sample

Documentation

PIGA is organized into the modules listed below, each containing several commands that can be executed step by step; a detailed tutorial is provided for each module. To run a step independently when it requires intermediate files produced by other steps, copy the needed files from the test_data/example_output directory, which is downloaded by download.sh.

  • call_sr_snv: detect SNVs using short reads.
  • call_lr_snv: detect SNVs using long reads.
  • merge_snv: merge the short-read SNV callset and long-read SNV callset.
  • phase_snv: perform SNV haplotype phasing leveraging long-read and population information.
  • generate_personal_reference: generate a personalized reference by modifying the reference genome with homozygous variants genotyped from the external pangenome.
  • draft_assembly: partition long reads into haplotypes and produce draft diploid assemblies.
  • construct_pangenome: construct and refine the base-level pangenome.
  • simplify_pangenome: simplify the pangenome.
  • merge_pangenome: merge pangenome subgraphs into the final pangenome.
  • infer_diploid_path: reconstruct the final diploid assembly by inferring the diploid paths.

License

This project is released under the MIT License.
