Pangenome-Informed Genome Assembly (PIGA)

Introduction

Pangenome-Informed Genome Assembly (PIGA) is a workflow for population-scale diploid genome assembly. It employs a pangenome graph as a unified framework to integrate sequence information across individuals and to perform joint diploid genome assembly.

Compared with current assembly methods, the PIGA workflow fully utilizes multiple sources of information (long reads, short reads, internal haplotypes, and an external assembly panel) and is well suited to low- and modest-coverage data.

Installation & Setup

Step 1: Clone the Repository

First, clone the PIGA repository from GitHub:

git clone https://github.com/JianYang-Lab/PIGA.git
cd PIGA

Step 2: Create and Activate Conda Environment

PIGA uses Snakemake and requires several software dependencies. The easiest way to install them is by creating a dedicated conda environment from the provided YAML file.

# Create the conda environment named 'piga'
# Tip: substituting 'mamba' for 'conda' here speeds up environment solving
conda env create -f environment.yaml -n piga

# Activate the environment to use PIGA
conda activate piga

Step 3: Build and Download Additional Tools

Some tools are not available via conda and need to be built from source or downloaded separately.

# Build and download additional tools
bash build.sh

Test Dataset and Configuration

Download the test dataset

# Download the test dataset
cd test_data
bash download.sh

Configure the workflow (config/config.yaml)

The configuration file should contain the following keys (an illustrative sketch follows the list):

  • samples: Path to a tab-delimited text file listing all samples, with:
    • Column 1: Sample name
    • Column 2: Sample sex
  • sr_fastqs: Paths to the paired-end short-read FASTQ files. Use {sample} as a wildcard.
  • lr_hifi_fastqs: Paths to PacBio HiFi long-read FASTQ files. Use {sample} as a wildcard.
  • lr_zmw_fastqs: Paths to PacBio ZMW FASTQ files (one representative read selected per ZMW). Use {sample} as a wildcard.
  • lr_subreads_bam: Paths to PacBio subreads BAM files. Use {sample} as a wildcard.
  • reference:
    • CHM13: Path to the CHM13 reference genome in FASTA format.
    • GRCh38: Path to the GRCh38 reference genome in FASTA format.
  • GATK_Resource: Paths to GATK resource datasets used for variant quality control.
  • external_pangenome: Path to the external pangenome graph in GBZ format.
  • par_region: Path to the pseudoautosomal region (PAR) BED file for CHM13.
  • external_assembly_list: Path to a file listing external genome assemblies used to construct the PIGA pangenome.
  • train_sample_list: Path to a tab-delimited file listing matched assemblies from training samples used for model training, with:
    • Column 1: Names of PIGA draft assemblies from training samples
    • Column 2: Names of matched high-quality assemblies from training samples, used as the truth
  • prefix: Prefix used for naming output files.
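
For illustration, a configuration might look like the sketch below. Every path, filename pattern, and the exact nesting of keys are hypothetical placeholders, not the authoritative schema; consult the template config/config.yaml shipped with the repository for the exact layout.

# Hypothetical sketch of config/config.yaml; all paths below are placeholders
samples: "config/samples.tsv"            # tab-delimited: <sample_name><TAB><sex>
sr_fastqs: "data/{sample}/{sample}.sr_R{read}.fastq.gz"   # assumed pattern; the real schema may list R1/R2 separately
lr_hifi_fastqs: "data/{sample}/{sample}.hifi.fastq.gz"
lr_zmw_fastqs: "data/{sample}/{sample}.zmw.fastq.gz"
lr_subreads_bam: "data/{sample}/{sample}.subreads.bam"
reference:
  CHM13: "ref/chm13.fasta"
  GRCh38: "ref/grch38.fasta"
GATK_Resource: "resources/gatk/"         # placeholder; the real key may expect per-dataset paths
external_pangenome: "ref/external_pangenome.gbz"
par_region: "ref/chm13_par.bed"
external_assembly_list: "config/external_assemblies.txt"
train_sample_list: "config/train_samples.tsv"   # draft_assembly_name<TAB>truth_assembly_name
prefix: "piga_cohort"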

Running the Workflow

After setting up the configuration file (config/config.yaml) with your sample data, you can run the entire PIGA workflow. There are two main ways to execute it:

Option 1: Local Execution

This method is suitable for running PIGA on a single, powerful machine. It will use the specified number of cores on the local machine.

# Dry run test
snakemake --dry-run -s Snakefile --cores 32 --configfile config/config.yaml --profile ./profile/config_local/
# Run PIGA locally.
# Recommended RAM: >300 GB (with 32 cores)
# Memory requirements scale with the number of cores utilized, allowing flexible resource allocation.
snakemake -s Snakefile --cores 32 --configfile config/config.yaml --profile ./profile/config_local/
# An interrupted run can be resumed by first unlocking the working directory (--unlock) and then rerunning with --rerun-incomplete.
snakemake --unlock -s Snakefile --cores 32 --configfile config/config.yaml --profile ./profile/config_local/
snakemake --rerun-incomplete -s Snakefile --cores 32 --configfile config/config.yaml --profile ./profile/config_local/
# Warnings about TensorRT and parse_sam_aux_fields from DeepVariant can be safely ignored.

Option 2: Cluster Execution (with Profile)

This method is designed for high-performance computing (HPC) environments and uses a job submission system (e.g., SLURM) to distribute the workload across a cluster.

# Run PIGA using a workflow profile to submit jobs to a cluster
snakemake -s Snakefile --cores 32 --jobs 32 --configfile config/config.yaml --profile ./profile/config_slurm/
# An interrupted run can be resumed by rerunning with --rerun-incomplete.
snakemake --rerun-incomplete -s Snakefile --cores 32 --jobs 32 --configfile config/config.yaml --profile ./profile/config_slurm/

Note: The provided profile is configured for the SLURM job scheduler. You can customize the cluster settings (e.g., switch to a different scheduler or change resource allocations) by editing the profile's configuration file, as sketched below.
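
In Snakemake, a profile directory holds a config.yaml whose keys mirror command-line options. The sketch below is a hypothetical example of the kind of settings you might adjust for SLURM (partition, memory, job count); all values are placeholder assumptions, and the profile/config_slurm/config.yaml shipped with the repository is the authoritative version.

# Hypothetical sketch of a Snakemake SLURM profile (profile/config_slurm/config.yaml)
jobs: 32                        # maximum number of concurrently submitted jobs
cluster: >-
  sbatch
  --partition={resources.partition}
  --cpus-per-task={threads}
  --mem={resources.mem_mb}
default-resources:
  - partition="compute"         # replace with a partition that exists on your cluster
  - mem_mb=8000
rerun-incomplete: true
keep-going: true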

Output

The main output files include:

  • c1_call_sr_snv/merged_vcf/{config['prefix']}.gatk.variant_recalibrated.filter.vcf.gz: Short-read SNV callset

  • c2_call_lr_snv/merged_vcf/{config['prefix']}.deepvariant.whatshap.beagle.vcf.gz: Long-read SNV callset

  • c3_merge_snv/merged_vcf/{config['prefix']}.consensus.merfin.vcf.gz: SNV callset combining short-read and long-read callsets

  • c4_phase_snv/sample_vcf/{sample}/{sample}.shapeit.vcf.gz: Phased SNVs of each sample

  • c5_personal_ref/sample_reference/{sample}/{sample}.personal_ref.fasta: Personalized reference of each sample

  • c6_draft_assembly/sample_assembly/{sample}/assembly/{sample}.{hap1,hap2}.fasta: Draft diploid assembly of each sample

  • c7_graph_construction/subgraph/subgraph_{id}/{config['prefix']}_subgraph_{id}.seqwish.smoothxg.gfaffix.gfa: Constructed pangenome of each subgraph

  • c7_graph_construction/subgraph/subgraph_{id}/{config['prefix']}_subgraph_{id}.seqwish.smoothxg.gfaffix.ml_filter.variant_project.gfaffix.gfa: Simplified pangenome of each subgraph

  • c7_graph_construction/graph_merge/{config['prefix']}.merge.assembly.gbz: Merged pangenome across all subgraph pangenomes

  • c8_diploid_path_infer/sample_assembly/{sample}/{sample}.{hap1,hap2}.complete_assembly.polish.clip.fasta: Final diploid assembly of each sample

Documentation

PIGA is organized into the modules listed below, each containing several commands that can be executed step by step; a detailed tutorial is provided for each module. To run a step independently when it requires intermediate files produced by other steps, copy the needed files from the test_data/example_output directory, which is downloaded by download.sh.

  • call_sr_snv: detect SNVs using short reads.
  • call_lr_snv: detect SNVs using long reads.
  • merge_snv: merge the short-read SNV callset and long-read SNV callset.
  • phase_snv: perform SNV haplotype phasing leveraging long-read and population information.
  • generate_personal_reference: generate a personalized reference by modifying the reference genome with homozygous variants genotyped from the external pangenome.
  • draft_assembly: partition long reads into haplotypes and produce draft diploid assemblies.
  • construct_pangenome: construct and refine the base-level pangenome.
  • simplify_pangenome: simplify the pangenome.
  • merge_pangenome: merge pangenome subgraphs into the final pangenome.
  • infer_diploid_path: reconstruct the final diploid assembly by inferring the diploid paths.

License

This project is released under the MIT License.
