AAVolve

AAVolve: A tool for exploring directed evolution of AAV capsids using long-read sequencing data

AAVolve is a snakemake pipeline for analysing long-read, shuffled AAV capsid data. It supports Nanopore (including R2C2), PacBio (including HiFi) and Sanger data.

Quickstart

To run with a local installation of snakemake and apptainer, clone this repository and use:

snakemake --use-apptainer --cores 1 --config read_file=<path to fastq> parent_file=<path to fasta> seq_tech=<np, np-cc, pb or sg> sample_name=<name_for_sample>

If singularity is installed instead of the newer apptainer, replace the --use-apptainer flag with --use-singularity:

snakemake --use-singularity --cores 1 --config read_file=<path to fastq> parent_file=<path to fasta> seq_tech=<np, np-cc, pb or sg> sample_name=<name_for_sample>

For Nanopore R2C2 data (np-cc), please also specify a path to a fasta file containing the splint:

snakemake --use-apptainer --cores 1 --config read_file=<path to fastq> parent_file=<path to fasta> seq_tech=<np, np-cc, pb or sg> splint_file=<path_to_fasta> sample_name=<name_for_sample>

Look for outputs in the out folder.

Setup and usage

AAVolve is a snakemake pipeline. The easist way to use AAVolve is to tell snakemake to use apptainer (previously known as singularity) to automatically supply dependencies. For this, please first install snakemake (we recommend using conda/mamba] and apptainer. Note that Apptainer is only available for Linux.

With snakemake and apptainer installed, clone this repository, and change into the cloned directory:

git clone https://github.com/szsctt/AAVolve.git
cd AAVolve

AAVolve makes use of snakemake's --config or --configfile argument to specify the input files and parameters for analysis. At a minimum, specify a file containing reads (read_file; fgastq format), a file containg parental sequences (parent_file; fasta format), and the sequencing technology used (seq_tech; np for Nanopore, np-cc for Nanopore R2C2, pb for PacBio and sg for Sanger). For Nanopore R2C2 data, a file containing the splint sequence (splint_file; fasta format) must also be provided. Other optional config parameters can be specified as outlined below ('Config options'). To analyse one file, use the --config parameter to specify inputs on the command line; to analyse multiple files it may be more convenient to specify inputs in a comma-separated file as outlined in the section below ('Config options').

snakemake also requires the user to specify the number of cores to be used with the --cores argument.

To run AAVolve using one core, for one read file and using apptainer to supply dependencies:

snakemake --use-apptainer --cores 1 --config read_file=<path to fastq> parent_file=<path to fasta> seq_tech=<np, np-cc, pb or sg> splint_file=<path_to_fasta> sample_name=<name_for_sample>

Where the splint_file argument can be omitted for non Nanopore R2C2 data.

See the snakemake documentation for more command line options, including cluster execution and profiles.

If singularity is installed instead of the newer apptainer, replace the --use-apptainer flag with --use-singularity:

snakemake --use-singularity --cores 1 --config read_file=<path to fastq> parent_file=<path to fasta> seq_tech=<np, np-cc, pb or sg> splint_file=<path_to_fasta> sample_name=<name_for_sample>

Config options

For more than one sample, specify parameters and inputs in a comma-sepearated file with one row per sample (fastq file), and the following columns:

sample_name,parent_name,reference_name,seq_tech,min_reps,read_file,parent_file,reference_file,splint_file,non_parental_freq,trim,adapter_5,adapter_3,anchors

sample_name: A name for the sample
parent_name (optional): A name for the parents of the sample
reference_name (optional): A name for the parent to be used as the reference
seq_tech: The sequencing technology used: either np for Nanopore, np-cc for Nanopore RCA/R2C2, pb or pb-hifi for PacBio, or sg for sanger sequencing (fasta format).
min_reps (optional): For RCA data, the minimum number of repeats for a read to be included (default: 3)
read_file: The path to the fastq (or fasta) file containing the data for the sample. For PacBio HiFi data, provide the consensus (ccs) reads.
parent_file: The path to the fasta file containing the parental sequences used for the shuffling experiment
reference_file (optional): The path to the fasta file containing one of the parental sequences to be used as a reference for alingment (default: the first parental sequence)
splint_file (required for RCA data): The path to the fasta file containing the splint used for circularization during RCA
non_parental_freq (optional): Identify non-parental variants seen in more than this fraction of the reads (default: 0.2)
include_non_parental(optional): Include non-parental variants if they are seen in more than non_parental_freq reads. When analysing multiple samples that share the same parent_file and reference_file, AAVolve will combine the high-frequency non-parental variants across those samples (each using its own non_parental_freq) and include the union in all samples in the group. (default: False)
non_parental_group (optional): When include_non_parental is True, set this to control which samples share non-parental variants. Samples only share variants within matching parent_file/reference_file groups, and within the same non_parental_group. Use this to run the same underlying reads with different parameters (e.g. different non_parental_freq thresholds) without cross-contaminating the non-parental variant sets between runs.
require_end_to_end_alignment (optional): If True, require reads to align end-to-end to the reference when extracting variants (i.e. full reference coverage), even when multiple parents are provided. This can reduce the number of reads retained; leave as False to only require coverage from the first to last parental variant. When include_non_parental is enabled and high-frequency non-parental variants fall outside the parental window, AAVolve will emit a warning in the sample report. (default: False)
group_vars (optional): During assignment of parents, group variants and assign parent with lowest hamming distance to read. If False, variants are considered individually (default: True)
Note: group_vars and include_non_parental cannot both be True, because grouped parent-assignment assumes each variant group can be explained by parental haplotypes; including non-parental variants in grouping would produce ambiguous/NA-heavy assignments.
group_vars_dist (optional): When grouping variants, combine non-adjacent variants that are at most this far apart (defult: 4)
max_group_distance (optional): When grouping variants, assign a parent if there are at most this fraction of the variants that differ between a parent and the read. For example, in a group of 6 variants, if this parameter is set to 0.2, a parent that has one variant that differs from the read will still be assigned to the group. Note that if group_vars is False, variants must match a parent otherwise they will be set to NA, regardless of the value for this parameter (default: 0.2).
minimap2_params (optional): Additional parameters to pass to minimap2 during alignment. If a preset (-x), end bonus (--end-bonus) or mismatch penalty (-B) are not specified, the defaults -x map-hifi, -B 1.5 and --end-bonus 5 will be used. (default: -x map-hifi -B 1.5 --end-bonus 5)
trim (optional): Set to True to enable adapter trimming for a sample. If enabled, both adapter_5 and adapter_3 must be provided. Trimming is never performed for parent samples. (default: False)
adapter_5 (required if trim is True): The 5' adapter sequence for trimming.
adapter_3 (required if trim is True): The 3' adapter sequence for trimming.
anchors (optional): Positive integer specifying the length of random 5' and 3' anchors to prepend/append to both reads and references prior to alignment. Omit or set to 0 to disable anchoring (default: 0).

To provide this file to snakemake:

snakemake --use-apptainer --cores 1 --config samples=<path to csv>

If you have issues during rendering ipynb notebookes, try adding --apptainer-args="--cleanenv" (or --singularity-args="--cleanenv").

Note that if the data is outside of the AAVolve directory, you will also need to add a bind-mount to the apptainer args: --apptainer-args="--cleanenv -B /path/to/data"

Trimming

C3POa often leaves pieces of the splint sequence at the ends of the consensus reads. Use trimming to remove these - it's useful to first run without trimming and do a multiple sequence alignment of the highest-count sequences with some of the parents to identify the adapter sequences to trim.

To help determine if trimming is necessary, a multiple sequence alignment is created containing the reference sequence and the first 200 reads in out/qc/trimming/<sample_name>.pretrim.mafft.fasta. If there are extra bases at the start and end of the reads (before and after the reference sequence), it may be beneficial to trim these. Use the first 10 or so bases before and after the reference sequence as the trimming sequences. If trimming is enabled, a post-trimming MSA is also created at out/qc/trimming/<sample_name>.mafft.fasta to confirm trimming worked as expected.

Anchors

Short random anchors can stabilise alignments for highly similar parental sequences. When anchors is set to a positive integer, AAVolve generates reproducible 5' and 3' anchor sequences of the specified length (stored in out/anchors/{sample}.fasta) and applies them to both the reads and the reference prior to alignment. Variant positions are reported relative to the original (unanchored) reference coordinates, and anchored FASTA/FASTQ files are written to out/anchors/reads/ and out/anchors/references/.

Outputs

AAVolve produces several outputs, which will appear in the out folder and may be useful depending on the library to be analysed. Note that most outputs are tabular, but are compressed with gzip and will require decompression before opening in a spreadsheet reader (e.g. Microsoft Excel).

Overview

Summary of results: out/reports/sample_reports/<sample_name>_report.html
Summary of results grouped by parent/reference: out/reports/group_reports/<pair_id>_report.html
Number of reads at each stage of processing: out/qc/<sample_name>_read-counts.tsv
Trimming QC MSAs: out/qc/trimming/<sample_name>.pretrim.mafft.fasta and (if trimming enabled) out/qc/trimming/<sample_name>.mafft.fasta
Generated anchor sequences: out/anchors/<sample_name>.fasta
Anchored reads and references used for alignment (when anchors enabled): out/anchors/reads/<sample_name>.*, out/anchors/references/<sample_name>.fasta

Unique sequences

Several files contain the unique sequences observed in the dataset, after combining and counting sequences that are identical when error-corrected (that is, only considering positions in the read that differ between parental sequences).

Counts of unique sequences at the amino acid level: out/corrected/counts/<sample_name>_aa-seq-counts.tsv.gz
Counts of unique sequences at the nucleotide level: out/corrected/counts/<sample_name>_nt-seq-counts.tsv.gz
Counts of unique combinations of parents: out/parents/counts/<sample_name>_parent-counts.tsv.gz
Distance matrix of highest-count 1000 unique sequences at amino acid level: out/corrected/dmat/<sample_name>_first_aa-seq.tsv.gz and out/corrected/dmat/<sample_name>__first_aa-seq.png
Distance matrix of randomly selected 1000 unique sequences at amino acid level: out/corrected/dmat/<sample_name>__first_aa-seq.tsv.gz and out/corrected/dmat/<sample_name>_first_aa-seq.png
Distance matrix of highest-count 1000 unique sequences at nucleotide level: out/corrected/dmat/<sample_name>_first_nt-seq.tsv.gz and out/corrected/dmat/<sample_name>__first_aa-seq.png
Distance matrix of randomly selected 1000 sequences at nucleotide level: out/corrected/dmat/<sample_name>_first_nt-seq.tsv.gz and out/corrected/dmat/<sample_name>_first_aa-seq.png

Variants

Variants are positions in the read where parents or reads differ in their sequence from the reference. AAVolve generally only considers variants that originated from a parental sequence, or are very frequent in the dataset, but all variants are available for analysis if required. Each variant has a position in the reference (which is the 0-based coordinate of the variant) and a type (sub[stitution], del[etion], ins[ertion]).

Observed frequency of all variants in reads: out/variants/frequency/<sample_name>_all.tsv.gz
Observed frequency of parental variants: out/variants/frequency/<sample_name>_high.tsv.gz
Observed frequency of high-frequency variants (those with a frequency above non_parental_freq parameter): out/variants/frequency/<sample_name>_high.tsv.gz
Observed frequency of parents at each variant position: out/prarents/freqs/<sample_name>_assigned-parents.tsv.gz
Variants in each read (sequence): out/pivot/<sample_name>_seq.tsv.gz
Assigned parents at each variant position in each read: out/parents/assigned/<sample_name>_assigned-parents.tsv.gz

Name		Name	Last commit message	Last commit date
Latest commit History 356 Commits
.github		.github
aavolve		aavolve
deps		deps
rules		rules
tests		tests
.gitignore		.gitignore
.pytest.ini		.pytest.ini
LICENSE		LICENSE
NOTICE		NOTICE
README.md		README.md
Snakefile		Snakefile
setup.py		setup.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

AAVolve

Quickstart

Setup and usage

Config options

Trimming

Anchors

Outputs

Overview

Unique sequences

Variants

About

Uh oh!

Releases 2

Packages

Contributors 3

Uh oh!

Languages

License

szsctt/AAVolve

Folders and files

Latest commit

History

Repository files navigation

AAVolve

Quickstart

Setup and usage

Config options

Trimming

Anchors

Outputs

Overview

Unique sequences

Variants

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases 2

Packages 0

Contributors 3

Uh oh!

Languages

Packages