This package is intended to be used in two ways:
- by the Snakemake workflows in the phage-seq repository to batch process raw sequencing data from Phage-seq experiments into feature tables. For this usage, the Snakemake workflows will install nbseq automatically as needed at the appropriate steps.
- interactively within Jupyter notebooks to query, calculate, and visualize the resulting data structures.
To explore the code used in our paper (Grun et al., Nat. Commun. 2024), start with the phage-seq repository. That repository also includes several demonstration notebooks and datasets to explore the functionality of this library. Follow the instructions there to create or obtain an example dataset, then return to this repository for instructions on how to install nbseq for interactive analysis.
- First, perform preprocessing of raw data using the Snakemake workflow(s) in the phage-seq repository, following instructions there. The relevant steps within the Snakemake workflows will install the nbseq package; it is not necessary to manually install the nbseq package for this step.
- Second, for interactive analysis, it is recommended to create a dedicated conda environment for use with the nbseq package.
  - If you have not already done so, install the Mamba (or Conda) package manager. I recommend using the miniforge distribution.
  - Create and activate a new conda environment for nbseq and its dependencies. You have two options:
    - Minimal installation: installs only the required core dependencies:

          wget https://github.com/caseygrun/nbseq/raw/main/environment-min.yaml
          conda env create -f environment-min.yaml
          conda activate nbseq-min

    - Full installation of all optional dependencies:

          wget https://github.com/caseygrun/nbseq/raw/main/environment.yaml
          conda env create -f environment.yaml
          conda activate nbseq

    In both cases, you do not need to clone this repository. You only need to download the .yaml file(s) using the steps above; the remaining files will be downloaded and installed by conda.
  - Install JupyterLab or the Jupyter Notebook, if you have not already; you also have two choices for this:
    - I recommend creating a separate dedicated conda environment for JupyterLab and using the nb_conda_kernels package; this will allow you to install and update JupyterLab separately from nbseq and its many dependencies; the nb_conda_kernels package lets you access the nbseq environment (and any other conda environments you create) from within JupyterLab:

          conda deactivate
          conda create -n jupyter jupyterlab nb_conda_kernels panel
          conda activate jupyter

    - Alternatively, you can install JupyterLab directly into the same environment as nbseq:

          conda install jupyterlab

  - Launch JupyterLab and follow the instructions below in "Usage":

        jupyter lab

Note: nbseq is tested only on 64-bit Linux.
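Before launching JupyterLab, it can help to confirm that the active environment matches the note above and that the core dependencies are importable. The following is a minimal sketch using only the Python standard library. The function names are hypothetical and not part of nbseq, and the import names (for example skbio for scikit-bio, biom for biom-format, yaml for pyyaml) are assumed to be the usual module names for the conda packages listed under the required dependencies.

```python
import importlib.util
import platform
import sys

# Import names for the required core dependencies (assumed; conda package
# names such as scikit-bio and pyyaml differ from their import names).
CORE_MODULES = ["anndata", "biom", "humanize", "natsort", "numpy", "pandas",
                "pysam", "yaml", "skbio", "scipy", "statsmodels"]


def is_64bit_linux():
    """Return True when running on 64-bit Linux, the only tested platform."""
    return platform.system() == "Linux" and sys.maxsize > 2**32


def missing_modules(names):
    """Return the top-level module names that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    print("64-bit Linux:", is_64bit_linux())
    print("Missing core modules:", missing_modules(CORE_MODULES))
```

Run it with the nbseq (or nbseq-min) environment activated; an empty list of missing modules suggests the core installation succeeded.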
The main entry point for interactive analysis is the nbseq.Experiment class, which loads and organizes feature tables, phylogenetic trees, metadata, and databases for a given experiment. nbseq.Experiment.from_files can load data from the directory structure created by the phage-seq Snakemake workflows. Consult the docstring ?nbseq.Experiment.from_files for a more detailed description of the options.
>>> import nbseq
>>> ex = nbseq.Experiment.from_files(
...     # skip loading the larger `aa` feature table (where each VHH amino
...     # acid sequence is a distinct column) and its phylogenetic tree; by
...     # default, the function loads both the `cdr3` and `aa` feature tables
...     ft_aa=None, tree_aa=None,
...     metadata='config/metadata_full.csv')
Loading experiment panning-extended from '/vast/palmer/home.mccleary/cng2/code/phageseq-paper/panning-extended'...
- Reading metadata from config/metadata_full.csv ...
- Reading phenotypes from config/phenotypes.csv ...
- Reading Config from config/config.yaml ...
- Using SQL database at 'sqlite:////vast/palmer/home.mccleary/cng2/code/phageseq-paper/panning-extended/intermediate/aa/asvs.db'
- Reading feature data for table 'cdr3' from results/tables/cdr3/asvs.csv (2.6 MB)...
- Reading aa feature table from results/tables/aa/feature_table.biom (350.4 MB)...
- Reading cdr3 feature table from results/tables/cdr3/feature_table.biom (8.4 MB)...
- Warning: phylogeny for space 'aa' at 'intermediate/aa/features/top_asvs/alpaca/asvs.nwk' does not exist!
- Warning: phylogeny for space 'cdr3' at 'intermediate/cdr3/features/top_asvs/alpaca/asvs.nwk' does not exist!
- Using mmseqs2 database 'aa' at 'intermediate/aa/features_db/features'
- Warning: mmseqs2 database for space 'cdr3' at 'intermediate/cdr3/features_db/features' does not exist!
- Reading enrichment model (conditional ECDF) for space cdr3 from results/tables/cdr3/enrichment/null/ecdf.pickle (307.6 kB)...
Finished in 20.29 seconds
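The log above lists the files that were read for this example experiment. A short check like the one below can report which of those files are present before calling from_files. This is only a sketch: the helper name check_experiment_files is hypothetical and not part of nbseq, and the relative paths are copied from the example output above rather than from a documented layout, so adjust them to your own workflow configuration.

```python
from pathlib import Path

# Relative paths copied from the example log output above (an assumption:
# your workflow configuration may place files elsewhere).
EXPECTED_FILES = [
    "config/metadata_full.csv",
    "config/phenotypes.csv",
    "config/config.yaml",
    "results/tables/cdr3/asvs.csv",
    "results/tables/cdr3/feature_table.biom",
    "results/tables/aa/feature_table.biom",
]


def check_experiment_files(root, expected=EXPECTED_FILES):
    """Map each expected relative path to True if it is a file under root."""
    root = Path(root)
    return {rel: (root / rel).is_file() for rel in expected}


if __name__ == "__main__":
    # Check the current working directory, e.g. the experiment directory.
    for rel, found in check_experiment_files(".").items():
        print("ok     " if found else "MISSING", rel)
```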
Displaying the Experiment object shows a summary:
>>> ex
Experiment('panning-extended') with feature spaces ['aa', 'cdr3']:
obs: ['plate.x' 'well.x' 'depth' 'expt' 'round' 'sample' 'phage_library'
'notes' 'r' 'io' 'kind' 'selection' 'replicate' 'name_full' 'name'
'well_027e' 'sel_plate_027i' 'sel_well_027i' 'selection_027j' 'plate.y'
'well.y' 'category' 'antigen' 'genotype_pair' 'gene_CS' 'gene_S'
'genotype_CS' 'background_CS' 'strain_CS' 'loc_CS' 'cond_CS' 'genotype_S'
'background_S' 'strain_S' 'loc_S' 'cond_S' 'cond_notes' 'bflm' 'swim'
'twitch' 'swarm' 'PMB-R' 'FEP-R' 'TET-R' 'CIP-R' 'CHL-R' 'GEN-R' 'ERY-R'
'IPM-R' 'cdiGMP' 'FliC' 'FliCa' 'FliCb' 'FlgEHKL' 'PilQ' 'PilA' 'PilB'
'LasA' 'LasB' 'Apr' 'XcpQ' 'ToxA' 'EstA' 'LepA' 'PlpD' 'Phz' 'Pcn' 'Pvd'
'Hcn' 'Rhl' 'T3SS' 'T6SS' 'Pel' 'Psl' 'CdrB' 'SCV' 'Mucoid' 'Alginate'
'OprM' 'OprJ' 'OprN' 'OprOP' 'OpdH' 'OprD' 'OprL' 'OprF' 'OprG' 'OprH'
'OprB' 'MexAB' 'MexCD' 'MexEF' 'MexJK' 'MexXY' 'MexGHI' 'PirA' 'Pfu' 'TonB'
'FptA' 'FpvA' 'PfeA' 'CupB5' 'CupA' 'CupB' 'CupC' 'CupD' 'LPS-LipidA-
Palmitoyl' 'L-LipidA-Ara4N' 'LPS-CPA' 'LPS-OSA' 'LPS-galU' 'LPS-rough'
'LPS' 'description']
- aa : 439 samples x 5134305 features, database: None
var: ['reads' 'nsamples']
- cdr3 : 439 samples x 40292 features, database: None
var: ['reads' 'nsamples']
SQL: sqlite:////vast/palmer/home.mccleary/cng2/code/phageseq-paper/panning-extended/intermediate/aa/asvs.db
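The summary reports the shape of each feature table, which helps explain why the aa table is described as the larger one. As a rough back-of-the-envelope sketch, the arithmetic below estimates how much memory each table would need if it were stored as a dense array. The 8 bytes per value is an assumption for illustration and the helper name is hypothetical; the actual tables are sparse, so their real footprint depends on the number of nonzero entries rather than on the full shape.

```python
def dense_bytes(n_samples, n_features, bytes_per_value=8):
    """Bytes needed to hold an n_samples x n_features table densely."""
    return n_samples * n_features * bytes_per_value


# Shapes taken from the Experiment summary above.
shapes = {"aa": (439, 5134305), "cdr3": (439, 40292)}

for space, (n_samples, n_features) in shapes.items():
    size_gb = dense_bytes(n_samples, n_features) / 1e9
    print(f"{space}: {size_gb:.2f} GB if stored densely")
```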
From there, you can access various visualizations via the experiment visualizer, ex.viz, e.g.:
>>> ex.viz.top_feature_barplot(f"expt == '027j' & FlgEHKL == 1", select_from_round=None, n=100).facet(column='selection')
Or load additional interactive visualizations using the nbseq.viz package, e.g.:
>>> import nbseq.viz.dash
>>> nbseq.viz.dash.selection_group_dashboard(
... ex, starting_phenotype='FlgEHKL',
... global_query=(
... "expt == '027j' & io == 'i' & kind == '+'")
... )
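The query strings in the two examples above are plain strings that combine conditions on metadata columns with &. When exploring many subsets, it can be convenient to assemble them programmatically. The helper below is a hypothetical sketch, not part of nbseq: it only handles equality tests on simple column names, and columns whose names contain characters such as - or . (for example PMB-R) may need quoting that this sketch does not attempt.

```python
def build_query(conditions):
    """Join column/value pairs into one query string using ' & '.

    String values are quoted via repr(); numbers are written as-is.
    """
    return " & ".join(
        f"{column} == {value!r}" for column, value in conditions.items()
    )


# Reproduce the query strings used in the two examples above.
barplot_query = build_query({"expt": "027j", "FlgEHKL": 1})
dashboard_query = build_query({"expt": "027j", "io": "i", "kind": "+"})
print(barplot_query)
print(dashboard_query)
```

The resulting strings can then be passed wherever the examples above use a literal query.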
See the phage-seq repository for additional examples: panning-minimal and panning-extended.
The nbseq package contains the following sub-modules:
- nbseq: Experiment class that collects and organizes data for one or more Phage-seq experiments. Namely, Experiment loads and organizes trees, metadata, and feature tables in multiple feature spaces (e.g. VHH, CDR3, etc.) and facilitates projecting between them. Experiment also provides an interface for interactive visualization of the entire experiment or subsets thereof.
- utils: utility functions
- asvs: process VHH sequences: calculate residue frequencies, consensus sequences, query for similar sequences, project between feature spaces (e.g. CDR3 counts to full-length amino acid sequence counts)
- ft: read and process feature tables (sparse matrices of sample x feature [i.e. VHH, CDR3, etc.])
- select: perform calculations relevant to phage display selection (e.g. enrichment, amplification bias); calculate null models of enrichment probabilities
- norm: normalize feature table data to remove the effect of variable library sizes
- ordination: perform ordination/dimensionality reduction on feature tables
- design: create design matrices for inference and machine learning
- pheno: compare and visualize phenotypes of samples
- msa: perform multiple sequence alignment with mafft
- viz: generate various visualizations: feature bar plots, rank-abundance curves (Whittaker plots), abundance curves, 2D/3D ordination plots, sequence logos, receiver operating characteristic curves, etc.
- predict: perform machine learning prediction on feature tables
- resynth: choose and resynthesize recombinant VHH genes as gene fragments. Includes routines for identifying consensus sequences, trimming and adding adapter sequences, etc.
- cloning: simulate cloning recombinant VHHs into destination vectors
- prep: utilities to aid in HTS library preparation
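To make the feature table vocabulary above concrete, here is a toy illustration in plain Python of a sample x feature count table, a relative-abundance normalization, and a simple enrichment ratio between two selection rounds. It does not use nbseq and is not necessarily how the norm or select sub-modules compute these quantities; the sample names, CDR3 labels, counts, and pseudocount are all invented for illustration.

```python
# Toy feature table: sample -> {feature: read count}. All values are invented.
counts = {
    "round_1": {"CDR3_a": 90, "CDR3_b": 10},
    "round_2": {"CDR3_a": 100, "CDR3_b": 100},
}


def relative_abundance(sample_counts):
    """Divide each count by the sample total to correct for library size."""
    total = sum(sample_counts.values())
    return {feature: n / total for feature, n in sample_counts.items()}


def enrichment(before, after, pseudocount=1e-6):
    """Ratio of relative abundance after vs. before selection, per feature."""
    features = set(before) | set(after)
    return {
        f: (after.get(f, 0.0) + pseudocount) / (before.get(f, 0.0) + pseudocount)
        for f in features
    }


abund = {sample: relative_abundance(c) for sample, c in counts.items()}
ratios = enrichment(abund["round_1"], abund["round_2"])

# CDR3_b rises from 10% to 50% of reads, so its ratio is close to 5.
for feature in sorted(ratios):
    print(feature, round(ratios[feature], 2))
```

Normalizing before comparing matters here because the two samples have different total read counts (100 vs. 200), so raw counts alone would overstate the change.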
Required and recommended dependencies can be installed using conda via the included environment.yaml file. Later versions than those listed below may work but have not been tested.
- Required dependencies:
  - anndata=0.9.2
  - biom-format=2.1.15
  - humanize=4.7.0
  - natsort=8.4.0
  - numpy=1.24.4
  - pandas=2.0.3
  - pysam=0.21.0
  - pyyaml=6.0
  - scikit-bio=0.5.9
  - scipy=1.10.0
  - statsmodels=0.14.0
- Optional dependencies:
  - For machine learning:
    - scikit-learn=1.3.0
    - scikit-optimize=0.9.0
    - xgboost=1.5.1
  - For database-accelerated feature queries:
    - connectorx=0.3.1
    - mmseqs2=14.7e284
    - sqlalchemy=2.0.19
    - sqlite=3.42.0
  - For recombinant sequence optimization and cloning:
    - dna_features_viewer=3.1.2
    - dnachisel=3.2.11
    - pydna=3.1.0
    - python-codon-tables=0.1.12
  - For processing Sanger sequencing chromatograms:
    - bioconvert=1.1.1
  - For visualizations:
    - altair==5.1.0.dev0
    - logomaker=0.8
    - matplotlib=3.7.2
    - plotly=5.16.0
    - pygments=2.16.1
    - plotnine=0.12.2
    - seaborn=0.12.2
    - patchworklib=0.6.3
    - pip: mnemonicode=1.4.5
  - For interactive "dashboard" visualizations:
    - altair-transform=0.2.0
    - bokeh=3.2.2
    - ipykernel=6.25.1
    - ipywidgets=8.1.0
    - panel=1.2.1
  - For normalization using the scran package:
    - r=4.1
    - bioconductor-biomformat=1.22.0
    - bioconductor-scran=1.22.1
  - For machine learning: