`nbseq`: Python library for processing nanobody (VHH) sequencing data.

This package is intended to be used in two ways:

- by the Snakemake workflows in the phage-seq repository, to batch-process raw sequencing data from Phage-seq experiments into feature tables. For this usage, the Snakemake workflows install nbseq automatically at the appropriate steps.
- interactively within Jupyter notebooks, to query, calculate, and visualize the resulting data structures.

To explore the code used in our paper (Grun et al., Nat. Commun. 2024), start with the phage-seq repository. That repository also includes several demonstration notebooks and datasets to explore the functionality of this library. Follow the instructions there to create or obtain an example dataset, then return to this repository for instructions on how to install nbseq for interactive analysis.

Installation and usage

First, perform preprocessing of raw data using the Snakemake workflow(s) in the phage-seq repository, following instructions there. The relevant steps within the Snakemake workflows will install the nbseq package; it is not necessary to manually install the nbseq package for this step.
Second, for interactive analysis, it is recommended to create a dedicated conda environment for use with the nbseq package.
1. If you have not already done so, install the Mamba (or Conda) package manager. I recommend using the miniforge distribution.
2. Create and activate a new conda environment for nbseq and its dependencies. You have two options:
  - Minimal installation: installs only the required core dependencies:
```
  wget https://github.com/caseygrun/nbseq/raw/main/environment-min.yaml
  conda env create -f environment-min.yaml
  conda activate nbseq-min
```
  - Full installation of all optional dependencies:
```
  wget https://github.com/caseygrun/nbseq/raw/main/environment.yaml
  conda env create -f environment.yaml
  conda activate nbseq
```
  In both cases, you do not need to clone this repository. You only need to download the .yaml file(s) using the steps above; the remaining files will be downloaded and installed by conda.
3. Install JupyterLab or Jupyter Notebook, if you have not already. You have two choices:
  - I recommend creating a separate, dedicated conda environment for JupyterLab and using the nb_conda_kernels package. This lets you install and update JupyterLab separately from nbseq and its many dependencies, while nb_conda_kernels lets you access the nbseq environment (and any other conda environments you create) from within JupyterLab:
```
  conda deactivate 
  conda create -n jupyter jupyterlab nb_conda_kernels panel
  conda activate jupyter
```
  - Alternatively, you can install JupyterLab directly into the same environment as nbseq:
```
  conda install jupyterlab
```
4. Launch JupyterLab and follow the instructions in "Usage" below:
```
jupyter lab
```

Note: nbseq is tested only on 64-bit Linux.

Usage

The main entry point for interactive analysis is the nbseq.Experiment class, which loads and organizes feature tables, phylogenetic trees, metadata, and databases for a given experiment. nbseq.Experiment.from_files can load data from the directory structure created by the phage-seq Snakemake workflows. Consult the docstring ?nbseq.Experiment.from_files for a more detailed description of the options.

>>> import nbseq
>>> ex = nbseq.Experiment.from_files(
...     # skip loading the larger `aa` feature table (i.e. each VHH amino acid
...     # sequence is a distinct column) and its phylogenetic tree; by default,
...     # the function loads both the `cdr3` and `aa` feature tables
...     ft_aa=None, tree_aa=None,
...     metadata='config/metadata_full.csv')
Loading experiment panning-extended from '/vast/palmer/home.mccleary/cng2/code/phageseq-paper/panning-extended'...
- Reading metadata from config/metadata_full.csv ...
- Reading phenotypes from config/phenotypes.csv ...
- Reading Config from config/config.yaml ...
- Using SQL database at 'sqlite:////vast/palmer/home.mccleary/cng2/code/phageseq-paper/panning-extended/intermediate/aa/asvs.db'
- Reading feature data for table 'cdr3' from results/tables/cdr3/asvs.csv (2.6 MB)...
- Reading aa feature table from results/tables/aa/feature_table.biom (350.4 MB)...
- Reading cdr3 feature table from results/tables/cdr3/feature_table.biom (8.4 MB)...
- Warning: phylogeny for space 'aa' at 'intermediate/aa/features/top_asvs/alpaca/asvs.nwk' does not exist!
- Warning: phylogeny for space 'cdr3' at 'intermediate/cdr3/features/top_asvs/alpaca/asvs.nwk' does not exist!
- Using mmseqs2 database 'aa' at 'intermediate/aa/features_db/features'
- Warning: mmseqs2 database for space 'cdr3' at 'intermediate/cdr3/features_db/features' does not exist!
- Reading enrichment model (conditional ECDF) for space cdr3 from results/tables/cdr3/enrichment/null/ecdf.pickle (307.6 kB)...
Finished in 20.29 seconds
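
Before calling `from_files`, it can help to confirm that the files named in the log above are present. The following is a minimal sketch using only the standard library. The helper name `missing_inputs` is hypothetical (it is not part of nbseq), and the relative paths are simply those that appear in the example output; they may differ depending on your workflow configuration.

```python
from pathlib import Path

# Relative paths copied from the example log output above; adjust as needed.
EXPECTED_INPUTS = [
    "config/metadata_full.csv",
    "config/phenotypes.csv",
    "config/config.yaml",
    "results/tables/cdr3/asvs.csv",
    "results/tables/cdr3/feature_table.biom",
]


def missing_inputs(root, relative_paths=EXPECTED_INPUTS):
    """Return the relative paths that do not exist under `root` (hypothetical helper)."""
    root = Path(root)
    return [p for p in relative_paths if not (root / p).exists()]


if __name__ == "__main__":
    # Check the current working directory, e.g. the experiment directory
    for path in missing_inputs("."):
        print(f"missing: {path}")
```

Run this from the experiment directory; an empty result means every listed file was found.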

Displaying the Experiment object shows a summary:

>>> ex
Experiment('panning-extended') with feature spaces ['aa', 'cdr3']:
obs: ['plate.x' 'well.x' 'depth' 'expt' 'round' 'sample' 'phage_library'
    'notes' 'r' 'io' 'kind' 'selection' 'replicate' 'name_full' 'name'
    'well_027e' 'sel_plate_027i' 'sel_well_027i' 'selection_027j' 'plate.y'
    'well.y' 'category' 'antigen' 'genotype_pair' 'gene_CS' 'gene_S'
    'genotype_CS' 'background_CS' 'strain_CS' 'loc_CS' 'cond_CS' 'genotype_S'
    'background_S' 'strain_S' 'loc_S' 'cond_S' 'cond_notes' 'bflm' 'swim'
    'twitch' 'swarm' 'PMB-R' 'FEP-R' 'TET-R' 'CIP-R' 'CHL-R' 'GEN-R' 'ERY-R'
    'IPM-R' 'cdiGMP' 'FliC' 'FliCa' 'FliCb' 'FlgEHKL' 'PilQ' 'PilA' 'PilB'
    'LasA' 'LasB' 'Apr' 'XcpQ' 'ToxA' 'EstA' 'LepA' 'PlpD' 'Phz' 'Pcn' 'Pvd'
    'Hcn' 'Rhl' 'T3SS' 'T6SS' 'Pel' 'Psl' 'CdrB' 'SCV' 'Mucoid' 'Alginate'
    'OprM' 'OprJ' 'OprN' 'OprOP' 'OpdH' 'OprD' 'OprL' 'OprF' 'OprG' 'OprH'
    'OprB' 'MexAB' 'MexCD' 'MexEF' 'MexJK' 'MexXY' 'MexGHI' 'PirA' 'Pfu'  'TonB'
    'FptA' 'FpvA' 'PfeA' 'CupB5' 'CupA' 'CupB' 'CupC' 'CupD'  'LPS-LipidA-
    Palmitoyl' 'L-LipidA-Ara4N' 'LPS-CPA' 'LPS-OSA' 'LPS-galU'  'LPS-rough'
    'LPS' 'description']
- aa      : 439 samples x 5134305 features, database: None
var: ['reads' 'nsamples']
- cdr3    : 439 samples x 40292 features, database: None
var: ['reads' 'nsamples']
SQL: sqlite:////vast/palmer/home.mccleary/cng2/code/phageseq-paper/panning-extended/intermediate/aa/asvs.db

From there, you can access various visualizations via the experiment visualizer, ex.viz, e.g.:

>>> ex.viz.top_feature_barplot(f"expt == '027j' & FlgEHKL == 1", select_from_round=None, n=100).facet(column='selection')
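
The first argument above is a query string over the sample metadata columns listed under `obs`. Its syntax resembles a pandas `DataFrame.query` expression. Assuming it is evaluated that way (an assumption, not confirmed here), the sketch below shows how such an expression selects rows from a small metadata table. The sample names and values are made up for illustration.

```python
import pandas as pd

# Hypothetical sample metadata; column names are taken from the `obs` listing above
obs = pd.DataFrame(
    {
        "expt": ["027j", "027j", "027i"],
        "FlgEHKL": [1, 0, 1],
        "selection": ["s1", "s2", "s3"],
    },
    index=["sample_a", "sample_b", "sample_c"],
)

# In DataFrame.query, comparisons bind more tightly than `&`,
# so no parentheses are needed around each condition
selected = obs.query("expt == '027j' & FlgEHKL == 1")
print(selected.index.tolist())
```

Only rows satisfying both conditions are kept, which is why a query like this narrows a plot to one experiment and one phenotype.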

Or load additional interactive visualizations using the nbseq.viz package, e.g.:

>>> import nbseq.viz.dash
>>> nbseq.viz.dash.selection_group_dashboard(
...     ex, starting_phenotype='FlgEHKL', 
...     global_query=(
...         "expt == '027j' & io == 'i' & kind == '+'")
... )

See the phage-seq repository for additional examples: panning-minimal and panning-extended.

Package organization

The nbseq package contains the following sub-modules:

- nbseq: Experiment class that collects and organizes data for one or more Phage-seq experiments. Namely, Experiment loads and organizes trees, metadata, and feature tables in multiple feature spaces (e.g. VHH, CDR3, etc.) and facilitates projecting between them. Experiment also provides an interface for interactive visualization of the entire experiment or subsets thereof.
- utils: utility functions
- asvs: process VHH sequences: calculate residue frequencies, consensus sequences, query for similar sequences, project between feature spaces (e.g. CDR3 counts to full-length amino acid sequence counts)
- ft: read and process feature tables (sparse matrices of sample x feature [i.e. VHH, CDR3, etc.])
- select: perform calculations relevant to phage display selection (e.g. enrichment, amplification bias); calculate null models of enrichment probabilities
- norm: normalize feature table data to remove the effect of variable library sizes
- ordination: perform ordination/dimensionality reduction on feature tables
- design: create design matrices for inference and machine learning
- pheno: compare and visualize phenotypes of samples
- msa: perform multiple sequence alignment with mafft
- viz: generate various visualizations: feature bar plots, rank-abundance curves (Whittaker plots), abundance curves, 2D/3D ordination plots, sequence logos, receiver operating characteristic curves, etc.
- predict: perform machine learning prediction on feature tables
- resynth: choose and resynthesize recombinant VHH genes as gene fragments. Includes routines for identifying consensus sequences, trimming and adding adapter sequences, etc.
- cloning: simulate cloning recombinant VHHs into destination vectors
- prep: utilities to aid in HTS library preparation
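
To make two of the terms above concrete, here is a minimal, standard-library sketch of scaling counts to relative abundances (so that samples with different library sizes are comparable) and computing a simple enrichment ratio between two selection rounds. This is not the nbseq implementation; the function names, the pseudocount, and the toy counts are all illustrative assumptions.

```python
def relative_abundance(counts):
    """Scale one sample's feature counts so they sum to 1."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sample has no reads")
    return {feature: n / total for feature, n in counts.items()}


def enrichment(before, after, pseudocount=1e-6):
    """Ratio of relative abundance after vs. before a selection round.

    `pseudocount` (an assumption of this sketch) avoids division by zero
    for features absent from the earlier round.
    """
    ra_before = relative_abundance(before)
    ra_after = relative_abundance(after)
    features = set(ra_before) | set(ra_after)
    return {
        f: (ra_after.get(f, 0.0) + pseudocount) / (ra_before.get(f, 0.0) + pseudocount)
        for f in features
    }


# Toy counts for two hypothetical CDR3 features in consecutive rounds
round_2 = {"CDR3_A": 10, "CDR3_B": 90}
round_3 = {"CDR3_A": 400, "CDR3_B": 600}
ratios = enrichment(round_2, round_3)
print(max(ratios, key=ratios.get))
```

Because both rounds are first converted to proportions, the ratio reflects a change in share of the library rather than the tenfold difference in total reads between the two toy samples.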

Dependencies

Required and recommended dependencies can be installed using conda via the included environment.yaml file.

The versions listed below are the ones nbseq has been tested with. Later versions may work but have not been tested.

Required dependencies:
- anndata=0.9.2
- biom-format=2.1.15
- humanize=4.7.0
- natsort=8.4.0
- numpy=1.24.4
- pandas=2.0.3
- pysam=0.21.0
- pyyaml=6.0
- scikit-bio=0.5.9
- scipy=1.10.0
- statsmodels=0.14.0
Optional dependencies:
- For machine learning:
  - scikit-learn=1.3.0
  - scikit-optimize=0.9.0
  - xgboost=1.5.1
- For database-accelerated feature queries:
  - connectorx=0.3.1
  - mmseqs2=14.7e284
  - sqlalchemy=2.0.19
  - sqlite=3.42.0
- For recombinant sequence optimization and cloning:
  - dna_features_viewer=3.1.2
  - dnachisel=3.2.11
  - pydna=3.1.0
  - python-codon-tables=0.1.12
- For processing Sanger sequencing chromatograms:
  - bioconvert=1.1.1
- For visualizations:
  - altair==5.1.0.dev0
  - logomaker=0.8
  - matplotlib=3.7.2
  - plotly=5.16.0
  - pygments=2.16.1
  - plotnine=0.12.2
  - seaborn=0.12.2
  - patchworklib=0.6.3
  - pip: mnemonicode=1.4.5
- For interactive "dashboard" visualizations:
  - altair-transform=0.2.0
  - bokeh=3.2.2
  - ipykernel=6.25.1
  - ipywidgets=8.1.0
  - panel=1.2.1
- For normalization using scran package:
  - r=4.1
  - bioconductor-biomformat=1.22.0
  - bioconductor-scran=1.22.1
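
Since only the versions listed above have been tested, it may be useful to compare them against what is actually installed in the active environment. The sketch below uses the standard-library `importlib.metadata` module. The helper names are hypothetical, only a few of the pins are included for brevity, and it assumes each package exposes its distribution metadata under the name shown.

```python
from importlib import metadata

# A few of the pins listed above (distribution name -> tested version)
TESTED_VERSIONS = {
    "numpy": "1.24.4",
    "pandas": "2.0.3",
    "scipy": "1.10.0",
    "statsmodels": "0.14.0",
}


def installed_version(name):
    """Return the installed version of a distribution, or None if absent."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def version_report(pins, lookup=installed_version):
    """Map each distribution to (tested, installed) where the two differ."""
    report = {}
    for name, tested in pins.items():
        found = lookup(name)
        if found != tested:
            report[name] = (tested, found)
    return report


if __name__ == "__main__":
    for name, (tested, found) in version_report(TESTED_VERSIONS).items():
        print(f"{name}: tested with {tested}, found {found or 'not installed'}")
```

A mismatch reported here is not necessarily a problem; it only flags packages that differ from the tested configuration.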
