A Sequence Similarity Network Approach for Fast Detection of Horizontal Gene Transfer Events
SimNet-HGT is a scalable computational framework for detecting horizontal gene transfer (HGT) events using sequence similarity networks (SSNs). By combining phylogenetic distance and sequence similarity with a ranking-based normalization, our method prioritizes high-similarity connections between evolutionarily distant lineages, a hallmark of horizontal gene transfer.
Horizontal gene transfer is a primary driver of microbial evolution, enabling the rapid spread of crucial traits such as antibiotic resistance and virulence factors. Traditional HGT detection methods based on gene tree–species tree reconciliation suffer from prohibitive
SimNet-HGT overcomes these limitations by:
- Constructing Sequence Similarity Networks: Using MMseqs2 for rapid all-against-all sequence alignment with $O(n)$ time complexity
- Building Bipartite Graph Representations: Linking genomes to connected components (gene families) for efficient modularity analysis
- Applying Ranking-Based Scoring: Weighting phylogenetic distance by sequence similarity, normalized by similarity rank
- Statistical Validation: Using bootstrap analysis for principled identification of high-confidence HGT candidates
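To make the ranking-based scoring step more concrete, here is a minimal sketch of one plausible reading of it. The formula `score = phylo_distance * similarity / rank`, the `Edge` class, and the `rank_scores` function are illustrative assumptions, not the formula or API used by SimNet-HGT; the genome names and values are toy data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Edge:
    genome_a: str
    genome_b: str
    similarity: float      # sequence similarity in [0, 1]
    phylo_distance: float  # distance between the two genomes in the species tree


def rank_scores(edges):
    """Return (edge, score) pairs sorted by descending score.

    Assumed formula (hypothetical, not taken from the paper):
        score = phylo_distance * similarity / similarity_rank
    where rank 1 is the most similar pair.
    """
    by_similarity = sorted(edges, key=lambda e: e.similarity, reverse=True)
    scored = [
        (edge, edge.phylo_distance * edge.similarity / rank)
        for rank, edge in enumerate(by_similarity, start=1)
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy data: a close pair, and two pairs spanning distant lineages.
edges = [
    Edge("Bacterium_A", "Bacterium_B", 0.95, 0.1),
    Edge("Bacterium_A", "Archaeon_C", 0.90, 2.0),
    Edge("Bacterium_B", "Eukaryote_D", 0.60, 2.5),
]
ranked = rank_scores(edges)
top_edge, top_score = ranked[0]
```

Under this assumed formula, the highly similar pair from distant lineages (`Bacterium_A` and `Archaeon_C`) ranks first, while the highly similar but closely related pair ranks last.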
The framework achieves linear time complexity with respect to the number of input sequences, enabling analysis of datasets comprising thousands of taxa and millions of gene/protein sequences.
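As a rough illustration of why the network steps can scale to many sequences, the sketch below groups sequences into connected components with union-find and then links each genome to the components it contributes to. It assumes the similarity hits are already available as sequence-ID pairs; the function names, sequence IDs, and genome labels are hypothetical and do not reflect the repository's internal code.

```python
from collections import defaultdict


def find(parent, x):
    """Return the representative of x, compressing the path as we go."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x


def build_bipartite(hits, genome_of):
    """Return a set of (genome, family_root) edges.

    hits      -- iterable of (seq_a, seq_b) pairs that passed the similarity filters
    genome_of -- dict mapping every sequence ID to its genome
    """
    parent = {seq: seq for seq in genome_of}
    for a, b in hits:
        root_a, root_b = find(parent, a), find(parent, b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two components

    # Each connected component is treated as one gene family.
    families = defaultdict(set)
    for seq, genome in genome_of.items():
        families[find(parent, seq)].add(genome)

    return {(genome, root) for root, genomes in families.items() for genome in genomes}


# Toy data: four sequences from three genomes, two similarity hits.
genome_of = {"s1": "G1", "s2": "G2", "s3": "G3", "s4": "G1"}
hits = [("s1", "s2"), ("s3", "s4")]
bipartite_edges = build_bipartite(hits, genome_of)
family_ids = {root for _, root in bipartite_edges}
```

Each hit is processed once and each union-find operation is close to constant time, so this step grows roughly in proportion to the number of sequences and hits.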
- Python 3.13+
- MMseqs2 — For sequence similarity search (installation guide)
# Clone the repository
git clone https://github.com/mrbakhtyari/simnet.git
cd simnet
# Install dependencies with uv
uv sync
If uv is not available, a standard virtual environment with pip also works.
python -m venv .venv
# activate the virtual environment
python -m pip install -e .
To run the analysis on a dataset:
- Prepare your dataset directory: Create a root directory for your dataset (e.g., datasets/my_dataset). Inside, create a 00_input folder containing two files:
  - inputfile: A text file containing all protein sequences.
    - Format:
      - Line 1: <count> <length>
      - Subsequent lines: <organism_name> <protein_sequence>
  - tree.newick: A standard Newick format phylogenetic tree containing the organisms.
    - Note: Organism names in the tree must match (or be substrings of) the names in inputfile.
- Run the pipeline:
  python scripts/run_pipeline.py datasets/my_dataset
  Optional Arguments:
  - --min_coverage: Minimum alignment coverage
  - --min_identity: Minimum sequence identity
- Output: Results will be generated in the my_dataset folder, organized by step:
  - 06_hgt/hgt_scores_original_names.tsv: Final ranked HGT candidates.
  - 06_hgt/high_confidence_hgt_original_names.tsv: High confidence subset (if applicable).
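Before running the pipeline, it can help to check that the input follows the layout described in the first step. The sketch below is a hypothetical standalone helper, not part of the repository. It assumes whitespace separates the organism name from the sequence, and that tree leaf names have already been extracted into a list. The sequences shown are toy data.

```python
def parse_inputfile(text):
    """Parse the '<count> <length>' header and the sequence records."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if not lines:
        raise ValueError("inputfile is empty")
    count, length = (int(token) for token in lines[0].split())

    records = {}
    for line in lines[1:]:
        # Assumption: the first whitespace-delimited token is the organism name.
        name, sequence = line.split(maxsplit=1)
        records[name] = "".join(sequence.split())

    if len(records) != count:
        raise ValueError(f"header declares {count} sequences, found {len(records)}")
    return length, records


def unmatched_tree_names(tree_names, records):
    """Return tree names that are not a substring of any inputfile name."""
    return [t for t in tree_names if not any(t in name for name in records)]


# Toy example with made-up sequences.
example = """2 8
Treponema_pallidum MKVLAAGI
Pyrococcus_horikoshii MKILSAGV
"""
length, records = parse_inputfile(example)
missing = unmatched_tree_names(["Treponema", "Borrelia"], records)
```

Here `missing` lists the tree names that have no counterpart in the sequence file, which is the mismatch the note about organism names warns against.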
This example demonstrates how to reproduce the results for the aminoacyl-tRNA synthetases (AARS) dataset reported in the paper, comprising 32 organisms spanning Bacteria, Archaea, and Eukarya (Woese et al., 2000).
The dataset is located in datasets/aminoacyl/00_input/.
- Species Tree (tree.newick):
- Protein Sequences (inputfile): Contains 32 aminoacyl-tRNA synthetase sequences in Phylip format (header: 32 171).
Run the pipeline with the specific parameters used in our analysis:
python scripts/run_pipeline.py datasets/aminoacyl
Note: This command uses the default parameters defined in the pipeline, which correspond exactly to the reported results:
- Min Coverage: 0.8
- Min Identity: 0.59
- Rank Threshold: 0.5
- Bootstrap Replicates: 10,000 (used for threshold calculation)
- Target Quantile: 95th percentile
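The bootstrap replicates and target quantile listed above could combine into a significance threshold in several ways. The sketch below shows one of them and is an assumption, not the pipeline's actual procedure: it resamples the scores with replacement, takes the empirical quantile of each replicate, and averages those estimates. The function name and the toy scores are hypothetical.

```python
import random
import statistics


def bootstrap_threshold(scores, replicates=10_000, quantile=0.95, seed=0):
    """Average the empirical quantile over bootstrap resamples of the scores."""
    rng = random.Random(seed)  # fixed seed so the result is reproducible
    n = len(scores)
    index = min(n - 1, int(quantile * n))  # position of the quantile in a sorted sample
    estimates = []
    for _ in range(replicates):
        sample = sorted(rng.choices(scores, k=n))  # resample with replacement
        estimates.append(sample[index])
    return statistics.fmean(estimates)


# Toy scores; real values would come from the HGT scoring step.
scores = [0.05, 0.1, 0.12, 0.2, 0.25, 0.3, 0.9, 1.4]
threshold = bootstrap_threshold(scores, replicates=1000)
candidates = [s for s in scores if s > threshold]
```

Scores above `threshold` would then be treated as high-confidence candidates under this assumed procedure.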
- Successfully recovered known HGT events between Treponema pallidum–Pyrococcus horikoshii and Borrelia burgdorferi–Pyrococcus horikoshii
- Identified 11 high-confidence HGT candidates exceeding the bootstrap-derived significance threshold
- Discovered novel transfer candidates among bacterial lineages
This section will be updated upon paper acceptance.
- Mohammadreza Bakhtyari — Computer Science, Université du Québec à Montréal, Canada
- Nadia Tahiri — Computer Science, Université de Sherbrooke, Canada
- F. Guillaume Blanchet — Biological Science, Université de Sherbrooke, Canada
- Vladimir Makarenkov — Computer Science, Université du Québec à Montréal, Canada