BGC-MAC and BGC-MAP: Attention-Based Models for Accurate Classification and Product Matching of Biosynthetic Gene Clusters
- Train with pre-computed ESM embedding
- Minimum VRAM: 1GB
- GPU Compatibility: Consumer-grade graphics cards are sufficient for training
- BGC-MAC Training: Can be trained using CPU only
Inference
(1) ESM Embedding Calculation:
- VRAM Requirement: 13GB for ESM2-650M
(2) Model Inference:
- VRAM Requirement: <1GB is sufficient
- Hardware Compatibility: Consumer-grade graphics cards or CPU
- CPU Inference: Fully supported for all models if you have pre-computed ESM embeddings; otherwise CPU inference is not recommended
Performance Benchmarks
Training Time on RTX 4090 (9-Ensemble Models):
- BGC-MAC: Approximately 15-20 minutes
- BGC-MAP: Approximately 2.5 hours
conda env create -f environment.yml
conda activate natural_product
# Install local package.
# Current directory should be natural_product/
pip install -e .
Training data and checkpoint files can be downloaded from Zenodo.
- Download MIBiG raw data
wget https://dl.secondarymetabolites.org/mibig/mibig_json_4.0.tar.gz
wget https://dl.secondarymetabolites.org/mibig/mibig_gbk_4.0.tar.gz
After downloading, extract the files and place them in the ./data directory. Rename the directories to exactly:
- mibig_gbk_4.0
- mibig_json_4.0
- Download pfam hmm file
wget https://ftp.ebi.ac.uk/pub/databases/Pfam/current_release/Pfam-A.hmm.gz
Extract the file and place it in the ./data directory.
Ensure the path is ./data/Pfam-A.hmm
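If a command-line `gunzip` is not available, the extraction step can be done from Python. The following is a minimal sketch using only the standard library; the helper name `gunzip` is illustrative and is not part of this repository.

```python
import gzip
import shutil
from pathlib import Path


def gunzip(src, dst):
    """Decompress the .gz file at `src` into `dst` and return `dst` as a Path."""
    src, dst = Path(src), Path(dst)
    dst.parent.mkdir(parents=True, exist_ok=True)
    with gzip.open(src, "rb") as fin, open(dst, "wb") as fout:
        # Stream in chunks so a large HMM file is never fully loaded into memory.
        shutil.copyfileobj(fin, fout)
    return dst
```

For example, `gunzip("Pfam-A.hmm.gz", "data/Pfam-A.hmm")` would write the decompressed file to the location expected above, assuming the archive was downloaded into the current directory.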
- Download ESM2 weights (see the Prerequisites part of the Usage Instructions)
- Download COCONUT_DB_absoluteSMILES.smi and place it in the ./data/natural_product directory.
- Then run:
cd data_preparation
python extract_BGCs.py
# Will generate three files:
# ./data/BGC_4.0/MAC_metadata.pkl
# ./data/BGC_4.0/BGC_gene_kind.pkl
# ./data/BGC_4.0/MAP_metadata.pkl
python pfam_annotation.py
# Will generate one file: ./data/BGC_4.0/BGC_domain_pfam.pkl
python esm2_emb_cal.py
# Will generate one file: ./data/BGC_4.0/Esm2_rep_mibig.pth
- If you skip the data preprocessing step, please download BGC_4.0.zip and place it in the ./data directory.
Before training, make sure ./data/BGC_4.0 has the following five files:
- BGC_domain_pfam.pkl
- BGC_gene_kind.pkl
- Esm2_rep_mibig.pth
- MAC_metadata.pkl
- MAP_metadata.pkl
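A quick pre-flight check can confirm that the five files are in place before starting a training run. This is a minimal sketch; the function name `missing_training_files` is hypothetical and not part of the repository.

```python
from pathlib import Path

# The five files that the instructions above say must exist in ./data/BGC_4.0.
REQUIRED_FILES = [
    "BGC_domain_pfam.pkl",
    "BGC_gene_kind.pkl",
    "Esm2_rep_mibig.pth",
    "MAC_metadata.pkl",
    "MAP_metadata.pkl",
]


def missing_training_files(data_dir="./data/BGC_4.0"):
    """Return the names of required files that are absent from `data_dir`."""
    data_dir = Path(data_dir)
    return [name for name in REQUIRED_FILES if not (data_dir / name).is_file()]
```

An empty list from `missing_training_files()` means the directory is complete; otherwise the returned names show what still needs to be generated or downloaded.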
- Download the following test files and place them at:
  - ./ckpt/BGC_MAC/test_MAC_metadata_43.pkl
  - ./ckpt/BGC_MAP/test_MAP_metadata_42.pkl
- Execute experiments:
cd experiment
# Training
python train.py --model MAC
python train.py --model MAP
# Evaluation (replace 'your_ckpt_dir_name' with actual directory)
python test.py --model MAC --ckpt your_ckpt_dir_name --mean_result
python test.py --model MAP --ckpt your_ckpt_dir_name --mean_result
# To evaluate individual models in an ensemble (without mean results):
python test.py --model MAC --ckpt your_ckpt_dir_name
python test.py --model MAP --ckpt your_ckpt_dir_name
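To illustrate the difference between the ensemble mean and individual-model evaluation, the sketch below averages per-model prediction scores element-wise. It assumes the ensemble result is a plain arithmetic mean over members; whether `test.py --mean_result` computes exactly this is an assumption, and `ensemble_mean` is a hypothetical helper.

```python
from statistics import fmean


def ensemble_mean(per_model_scores):
    """Average scores element-wise across ensemble members.

    `per_model_scores` holds one list of scores per model, all the same length.
    Assumption: the ensemble prediction is the arithmetic mean over members.
    """
    if not per_model_scores:
        raise ValueError("need at least one model's scores")
    n = len(per_model_scores[0])
    if any(len(scores) != n for scores in per_model_scores):
        raise ValueError("all models must score the same number of items")
    # zip(*...) groups the i-th score of every model together.
    return [fmean(column) for column in zip(*per_model_scores)]
```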
Run the notebook in the /visualization directory to reproduce the figures in the paper.
| Filename | Directory |
|---|---|
| MAC_test_ensemble.pkl | \ckpt\MAC_default |
| MAC_test_individual.pkl | \ckpt\MAC_default |
| MAP_test_ensemble.pkl | \ckpt\MAP_default |
| MAP_test_individual.pkl | \ckpt\MAP_default |
| antismash_annotation.zip | \data\antismash_annotation |
| deepbgc.zip | \data\deepbgc |
| NPatlas_ECFP.pkl | \data\natural_product |
| knownclusterblast_MACtest_hits.csv | \data\border |
| NPAtlas_download_2024_09.tsv | \data\natural_product |
| output.zip | \output |
| smiles.pkl | \data |
- All fungal BGCs
  - Query antiSMASH-db with {[superkingdom|Eukaryota]}.
  - Download the GenBank files and place them in \data\fungi_bgc.
- BGC border analysis
  - Query antiSMASH-db with knownclusterblast, with similarity greater than or equal to 100.
  - Download the GenBank files and place them in \data\border.
- Download pretrained ESM2 weights from https://zenodo.org/records/7566741 and place both files in ./data
- Place checkpoint files in:
  - ckpt/BGC_MAC/MAC_default/MAC_default.ckpt
  - ckpt/BGC_MAP/MAP_default/MAP_default.ckpt
Note: The filename stem (without extension) of a .ckpt file must match the name of its parent directory.
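The naming rule for checkpoints can be verified programmatically. A minimal sketch follows; the helper `ckpt_name_matches_dir` is illustrative and not part of the repository.

```python
from pathlib import Path


def ckpt_name_matches_dir(ckpt_path):
    """Return True if a .ckpt file's stem equals the name of its parent directory."""
    path = Path(ckpt_path)
    return path.suffix == ".ckpt" and path.stem == path.parent.name
```

For instance, `ckpt/BGC_MAC/MAC_default/MAC_default.ckpt` satisfies the rule, while a file named `best.ckpt` in the same directory does not.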
For BGC input, BGC-MAC and BGC-MAP accept BGC GenBank files in antiSMASH format. There is no explicit size limit on the input BGC, but longer inputs (>1,000 domains) may increase memory usage and inference time.
If you use DeepBGC or SanntiS to detect BGCs, these tools provide a JSON file that, when used as input to antiSMASH alongside the original genome file, generates an antiSMASH-annotated version of the BGCs (see https://docs.antismash.secondarymetabolites.org/sideloading/).
On the antiSMASH web server, users can check the "Upload extra annotations" box and upload the JSON file produced by the other tool alongside the original input genome file (GenBank or FASTA format). The resulting output GenBank file is compatible with BGC-MAC and BGC-MAP.

Both models output a CSV file containing the prediction score of each class or of each given BGC-product pair. For the BGC/product ranking task, users need to sort the prediction scores manually.
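Sorting the output by score can be done with the standard library alone. The sketch below is only an illustration: the column layout of the output CSV is not documented here, so the score column name is passed in by the caller, and `rank_predictions` is a hypothetical helper.

```python
import csv


def rank_predictions(csv_path, score_column, top_k=None):
    """Read a prediction CSV and return its rows sorted by descending score.

    `score_column` is the header of the column holding the prediction score.
    The actual header names in the model output are an assumption to verify.
    """
    with open(csv_path, newline="") as handle:
        rows = list(csv.DictReader(handle))
    # CSV values are read as strings, so convert to float before comparing.
    rows.sort(key=lambda row: float(row[score_column]), reverse=True)
    return rows if top_k is None else rows[:top_k]
```

Inspect the header line of the generated CSV first, then pass the matching column name as `score_column`.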
- Single BGC classification: Directly provide GBK file path
- Multiple BGC classification: Provide directory path (automatically processes all GBK files)
cd experiment
# Single BGC classification
python predict_new.py --gbk ../data/mibig_gbk_4.0/49.gbk
# Multiple BGC classification
python predict_new.py --gbk ../data/example
- Single product matching: Provide SMILES string directly
- Multiple product matching: Provide a pickle file containing a list whose length matches the number of GBK files. Each element is a sublist of the products to match for that BGC.
- BGC ranking: Provide the path of the directory containing all candidate BGCs and a SMILES string.
Example format for 4 BGCs:
Example format for 4 BGCs:
smiles = [
["CCO", "C=O"],
["C1=CC=CC=C1", "CCN"],
["O=C(O)C", "C#N"],
["CC(C)=O", "CCl"]
]
cd experiment
#Single product matching
python predict_new.py --gbk ../data/example/BGC0001178.gbk --smiles "O=C1N[C@@H](C2=CC(O3)=CC(OS(O)(=O)=O)=C2)C(N[C@@H](C(N[C@@H]45)=O)C6=CC(OC7=C(Cl)C=C(C[C@@H]1NC(C(C8=CC3=C(O)C(Cl)=C8)=O)=O)C=C7)=C(O[C@@H]9[C@H](OC%10O[C@@H](C)[C@@H](O)[C@](N)(C)C%10)[C@@H](O)[C@H](O)[C@@H](CO)O9)C(OC%11=CC=C([C@@H](O)[C@H](NC4=O)C(N[C@@H](C(O)=O)C%12=CC(O)=CC(O)=C%12C%13=CC5=CC(Cl)=C%13O)=O)C=C%11Cl)=C6)=O"
# Multiple product matching (ensure the example directory contains 4 GenBank files)
python predict_new.py --gbk ../data/example --smiles ../data/smiles.pkl
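The pickle file for multiple product matching can be written with a few lines of Python. The sketch below also checks that the outer list has one sublist per GenBank file. It assumes the files use the `.gbk` extension; the order in which `predict_new.py` pairs sublists with files is not stated here, so confirm it in the script before relying on sorted filename order. `save_product_lists` is a hypothetical helper.

```python
import pickle
from pathlib import Path


def save_product_lists(products_per_bgc, gbk_dir, out_path):
    """Validate a nested SMILES list against a GBK directory and pickle it.

    Returns the GBK filenames in sorted order so the caller can check the
    intended pairing (assumption: one sublist per file, in that order).
    """
    gbk_files = sorted(Path(gbk_dir).glob("*.gbk"))
    if len(products_per_bgc) != len(gbk_files):
        raise ValueError(
            f"{len(products_per_bgc)} sublists for {len(gbk_files)} GBK files"
        )
    for sublist in products_per_bgc:
        if not isinstance(sublist, list) or not all(isinstance(s, str) for s in sublist):
            raise ValueError("each element must be a list of SMILES strings")
    with open(out_path, "wb") as handle:
        pickle.dump(products_per_bgc, handle)
    return [f.name for f in gbk_files]
```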
- BGC/Product ranking
- BGC: Provide a single BGC GenBank file, or a directory path containing multiple BGC GenBank files
- Product: Provide a SMILES string, or a pickle file containing a list of several SMILES strings.
Example:
smiles = ["C1=CC=CC=C1", "CCN"]  # For ranking. Can be saved as a pickle file.
cd experiment
# You should replace example with your own file or directory
# Single BGC query Product candidates
python predict_new.py --gbk ../data/natural_product/example/BGC0000448.gbk --ckpt MAP_default --smiles ../data/natural_product/smiles_example.pkl --esm_cache ../data/BGC_4.0/Esm2_rep_mibig.pth --ranking
# Single product query BGC candidates
python predict_new.py --gbk ../data/example --ckpt MAP_default --smiles "C[C@H](C(=O)N[C@]1(CN(C1=O)S(=O)(=O)O)OC)NC(=O)CC[C@H](C(=O)O)N" --ranking
# Multiple BGCs query multiple product candidates
python predict_new.py --gbk ../data/example --ckpt MAP_default --smiles ../data/natural_product/smiles_example.pkl --esm_cache ../data/BGC_4.0/Esm2_rep_mibig.pth --ranking
During inference, the script computes ESM2-650M embeddings and saves them as a pkl file in the data/cache directory. When you run a large-scale prediction a second time, use --esm_cache path_to_pkl_file to load these embeddings.
Note: The cache file is a dictionary whose keys are the basenames of the GenBank files (without extension) and whose values are the corresponding ESM embeddings.
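Because the cache is keyed by GenBank basename, it is easy to see which input files would still need their embeddings computed. The sketch below takes an already-loaded cache dictionary, since how the file should be deserialized is not specified here. It assumes inputs use the `.gbk` extension, and `uncached_gbk_files` is a hypothetical helper.

```python
from pathlib import Path


def uncached_gbk_files(gbk_dir, cache):
    """List GBK filenames in `gbk_dir` whose basename is not a key of `cache`.

    `cache` is the loaded dictionary: basename without extension -> embedding.
    """
    return sorted(
        path.name
        for path in Path(gbk_dir).glob("*.gbk")
        if path.stem not in cache  # the stem drops the .gbk extension
    )
```

An empty result suggests that every BGC in the directory is covered by the cache passed via `--esm_cache`.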