ProtoBind-Diff: A Structure-Free Diffusion Language Model for Protein Sequence-Conditioned Ligand Design

Implementation of ProtoBind-Diff: A Structure-Free Diffusion Language Model for Protein Sequence-Conditioned Ligand Design by Lukia Mistryukova*, Vladimir Manuilov*, Konstantin Avchaciov*, and Peter O. Fedichev.

ProtoBind-Diff is a masked diffusion language model that generates target-specific, high-quality small-molecule ligands. Trained entirely without structural input, it enables structure-independent ligand design across the full proteome and matches structure-based methods in docking and Boltz-1 benchmarks.

If you have questions, feel free to open an issue or email us at lukiia.mistriukova@gero.ai, konstantin.avchaciov@gero.ai, vladimir.manuylov@gero.ai, or peter.fedichev@gero.ai.

You can also try out the model on Hugging Face Spaces.

Overview

This repository contains the evaluation toolkit and supporting data for ProtoBind-Diff. It is organized as follows:

`data/`

We selected 12 protein targets to benchmark molecular generation quality across different models.

One folder per model (pocket2mol/, targetdiff/, etc.) containing SMILES files of generated ligands.
bindingdb/ – sets of random molecules (used as reference inactive molecules).
bindingdb_active/ – sets of true active molecules.
actives_bindingdb_cl – clustered true actives; one representative per similarity cluster.
CrossDocked2020/ – cleaned PDB files and corresponding ligands for the targets.
fasta/ – FASTA sequences of the targets.
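To get a quick feel for how many ligands each folder holds, something like the sketch below may help. It assumes the SMILES files are plain text with one molecule per line and a `.smi` or `.txt` extension; neither detail is stated in this README, so adjust `suffixes` to match the actual files. The function name `count_smiles_by_folder` is hypothetical.

```python
from pathlib import Path

def count_smiles_by_folder(data_dir, suffixes=(".smi", ".txt")):
    """Count non-empty lines in candidate SMILES files, grouped by top-level folder.

    Assumption: one SMILES string per line, no header row.
    """
    counts = {}
    for folder in sorted(p for p in Path(data_dir).iterdir() if p.is_dir()):
        total = 0
        for path in folder.rglob("*"):
            if path.is_file() and path.suffix.lower() in suffixes:
                with path.open() as handle:
                    total += sum(1 for line in handle if line.strip())
        counts[folder.name] = total
    return counts

if __name__ == "__main__":
    if Path("data").is_dir():
        for name, n in count_smiles_by_folder("data").items():
            print(f"{name}: {n}")
```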

`notebooks/`

Jupyter notebooks to reproduce the paper figures: docking/Boltz-1 score distributions, interpretation of attention maps, chemical property distributions, and UMAP-based target specificity analysis. We also added allign.ipynb to show differences between canonical sequences and PDB receptor sequences.

`results/`

Raw docking and Boltz-1 ipTM score tables for each method.

`protobind_diff/`

Model code, plus training and inference scripts.

`scripts/`

Scripts to download and prepare data for training.

Usage

Setup Environment

Clone the repository:

git clone https://github.com/gero-science/ProtoBind-Diff.git

You can install the project locally in editable mode:

python -m venv protodiff_env
source protodiff_env/bin/activate    
pip install -e .

Then you'll be able to run:

protobind-infer  # inference
protobind-train  # training

Inference

This script generates potential binding molecules (in SMILES format) for a given protein sequence by computing the protein embeddings on-the-fly.

Note: The script automatically downloads the model checkpoint from Hugging Face Hub and uses the best available hardware.

Specify the protein sequence with either --fasta_file or --sequence, then run inference:

protobind-infer --fasta_file examples/input.fasta
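If you do not already have a FASTA file, the sketch below writes one and assembles the same command. The helper names `write_fasta` and `infer_command`, the record name, and the sequence are placeholders for illustration only. It writes a single record, because this README does not say how multi-record files are handled.

```python
from pathlib import Path

def write_fasta(path, name, sequence, width=60):
    """Write one FASTA record: a '>' header line, then the sequence wrapped at `width`."""
    sequence = "".join(sequence.split()).upper()
    lines = [f">{name}"]
    lines += [sequence[i:i + width] for i in range(0, len(sequence), width)]
    Path(path).write_text("\n".join(lines) + "\n")
    return Path(path)

def infer_command(fasta_file):
    """Build the argument list for the documented --fasta_file invocation."""
    return ["protobind-infer", "--fasta_file", str(fasta_file)]

if __name__ == "__main__":
    # Placeholder peptide, not a real target sequence.
    fasta = write_fasta("my_target.fasta", "my_target", "MKTAYIAKQRQISFVKSHFSRQ")
    print(" ".join(infer_command(fasta)))
```

The printed command can be run in the shell, or passed to `subprocess.run(..., check=True)` once the package is installed.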

All command-line options:

--output_dir: Specify a different output folder (default: ../outputs).
--output: Change the output filename (default: generated_smiles.txt).
--n_batches: Set the number of generation batches (default: 5).
--batch_size: Set the number of molecules generated per batch (default: 10).
--checkpoint_path: Use a local model checkpoint file.
--tokenizer_path: Use a local tokenizer file.
--model_name: Specify the ESM model for embeddings (default: esm2_t33_650M_UR50D).
--cache: Set a custom cache folder for downloads (default: ../cache).
--sampling_steps: Set the number of steps during sampling (default: 250).
--lig_max_length: Set the max length of generated molecules (default: 170).
--nucleus_p: Set the value of the nucleus sampling parameter (default: 0.9).
--eta: Set the value of the probability of remasking (default: 0.1).
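For readers unfamiliar with nucleus (top-p) sampling, the sketch below shows the generic technique behind a parameter like --nucleus_p: keep the smallest set of highest-probability tokens whose cumulative mass reaches p, renormalize, and sample from that set. It is a toy illustration over a hand-written distribution, not ProtoBind-Diff's sampler. How the model applies the filter internally is not described in this README.

```python
import random

def nucleus_filter(probs, p=0.9):
    """Return the renormalized top-p subset of a {token: probability} mapping."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    kept, mass = [], 0.0
    for token, prob in ranked:
        kept.append((token, prob))
        mass += prob
        if mass >= p:  # stop once the nucleus covers at least p of the mass
            break
    return {token: prob / mass for token, prob in kept}

def nucleus_sample(probs, p=0.9, rng=random):
    """Draw one token from the top-p subset."""
    filtered = nucleus_filter(probs, p)
    tokens = list(filtered)
    weights = [filtered[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]

if __name__ == "__main__":
    # Hypothetical next-token distribution over a few SMILES symbols.
    toy = {"C": 0.5, "N": 0.3, "O": 0.15, "F": 0.05}
    print(sorted(nucleus_filter(toy, p=0.9)))  # the low-probability tail ("F") is dropped
    print(nucleus_sample(toy, p=0.9))
```

Lower values of p restrict sampling to fewer, more likely tokens; values close to 1 keep nearly the whole distribution.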

Model Training

Training the model involves a three-step pipeline: 1) downloading the dataset, 2) pre-processing the data to generate embeddings and similarity matrices, and 3) running the training script with a configuration file.

Step 1: Download Raw Data

First, download the necessary dataset files (protein/ligand pairs, tokenizer, etc.) from Hugging Face Hub.

Run the command:

python scripts/download.py --output_dir ./data/experiments/diffusion

This will create a directory containing the raw data.csv and other necessary files.

Step 2: Pre-process Data

Next, generate protein embeddings and a molecular similarity matrix from the raw data. A single script performs both tasks. Note: This step is computationally intensive. Calculating the ESM embeddings requires a GPU and at least 64 GB of RAM (the exact amount depends on the size of the categorical_mappings.json file).

Run the command:

python scripts/esm_and_sim_matrix.py --data_dir ./data/experiments/diffusion --out_dir ./data/experiments/diffusion

The script reads the files in --data_dir and saves two new files: all_prots_*.pt (embeddings) and all_smiles_sparse_*.npz (Tanimoto matrix).
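As background on the Tanimoto matrix: the Tanimoto coefficient between two fingerprints, viewed as sets of "on" bits, is the size of their intersection divided by the size of their union. The sketch below computes it for toy bit sets and keeps only pairs above a cutoff, which is one common way such a matrix ends up sparse. The fingerprint type and any cutoff used by scripts/esm_and_sim_matrix.py are not stated here, so the bit sets, the `threshold` value, and the function names are all illustrative assumptions.

```python
def tanimoto(a, b):
    """Tanimoto coefficient of two fingerprints given as collections of set bit indices."""
    a, b = set(a), set(b)
    union = len(a | b)
    return len(a & b) / union if union else 0.0

def sparse_similarity(fingerprints, threshold=0.5):
    """Return {(i, j): similarity} for pairs i < j whose similarity meets the threshold."""
    entries = {}
    for i in range(len(fingerprints)):
        for j in range(i + 1, len(fingerprints)):
            score = tanimoto(fingerprints[i], fingerprints[j])
            if score >= threshold:
                entries[(i, j)] = score
    return entries

if __name__ == "__main__":
    # Toy fingerprints: indices of bits that are set.
    fps = [{1, 2, 3}, {2, 3, 4}, {7, 8}]
    print(sparse_similarity(fps))  # only the first two fingerprints share bits
```

The all-pairs loop is quadratic in the number of molecules, so it is only suitable for small illustrative inputs.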

Step 3: Start Training

You are ready to train your model. Training is configured using .yaml files located in the configs/ directory.

protobind-train -o ./experiment_dir --exp_dir_prefix experiment_name --yaml ./configs/masked_diffusion.yaml

This command uses the following arguments:

--yaml: Specifies the main model and data configuration file.
--output_dir (passed as -o in the command above): Defines the parent directory where all experiment results will be saved.
--exp_dir_prefix: Creates a specific folder for this run (e.g., ./experiments/my_first_run).

To tune the model, you can edit the parameters in configs/masked_diffusion.yaml. Details on all tunable hyperparameters can be found in protobind_diff/parsers.py.
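The three steps above can be chained from a small driver script. The sketch below only assembles the commands already shown in this README and prints them by default. Set `dry_run=False` to execute them in order, stopping at the first failure. The function names are hypothetical and the paths are the example values used above.

```python
import subprocess

def training_pipeline(data_dir, experiment_dir, prefix, config):
    """Return the download, pre-processing, and training commands in run order."""
    return [
        ["python", "scripts/download.py", "--output_dir", data_dir],
        ["python", "scripts/esm_and_sim_matrix.py",
         "--data_dir", data_dir, "--out_dir", data_dir],
        ["protobind-train", "-o", experiment_dir,
         "--exp_dir_prefix", prefix, "--yaml", config],
    ]

def run_pipeline(commands, dry_run=True):
    """Print each command; run it as well unless dry_run is set."""
    for command in commands:
        print(" ".join(command))
        if not dry_run:
            # check=True raises CalledProcessError if a step exits non-zero.
            subprocess.run(command, check=True)

if __name__ == "__main__":
    steps = training_pipeline(
        data_dir="./data/experiments/diffusion",
        experiment_dir="./experiment_dir",
        prefix="experiment_name",
        config="./configs/masked_diffusion.yaml",
    )
    run_pipeline(steps, dry_run=True)
```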

Citations

If you use this code or the models in your research, please cite the following paper:

@article {Mistryukova2025.06.16.659955,
	author = {Mistryukova, Lukia and Manuilov, Vladimir and Avchaciov, Konstantin and Fedichev, Peter O.},
	title = {ProtoBind-Diff: A Structure-Free Diffusion Language Model for Protein Sequence-Conditioned Ligand Design},
	year = {2025},
	journal = {bioRxiv}
}

License

The code and model weights are released under the CC BY-NC 4.0 license. See the LICENSE file for details.
