Synthetic Data for More Accurate Deep Learning Models in Molecular Science: A Test Case of Protein-Ligand Binding Affinity Prediction

Deep learning models are data-hungry, and synthetic (artificial) data has proven invaluable when real data is scarce. Although this has been demonstrated in several technology areas, the approach is still new to machine learning (ML) applications in chemistry, apart from some pre-training tasks. In drug discovery, predicting the binding energy between proteins and ligands is crucial. Many ML-based models have been proposed to predict protein-ligand binding affinity from existing experimental data, but these models have been shown to suffer from inherent biases. Recent efforts have produced PLAS-20k, a synthetic dataset of multiple protein-ligand complex (PLC) conformations generated using molecular dynamics (MD) simulation, which can be used alongside existing experimental data to improve binding affinity prediction. For the binding affinity prediction task, we employ Pafnucy, a deep convolutional neural network, and propose training it on multiple structures for each PLC from PLAS-20k. We compare four statistical and ML-based techniques for aggregating the per-structure results. This work demonstrates the utility of dynamic datasets in enhancing binding affinity predictions, laying the foundation for future improvements in predicting similar protein properties with synthetic datasets and more sophisticated models and methods. We propose that synthetic datasets from physics-based methods can significantly help in developing more accurate data-driven methods.

Repository Overview

This repository contains the code and workflows used in our study. The overall pipeline is divided into two major parts:

- Training Pafnucy on protein-ligand complexes (PLCs)
- Aggregation and analysis of predicted values across multiple conformational frames of each PLC

1. Data Preparation

Before training, the dataset must be prepared in HDF format.

- Place the raw PLC structure files inside the plas20k/ directory. An example is provided in this repository.
- For the full set of complexes, see the PLAS-20k dataset.

Run the following command to generate HDF files:

```
python3 ./pafnucy/plas20k_custom_hdf.py \
  --num_frames 30 \
  --total_frames 200 \
  --selection_type uniform
```

Available Arguments

- --num_frames: Number of frames to select per PLC
- --total_frames: Total number of frames available per PLC
- --selection_type: Frame selection method
  - Options: uniform, random, starting, clustered
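
To make the selection options more concrete, here is a minimal sketch of how frame indices might be chosen for three of them. It is not the code in plas20k_custom_hdf.py: the function name `select_frames`, the `seed` parameter, and the exact meaning given to each option (evenly spaced frames for uniform, sampling without replacement for random, the first frames of the trajectory for starting) are assumptions made for illustration. The clustered option is left out because it depends on the clustering step described in the next section.

```python
import random


def select_frames(num_frames, total_frames, selection_type, seed=0):
    """Return sorted 0-based frame indices (hypothetical helper, not the repo's code)."""
    if not 0 < num_frames <= total_frames:
        raise ValueError("num_frames must be between 1 and total_frames")
    if selection_type == "uniform":
        # Assumed meaning: evenly spaced frames across the whole trajectory.
        return [i * total_frames // num_frames for i in range(num_frames)]
    if selection_type == "random":
        # Assumed meaning: sample without replacement; the seed keeps runs reproducible.
        return sorted(random.Random(seed).sample(range(total_frames), num_frames))
    if selection_type == "starting":
        # Assumed meaning: the first num_frames frames of the trajectory.
        return list(range(num_frames))
    raise ValueError(f"unsupported selection_type: {selection_type}")


print(select_frames(5, 200, "uniform"))  # [0, 40, 80, 120, 160]
```

With the values from the command above (30 of 200 frames), the uniform branch of this sketch would pick roughly every seventh frame.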

2. Frame Selection (Clustering Option)

If --selection_type clustered is chosen, clustering-based frame selection must be performed.

Steps:

1. Compute RMSD matrices. Run the following (requires VMD):
```
./clustering/get_rmsd.sh
```
This generates RMSD matrices in the clustering/rmsd/ subdirectory.

2. Select frames via clustering. Run:
```
python3 ./clustering/cluster.py \
  --num_frames 25 \
  --total_num_frames 200
```
- Requires list-pdbids.txt (a list of PDB IDs for the PLCs of interest).
- Outputs a file such as selected_points_<num_frames>.txt in the clustering/ directory.
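
The idea of picking representative frames from a pairwise RMSD matrix can be sketched as follows. The algorithm used by cluster.py is not described in this README, so this example substitutes a simple greedy farthest-point (k-center) selection as a stand-in: it starts from the medoid (the frame with the smallest total RMSD to all others) and repeatedly adds the frame farthest from everything already chosen. The function name `select_representative_frames` and the plain nested-list matrix format are assumptions.

```python
def select_representative_frames(rmsd, num_frames):
    """Greedy farthest-point selection on a symmetric RMSD matrix.

    rmsd is a square list of lists; returns sorted 0-based frame indices.
    Illustrative stand-in only, not the method implemented in cluster.py.
    """
    n = len(rmsd)
    if not 0 < num_frames <= n:
        raise ValueError("num_frames must be between 1 and the number of frames")
    # Start from the medoid: the frame closest, in total, to all other frames.
    first = min(range(n), key=lambda i: sum(rmsd[i]))
    selected = [first]
    # min_dist[i] is the RMSD from frame i to its nearest selected frame.
    min_dist = list(rmsd[first])
    while len(selected) < num_frames:
        candidates = [i for i in range(n) if i not in selected]
        # Add the frame that is least well represented so far.
        nxt = max(candidates, key=lambda i: min_dist[i])
        selected.append(nxt)
        min_dist = [min(d, r) for d, r in zip(min_dist, rmsd[nxt])]
    return sorted(selected)


# Toy example: "frames" are points on a line, RMSD is their absolute distance.
coords = [0, 1, 2, 10, 11, 20]
toy_rmsd = [[abs(a - b) for b in coords] for a in coords]
print(select_representative_frames(toy_rmsd, 3))
```

A selection of this kind spreads the chosen frames over distinct conformations instead of sampling them by simulation time.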

3. Training Pafnucy

Once HDF files are ready, training can be run using:

```
python3 ./pafnucy/training.py \
  --input_dir ./hdf_30_uniform/ \
  --output_prefix ./plas20k_results_30_uniform/output
```

The training script is adapted from the original Pafnucy repository.
After completion, results are saved as output-predictions.csv in ./plas20k_results_30_uniform/ (the file name follows --output_prefix). The file contains the predicted binding affinity of each frame in -log(K_d) units.
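
For reference, a value in -log(K_d) units relates to a binding free energy through ΔG = RT ln(K_d) = -RT ln(10) · (-log K_d), with K_d expressed relative to a 1 M standard state. The sketch below applies this relation; the temperature of 298.15 K and the function name `pk_to_delta_g` are assumptions, and the conversion used in this repository's notebooks may use different constants.

```python
import math

R_KCAL = 1.987204e-3  # gas constant in kcal/(mol*K)


def pk_to_delta_g(pk, temperature=298.15):
    """Convert -log10(K_d) to a binding free energy in kcal/mol.

    Uses Delta G = RT ln(K_d) = -RT ln(10) * pK; the temperature is an assumption.
    """
    return -R_KCAL * temperature * math.log(10) * pk


print(round(pk_to_delta_g(6.0), 2))
```

At this temperature each unit of -log(K_d) corresponds to about 1.36 kcal/mol, so stronger binders (larger -log(K_d)) map to more negative free energies.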

4. Aggregation & Analysis

All aggregation and downstream analysis are in the aggregation-analysis/ directory.

Steps:

1. Place output-predictions.csv inside aggregation-analysis/.
2. Open and run final-predictions.ipynb. This:
   - Generates results.csv
   - Converts predictions from -log(K_d) units to kcal/mol
3. Run the feed-forward neural network (FFNN) aggregator:
```
python3 ./aggregation-analysis/neural-network-aggregation.py
```
4. Run the final analysis notebook:
```
jupyter notebook aggregation-analysis/final-analysis.ipynb
```
   - Produces results and plots
   - Compares different aggregation methods
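
The statistical side of the aggregation step can be pictured as collapsing the per-frame predictions of each PLC into a single value and then scoring that value against a reference. The sketch below shows mean and median aggregation plus an RMSE comparison. Which aggregation techniques the notebooks actually implement is not listed in this README, so mean and median are illustrative choices; the function names, the (pdbid, value) row layout, the PDB ID 1abc, and all numbers are toy assumptions.

```python
import math
from collections import defaultdict
from statistics import mean, median


def aggregate_by_complex(rows):
    """Group per-frame predictions by PDB ID and reduce each group.

    rows is an iterable of (pdbid, prediction) pairs (assumed layout).
    Returns {pdbid: {"mean": ..., "median": ...}}.
    """
    grouped = defaultdict(list)
    for pdbid, value in rows:
        grouped[pdbid].append(float(value))
    return {p: {"mean": mean(v), "median": median(v)} for p, v in grouped.items()}


def rmse(predicted, reference):
    """Root-mean-square error over the PDB IDs present in both dicts."""
    shared = sorted(set(predicted) & set(reference))
    return math.sqrt(mean([(predicted[p] - reference[p]) ** 2 for p in shared]))


# Toy per-frame predictions in kcal/mol (made-up values for illustration).
rows = [("16pk", -8.0), ("16pk", -9.0), ("16pk", -10.0),
        ("1abc", -6.0), ("1abc", -7.0)]
aggregated = aggregate_by_complex(rows)
mean_pred = {p: stats["mean"] for p, stats in aggregated.items()}
toy_reference = {"16pk", "1abc"} and {"16pk": -8.0, "1abc": -6.5}
print(aggregated)
print(rmse(mean_pred, toy_reference))
```

Comparing the same error metric across the aggregated columns is one simple way to see whether a given aggregation method improves on single-frame predictions.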

5. Example Files

- plas20k/ → Example PLC frames in .mol2 format
- clustering/rmsd/ → Example RMSD matrix file
- clustering/selected_points_30.txt → Example clustered frame selection output

Requirements

We provide two conda environment files to set up the required dependencies:

- train_env.yml → Environment for the training part (pafnucy/ and clustering/)
- aggregate_env.yml → Environment for the aggregation and analysis part (aggregation-analysis/)

To create the environments:

```
# For training
conda env create -f train_env.yml
conda activate train_env

# For aggregation & analysis
conda env create -f aggregate_env.yml
conda activate aggregate_env
```
