Synthetic Data for More Accurate Deep Learning Models in Molecular Science: A Test Case of Protein-Ligand Binding Affinity Prediction
Deep learning models are data-hungry, and synthetic (artificial) data has been shown to be invaluable when data availability is low. While this has been demonstrated in certain technology areas, adopting such an approach is new in machine learning (ML) applications in chemistry, except for some pre-training tasks. In drug discovery, predicting binding energy between proteins and ligands is crucial. Many ML-based studies have been proposed to predict protein-ligand binding affinity using existing experimental data. However, it has been shown that these models suffer from inherent biases. Recent efforts have resulted in PLAS-20k, a synthetic dataset of multiple protein-ligand complex (PLC) conformations generated using molecular dynamics (MD) simulation as a viable option to be used along with existing experimental data to improve binding affinity prediction. For the binding affinity prediction task, we employ Pafnucy, a deep convolutional neural network, and we propose using multiple structures for each PLC from PLAS-20k to train it. We compare four different statistical and ML-based result-aggregation techniques. This work demonstrates the utility of dynamic datasets in enhancing binding affinity predictions, laying the foundations for future improvements in predicting similar protein properties by using synthetic datasets and more sophisticated models and methods. We propose that synthetic datasets from physics-based methods can significantly help develop more accurate data-driven methods.
This repository contains the code and workflows used in our study. The overall pipeline is divided into two major parts:
- Training Pafnucy on protein-ligand complexes (PLCs)
- Aggregation and analysis of predicted values across multiple conformational frames of each PLC
Before training, the dataset must be prepared in HDF format.
- Place the raw PLC structure files inside the plas20k/ directory.
  - An example is provided in this repository.
- For the full PLAS-20k dataset, visit: PLAS-20k dataset
Run the following command to generate HDF files:
python3 ./pafnucy/plas20k_custom_hdf.py \
--num_frames 30 \
--total_frames 200 \
--selection_type uniform
- --num_frames: Number of frames to select per PLC
- --total_frames: Total number of frames available
- --selection_type: Frame selection method
  - Options: uniform, random, starting, clustered
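The non-clustered selection modes can be illustrated with a short sketch. This is not the code in plas20k_custom_hdf.py: the function name select_frames, its signature, the 0-based indexing, and the reading of "starting" as "the first num_frames frames" are all assumptions made for illustration.

```python
import random


def select_frames(total_frames, num_frames, selection_type="uniform", seed=0):
    """Return sorted 0-based frame indices (hypothetical helper, not the repo's code)."""
    if not 0 < num_frames <= total_frames:
        raise ValueError("num_frames must be between 1 and total_frames")
    if selection_type == "uniform":
        # Evenly spaced indices across the trajectory, using integer arithmetic.
        return [(i * total_frames) // num_frames for i in range(num_frames)]
    if selection_type == "random":
        # Sample without replacement; a fixed seed keeps the choice reproducible.
        rng = random.Random(seed)
        return sorted(rng.sample(range(total_frames), num_frames))
    if selection_type == "starting":
        # Assumed meaning: take the first num_frames frames of the trajectory.
        return list(range(num_frames))
    raise ValueError(f"unsupported selection_type: {selection_type}")


# Same sizes as the command above: 30 frames out of 200.
print(select_frames(200, 30, "uniform"))
```

The clustered mode is left out here because it depends on the RMSD-based workflow described next.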
If --selection_type clustered is chosen, clustering-based frame selection must be performed.
Steps:
- Compute RMSD Matrices
  Run the following (requires VMD software):
  ./clustering/get_rmsd.sh
  This generates RMSD matrices in the clustering/rmsd/ subdirectory.
- Select Frames via Clustering
  Run:
  python3 ./clustering/cluster.py \
  --num_frames 25 \
  --total_num_frames 200
  - Requires list-pdbids.txt (list of PDBIDs for PLCs of interest).
  - Outputs a file like selected_points_<num_frames>.txt in the clustering/ directory.
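To give a sense of how representative frames can be picked from a pairwise RMSD matrix, here is a minimal k-medoids sketch. The algorithm used by cluster.py is not described in this README, so k-medoids is a stand-in chosen for illustration; the function name select_medoids and the toy distance matrix are likewise hypothetical.

```python
def select_medoids(rmsd, k, max_iter=100):
    """Pick k representative frame indices from a symmetric RMSD matrix (list of lists)."""
    n = len(rmsd)
    if not 0 < k <= n:
        raise ValueError("k must be between 1 and the number of frames")
    # Deterministic start: evenly spaced frames serve as the initial medoids.
    medoids = [(i * n) // k for i in range(k)]
    for _ in range(max_iter):
        # Assignment step: attach each frame to its nearest medoid.
        clusters = {m: [] for m in medoids}
        for i in range(n):
            nearest = min(medoids, key=lambda m: rmsd[i][m])
            clusters[nearest].append(i)
        # Update step: the new medoid minimises total RMSD to its cluster members.
        # An empty cluster (possible with identical frames) keeps its old medoid.
        new_medoids = [
            min(members, key=lambda c: sum(rmsd[c][j] for j in members)) if members else m
            for m, members in clusters.items()
        ]
        if sorted(new_medoids) == sorted(medoids):
            break
        medoids = new_medoids
    return sorted(medoids)


# Toy example: six "frames" on a line, with |x_i - x_j| standing in for RMSD.
points = [0, 1, 2, 10, 11, 12]
toy_rmsd = [[abs(a - b) for b in points] for a in points]
print(select_medoids(toy_rmsd, k=2))
```

Each returned index is the frame closest, in total RMSD, to the other members of its cluster, which is one common way to define a cluster representative.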
Once HDF files are ready, training can be run using:
python3 ./pafnucy/training.py \
--input_dir ./hdf_30_uniform/ \
--output_prefix ./plas20k_results_30_uniform/output
- The training script is adapted from the original Pafnucy repository.
- After completion, results are saved in output-predictions.csv, containing predicted binding affinities for each frame in -log(K_d) units.
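For readers who want to relate -log(K_d) values to binding free energies, the sketch below applies the standard relation ΔG = RT ln(K_d). It assumes K_d is expressed in molar units and a temperature of 298.15 K; the temperature and constants used in this repository's notebooks are not stated here and may differ. The function name pkd_to_kcal_per_mol is hypothetical.

```python
import math

R_KCAL = 1.987204e-3  # gas constant in kcal/(mol*K)


def pkd_to_kcal_per_mol(pkd, temperature=298.15):
    """Convert -log10(K_d) to a binding free energy via dG = RT ln(K_d)."""
    # ln(K_d) = -pkd * ln(10), so dG = -RT * ln(10) * pkd.
    return -R_KCAL * temperature * math.log(10) * pkd


# A stronger binder (larger -log(K_d)) gives a more negative free energy.
for pkd in (4.0, 6.0, 9.0):
    print(pkd, round(pkd_to_kcal_per_mol(pkd), 2))
```

Under these assumptions each unit of -log(K_d) corresponds to roughly 1.36 kcal/mol.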
All aggregation and downstream analysis are in the aggregation-analysis/ directory.
- Place output-predictions.csv inside aggregation-analysis/.
- Open and run final-predictions.ipynb
  - Generates results.csv
  - Converts predictions from -log(K_d) units to kcal/mol units
- Run the feed-forward neural network (FFNN) aggregator:
  python3 ./aggregation-analysis/neural-network-aggregation.py
- Run the final analysis notebook:
  jupyter notebook aggregation-analysis/final-analysis.ipynb
  - Produces results and plots
  - Compares different aggregation methods
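As a rough illustration of the statistical side of the aggregation step, the sketch below collapses per-frame predictions into one value per PLC. The mean and median are used only as examples of simple aggregators; the exact methods compared in the notebooks, the column layout of output-predictions.csv, and the PDB IDs and numbers shown are assumptions.

```python
from collections import defaultdict
from statistics import mean, median


def aggregate_by_complex(frame_predictions):
    """Group (pdb_id, predicted_value) pairs and summarise each complex."""
    grouped = defaultdict(list)
    for pdb_id, value in frame_predictions:
        grouped[pdb_id].append(value)
    return {
        pdb_id: {"mean": mean(values), "median": median(values), "n_frames": len(values)}
        for pdb_id, values in grouped.items()
    }


# Toy per-frame predictions in -log(K_d) units (made-up numbers, hypothetical IDs).
frames = [("1abc", 6.0), ("1abc", 7.0), ("1abc", 8.0), ("2xyz", 5.0), ("2xyz", 5.5)]
summary = aggregate_by_complex(frames)
for pdb_id, stats in summary.items():
    print(pdb_id, stats)
```

A learned aggregator such as the FFNN would replace the fixed mean or median with a model that takes the per-frame predictions of a complex as input.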
- plas20k/ → Example PLC frames in .mol2 format
- clustering/rmsd/ → Example RMSD matrix file
- clustering/selected_points_30.txt → Example clustered frame selection output
We provide two conda environment files to set up the required dependencies:
- train_env.yml → Environment for the training part (pafnucy/ and clustering/)
- aggregate_env.yml → Environment for the aggregation and analysis part (aggregation-analysis/)
To create the environments:
# For training
conda env create -f train_env.yml
conda activate train_env
# For aggregation & analysis
conda env create -f aggregate_env.yml
conda activate aggregate_env