TorsionalDiffusionHPC is a fork of torsional-diffusion, which adds support to run it on HPC systems using Slurm and Singularity.
For more details about torsional-diffusion we refer to the original Github repo and the paper on arXiv.
- Singularity
- Slurm
-
Clone the repository and navigate to it
git clone https://github.com/Jnelen/TorsionalDiffusionHPCcd TorsionalDiffusionHPC -
Download the singularity image (~4 GB) to the singularity directory located in the main TorsionalDiffusionHPC directory. The singularity image contains all the necessary packages and dependencies to run DiffDock correctly
wget --no-check-certificate -r "https://drive.usercontent.google.com/download?id=1Uzx7OqghIqSoNBpZ1_2V76sMvl7XXOS2&confirm=t" -O singularity/TorsionalDiffusionHPC.sifalternatively, you can build the singularity image yourself using:
singularity build singularity/TorsionalDiffusionHPC.sif singularity/TorsionalDiffusionHPC.def -
Download one of the trained models to the
workdirdirectory from this shared Drive. I set the default model to use drugs_default, so I recommend installing this one to theworkdirdirectory, however other models are supported as well. make theworkdirto download the model to:mkdir workdirdownload the drugs_default model:
wget --no-check-certificate -r "https://drive.usercontent.google.com/download?id=1Yez3v0H8trS4jAnrn8vdzt-R7TkM1L_U&confirm=t" -O workdir/drugs_default.zipunzip the model and remove the zip file:
unzip workdir/drugs_default.zip -d workdir/rm workdir/drugs_default.zip -
Run a test example to generate the required (hidden) .npy files. This only needs to happen once and should only take about 5-10 minutes.
mkdir outputpython launch_jobs.py -l data/test.csv -out output/test -j 1
I attempted to provide most of the original options implemented in torsional-diffusion, while also keeping things simple. Additionally, I added some useful features (for example compressing the results, removing salts, ...) and scripts which can make general usage easier. Here is a short overview:
--ligands LIGANDS, -l LIGANDS: The path to and sdf file or a directory of mol(2)/sdf ligand files. Csv and pkl files are also accepted as input. All of these formats are also allowed to have been compressed by gzip (.gz)--out_dir OUT_DIR, -out OUT_DIR, -o OUT_DIR: Directory where the output structures will be saved to--num_confs NUM_CONFS, -n NUM_CONFS: How many conformers to output per compound. The default value is 10--dump_pymol: Save .pdb file with denoising dynamics--pre-mmff: Run MMFF on the local structure conformer--post-mmff: Run MMFF on the final generated structures--no_energy, -ne: Skip Calculating the energies and other metrics--particle_guidance {0,1,2,3,4}, -pg {0,1,2,3,4}: Define which type of Particle Guidance you want to use:- 0: No particle guidance
- 1: Permutation invariant minimize recall error
- 2: Permutation invariant minimize precision error
- 3: Non-permutation invariant minimize recall error
- 4: Non-permutation invariant minimize precision error
--smiles_as_id, -si: Use a molecule's smile as the ID, even if the molecule has a name--compress_output, -co: Compress the output pkl files using gzip--remove_salts, -rs: Remove salts and fragments from the input molecules--random_coords, -rc: Use the "useRandomCoords=True" option when generating initial RDKit conformers (more robust, but slower)--random_seed RANDOM_SEED, --seed RANDOM_SEED: Random seed to produce (approximate) deterministic results for identical datasets.
--jobs JOBS, -j JOBS: Number of jobs to use--time TIME, -t TIME, -tj TIME: Max amount of time each job can run--queue QUEUE, -qu QUEUE: On which node to launch the jobs. The default value is the default queue for the user. Might need to be specified if there is no default queue configured--mem MEM, -m MEM: How much memory to use for each job. The default value is 4GB--gpu, -gpu, -GPU, --GPU: Use GPU resources. This will accelerate docking calculations if a compatible GPU is available--cores CORES, -c CORES: How many cores to use for each job. The default value is 1
The additional scripts are located in the scripts/ directory. Currently there are two:
- relaunchFailedJobs.py
Sometimes jobs fail or produce errors. This can especially be annoying when running a large amount of jobs. After all jobs stopped running, but not all jobs finished successfully, you run this script to automatically rerun the jobs that didn't produce a final output.
Usage:python scripts/relaunchFailedJobs.py <output_directory> - joinPkls.py
This script can join all the results from every job back together into one large (compressed) pkl. Additionally, energy csvs will also be joined if they were generated.
Usage:singularity run singularity/TorsionalDiffusionHPC.sif python scripts/joinPkls.py <output_directory>
MIT