The purpose of this repository is to collect useful scripts which mainly use RDKit. Contributions are welcome!
Some scripts may require further dependencies.
- The `read_input.py` script contains the function `read_input`. It reads molecules from SMI, SDF, SDF.GZ and PKL files (PKL files hold pickled molecules as tuples of mol and mol_title) and from STDIN (SMI and SDF formats are supported), and it returns tuples of (mol, mol_title). It is a generator, so it can be used to process large collections of molecules. I advise using this function if you do not need other data from the input files.
- The `_template.py` file can be used as a template for new scripts. Please do not change the names of the input, output, ncpu and verbose arguments. This helps keep command line arguments consistent across scripts.
- Add help messages to your scripts.
- Ideally, scripts should be able to communicate through STDIN and STDOUT so that they can be combined with pipes. I implemented this in `gen_stereo_rdkit.py` and `gen_conf_rdkit.py`.
- All scripts can contain errors, so use them at your own risk. If you find a mistake, please create an issue and we will fix it. We constantly revise old scripts and fix errors, because every mistake found is the penultimate one.
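
The conventions in the list above (shared argument names, help messages, STDIN/STDOUT support, generator-based reading) can be sketched as follows. This is only an illustration, not the contents of `_template.py` or `read_input.py`: the short flags `-i`, `-o`, `-c`, `-v` and the names `build_parser`, `read_smi` and `main` are assumptions, and to stay dependency-free the reader yields `(smiles, title)` pairs from SMILES text instead of RDKit Mol objects.

```python
import argparse
import sys


def build_parser():
    # Argument names follow the convention above: input, output, ncpu, verbose.
    # Short flags and defaults are illustrative assumptions.
    parser = argparse.ArgumentParser(
        description='Copy SMILES records from input to output.')
    parser.add_argument('-i', '--input', default=None,
                        help='input SMILES file; STDIN is read if omitted.')
    parser.add_argument('-o', '--output', default=None,
                        help='output SMILES file; STDOUT is used if omitted.')
    parser.add_argument('-c', '--ncpu', type=int, default=1,
                        help='number of CPUs (unused in this sketch).')
    parser.add_argument('-v', '--verbose', action='store_true',
                        help='report progress to STDERR.')
    return parser


def read_smi(lines):
    # Generator: yields (smiles, title) one record at a time, so large
    # inputs are never held in memory. Blank lines are skipped and a
    # missing title falls back to the SMILES string itself.
    for line in lines:
        fields = line.split()
        if not fields:
            continue
        title = fields[1] if len(fields) > 1 else fields[0]
        yield fields[0], title


def main(argv=None):
    args = build_parser().parse_args(argv)
    fin = open(args.input) if args.input else sys.stdin
    fout = open(args.output, 'w') if args.output else sys.stdout
    try:
        for n, (smi, title) in enumerate(read_smi(fin), 1):
            fout.write(f'{smi}\t{title}\n')
            if args.verbose:
                # Progress goes to STDERR so it never pollutes piped output.
                sys.stderr.write(f'{n} records written\n')
    finally:
        if fin is not sys.stdin:
            fin.close()
        if fout is not sys.stdout:
            fout.close()
```

Calling `main()` with no `--input` and no `--output` makes the script read from STDIN and write to STDOUT, which is what allows it to sit in the middle of a shell pipeline.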

| Script | Description |
|---|---|
| `add_prefix` | Add a prefix to molecule names in SDF file. |
| `extractsdf` | Extract molecule names and field values from input SDF. |
| `extract_mol_by_name` | Extract molecules by name (partial name matching) to new SDF file. |
| `insert_sdf` | Add data from a text file as additional fields to input SDF file. |
| `remove_dupl_by_field` | Remove entries from SDF file having duplicated mol title or field value. |
| `rename_mols` | Identify identical entries (conformers) and rename consistently. |
| `sdf_field2title` | Insert field values into molecular title (or SMILES, or sequential titles). |
| `sdf_title2field` | Insert molecular title into a given SDF field. |
| `strip_blank_lines` | Remove empty lines in multi-line field values in input SDF. |

| Script | Description |
|---|---|
| `cansmi` | Return canonical SMILES of input molecules. |
| `frags2mols` | Save disconnected components as individual molecules with suffix in name. |
| `molchemaxon2pdb` | Convert molecules to separate PDB files using RDKit & ChemAxon. |
| `mols2pdb` | Convert molecules (SMI/SDF) to PDB, adding hydrogens and conformers. |
| `pkl2sdf` | Convert PKL to SDF (e.g. conformers generated by gen_conf_rdkit). |
| `sdf2mols` | Split SDF into multiple MOL files. |
| `sdf2pkl` | Convert SDF to multi-conformer PKL (requires sequential mol titles). |
| `smi2sdf` | Convert SMILES to SDF including extra fields if present. |
| `split_pdb` | Split PDB by chains and save to separate PDB files. |

Manipulate Mol objects (calculate properties, generate conformers/stereoisomers, filter compounds, etc.):

| Script | Description |
|---|---|
| `add_h` | Add hydrogens to molecules. |
| `calc_center_rdkit` | Calculate geometric center of atoms. |
| `compare_charged_centers` | Get SMILES patterns of charged centers in two sets of molecules. |
| `count_undefined_stereocenters` | Count undefined stereocenters and print names + counts. |
| `discard_compounds_rdkit` | Remove multi-component & non-organic molecules. |
| `draw_mols` | Return PNG images of molecules. |
| `filter_conf` | Filter conformers by RMS value. |
| `filter_conf_adv` | Select representative conformers using clustering and advanced features. |
| `gen_conf_rdkit` | Generate conformers. |
| `gen_stereo_rdkit` | Enumerate stereoisomers (tetrahedral & double bond). |
| `gen_stereo_rdkit_native` | Enumerate stereoisomers using RDKit’s built-in function. |
| `get_map` | Calculate UMAP/t-SNE coordinates for input structures. |
| `get_mol_center` | Return geometric center of molecule. |
| `get_substr` | Filter molecules by SMARTS (supports multiple patterns & negative matches). |
| `get_total_charge` | Calculate total formal charge. |
| `keep_largest` | Keep largest fragment by heavy atom count. |
| `mirror_mols` | Generate mirrored 3D structures (enantiomers). |
| `murcko` | Return Murcko scaffolds ignoring stereochemistry. |
| `neutralize` | Neutralize structures. unipka_protonate3.py |
| `physchem_calc` | Calculate physicochemical properties (MW, logP, TPSA, QED, etc.). |
| `pmapper_descriptors` | Calculate 3D pharmacophore descriptors (with pmapper). |
| `remove_stereo` | Remove stereoconfiguration from all centers. |
| `remove_dupl_rdkit` | Remove duplicates via InChIKey comparison. |
| `rmsd_rdkit` | Calculate RMSD (MCS if atom matching fails, with symmetry checks). |
| `sanitize_rdkit` | Remove molecules with sanitization errors + annotate stereocenters, etc. |
| `sphere_exclusion` | Select diverse subset of compounds. |
| `test_pains` | Return list of molecules matching PAINS. |

| Script | Description |
|---|---|
| `binning` | Take a table with values and return binned values based on thresholds. |
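
As an illustration of what binning by thresholds means, here is a minimal sketch. It is not the code of the `binning` script: the function name `bin_values`, the integer bin labels, and the rule that a value equal to a threshold falls into the upper bin are all assumptions made for this example.

```python
from bisect import bisect_right


def bin_values(values, thresholds):
    # With k thresholds there are k + 1 bins, labelled 0..k.
    # Thresholds are sorted here so the caller may pass them in any order.
    # A value equal to a threshold goes to the upper bin (assumption).
    ordered = sorted(thresholds)
    return [bisect_right(ordered, v) for v in values]


print(bin_values([4.2, 6.0, 7.5, 9.1], thresholds=[6, 8]))  # [0, 1, 1, 2]
```

With thresholds 6 and 8, values below 6 get bin 0, values from 6 up to but not including 8 get bin 1, and values of 8 or more get bin 2.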