This repository contains scripts for creating and training COSMOS models, as well as scripts for evaluating model performance.
While in the project directory run
git clone https:
cd cosmos
conda env create -f environment.yml
conda activate cosmos
Alternatively, set up the environment step by step:
conda create -n cosmos python=3.10
Activate the environment:
conda activate cosmos
Installation
conda install pytorch torchvision torchaudio pytorch-cuda=11.7 -c pytorch -c nvidia
conda install -c conda-forge rdkit
conda install dgl==0.4.2
pip install -r requirements.txt
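After installation, a quick check that the main packages are visible to the active environment can save a failed training run later. The sketch below is not part of the repository; it assumes the import names `torch`, `torchvision`, `torchaudio`, `rdkit`, and `dgl` for the packages installed above, and only looks them up without importing them.

```python
import importlib.util

# Import names assumed for the packages installed in the steps above.
REQUIRED = ["torch", "torchvision", "torchaudio", "rdkit", "dgl"]


def missing_packages(names):
    """Return the names that cannot be located in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages(REQUIRED)
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All required packages were found.")
```

Run it with the `cosmos` environment activated; any name it reports as missing points to an install step that did not complete.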
Download and unzip the COSMOS models containing the model checkpoint.
Follow these instructions to obtain predictions for your proteins.
- Download the inference dataset.
- Extract the inference dataset.
- Run the corresponding model:
python inference.py -d bp_zeroshot_test -e bp_zeroshot --runs 3
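If you need to run inference for several dataset/experiment pairs, the command above can be assembled programmatically. This is a minimal sketch, not a script shipped with the repository: `inference_command` and `JOBS` are hypothetical names, and only the `-d`, `-e`, and `--runs` flags shown above are used. By default it only prints the commands.

```python
import shlex

# (dataset, experiment) pairs to run; only the pair from the example above
# is listed. Add your own pairs here.
JOBS = [("bp_zeroshot_test", "bp_zeroshot")]


def inference_command(dataset, experiment, runs=3, script="inference.py"):
    """Build the argument list for one inference run."""
    return ["python", script, "-d", dataset, "-e", experiment, "--runs", str(runs)]


if __name__ == "__main__":
    for dataset, experiment in JOBS:
        cmd = inference_command(dataset, experiment)
        # Dry run: print the command. To execute it instead, use
        # subprocess.run(cmd, check=True) from the repository root.
        print(shlex.join(cmd))
```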
We performed retrieval completion on top of ProteinKG25 and provide the result as the COSMOS dataset. You can either 1) download our prepared data, or 2) generate your own dataset.
To generate your own pre-training data, download the following raw data:
- go.obo: the structure data of Gene Ontology. See Gene Ontology for the download link and format details.
- uniprot_sprot.dat: the Swiss-Prot protein database.
- goa_uniprot_all.gaf: gene annotation data.
Once these files are downloaded, run the following script to generate the pre-training data:
python tools/gen_onto_protein_data.py
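To get a feel for what go.obo contains before running the generation script, the sketch below reads `[Term]` stanzas from OBO-formatted text and collects each term's name, namespace, and `is_a` parents. It is an illustration only and is not the logic used by `tools/gen_onto_protein_data.py`; the function name `parse_obo_terms` and the two-term sample are assumptions made for the example.

```python
# A tiny OBO excerpt used only for illustration.
SAMPLE_OBO = """\
format-version: 1.2

[Term]
id: GO:0000001
name: mitochondrion inheritance
namespace: biological_process
is_a: GO:0048308 ! organelle inheritance
is_a: GO:0048311 ! mitochondrion distribution

[Term]
id: GO:0048308
name: organelle inheritance
namespace: biological_process
"""


def parse_obo_terms(text):
    """Return {term_id: {"id", "name", "namespace", "is_a"}} for [Term] stanzas."""
    terms = {}
    current = None  # the stanza being filled, or None outside a [Term] stanza
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("["):
            # A new stanza starts; only [Term] stanzas are kept.
            current = {"is_a": []} if line == "[Term]" else None
            continue
        if current is None or ":" not in line:
            continue
        key, _, value = line.partition(":")
        value = value.strip()
        if key == "id":
            current["id"] = value
            terms[value] = current
        elif key in ("name", "namespace"):
            current[key] = value
        elif key == "is_a":
            # Drop the trailing "! human-readable name" comment.
            current["is_a"].append(value.split("!", 1)[0].strip())
    return terms


if __name__ == "__main__":
    for term_id, term in parse_obo_terms(SAMPLE_OBO).items():
        print(term_id, term["name"], term["is_a"])
```

For the real file, read it with `open("go.obo", encoding="utf-8").read()` and pass the text to the same function.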
- For retrieval completion, find relevant protein relations through StringDB and FoldSeek (Foldseek Search Server); the eligible protein relations are included in the COSMOS dataset.
- From the relations above, extract the top-K relations for each given protein/GO term.
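The top-K extraction step can be sketched as follows. This assumes each retrieved relation is a `(source, target, score)` triple where a higher score means a stronger relation. The triple layout, the identifiers `P1` to `P4`, and the function name `top_k_relations` are hypothetical and do not come from the repository.

```python
import heapq
from collections import defaultdict


def top_k_relations(relations, k):
    """Keep the k highest-scoring (target, score) pairs for each source."""
    grouped = defaultdict(list)
    for source, target, score in relations:
        grouped[source].append((target, score))
    return {
        source: heapq.nlargest(k, items, key=lambda item: item[1])
        for source, items in grouped.items()
    }


if __name__ == "__main__":
    # Made-up relations for illustration only.
    relations = [
        ("P1", "P2", 0.9),
        ("P1", "P3", 0.4),
        ("P1", "P4", 0.7),
        ("P2", "P1", 0.9),
    ]
    for source, kept in top_k_relations(relations, k=2).items():
        print(source, kept)
```

Using `heapq.nlargest` avoids sorting every source's full relation list when K is small compared with the number of candidates.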
To train the models and reproduce our results, you can directly download the dataset we provide or generate the dataset yourself.
- Download the COSMOS dataset.
- Generate the dataset:
  - Zero-shot dataset:
python utils/prepare_meta_data.py -nd cosmos -n 5000 -t GO -sl bp_zeroshot.txt -s 3
  - Few-shot dataset:
python utils/prepare_meta_data.py -nd cosmos -n 5000 -t GO -sl bp_fewshot.txt -s 3
Training example:
- Train a single COSMOS BP prediction model:
python train.py -d bp_zeroshot -e bp_zeroshot
python train.py \
    --experiment_name "experiment" --dataset "dataset" --gpu 0 \
    --disable_cuda \
    --load_model \
    --num_epochs 100 -ne 100 \
    --eval_every 3 \
    --early_stop 200 \
    --lr 0.01 \
    --clip 1000 \
    --margin 10 \
    --hop 3 \
    --max_nodes_per_hop 256 -max_h 256 \
    --batch_size 128 \
    --num_neg_samples_per_link 1 -neg 1 \
    --num_workers 8 \
    --edge_dropout 0.5 \
    --add_ht_emb -ht \
    --has_attn -attn
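Several options in the long command appear twice, once in long form and once in short form (for example `--num_epochs` and `-ne`). These look like aliases of the same option, in which case passing one of each pair is enough. The sketch below shows how such aliases behave with `argparse`. It is an assumption about how `train.py` might declare them, not a copy of its parser, and the types and defaults are illustrative.

```python
import argparse


def build_parser():
    """Hypothetical parser covering only the paired options shown above."""
    parser = argparse.ArgumentParser(description="Alias sketch (not train.py)")
    parser.add_argument("--num_epochs", "-ne", type=int, default=100)
    parser.add_argument("--max_nodes_per_hop", "-max_h", type=int, default=256)
    parser.add_argument("--num_neg_samples_per_link", "-neg", type=int, default=1)
    parser.add_argument("--add_ht_emb", "-ht", action="store_true")
    parser.add_argument("--has_attn", "-attn", action="store_true")
    return parser


if __name__ == "__main__":
    # Short and long forms fill the same attribute.
    args = build_parser().parse_args(["-ne", "50", "--has_attn"])
    print(args.num_epochs, args.has_attn)
```

When both forms of a value option are given, `argparse` keeps the one that appears last on the command line.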
The training scripts generate predictions for the test data that are used to compute evaluation metrics.
- To evaluate predictions, run the test_auc_complement.py script. Example:
python inference.py -d bp_zeroshot_test -e bp_zeroshot --runs 3
- To evaluate ranking, use the test_ranking.py script. Example:
python test_ranking.py -d bp_zeroshot_test -e bp_zeroshot
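This README does not state which metrics `test_ranking.py` reports. As background, ranking evaluation for link prediction is often summarised with mean reciprocal rank (MRR) and Hits@K. The sketch below computes both from model scores, assuming a higher score means a more likely link. The function names and the example scores are hypothetical and may not match the script's output.

```python
def rank_of_positive(positive_score, negative_scores):
    """Rank of the true link: 1 + number of negatives scored strictly higher."""
    return 1 + sum(1 for score in negative_scores if score > positive_score)


def mean_reciprocal_rank(ranks):
    """Average of 1/rank over all test links."""
    return sum(1.0 / rank for rank in ranks) / len(ranks)


def hits_at_k(ranks, k):
    """Fraction of test links whose true target is ranked within the top k."""
    return sum(1 for rank in ranks if rank <= k) / len(ranks)


if __name__ == "__main__":
    # Made-up scores: one (positive, negatives) pair per test link.
    examples = [
        (0.9, [0.1, 0.3, 0.5]),  # rank 1
        (0.4, [0.6, 0.2, 0.1]),  # rank 2
        (0.2, [0.9, 0.8, 0.7]),  # rank 4
    ]
    ranks = [rank_of_positive(pos, negs) for pos, negs in examples]
    print("ranks:", ranks)
    print("MRR:", round(mean_reciprocal_rank(ranks), 4))
    print("Hits@3:", round(hits_at_k(ranks, 3), 4))
```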
