- `model.trainer.py`: all training functions are implemented in the `PreMode_trainer` class; use `data_distributed_parallel_gpu` to assign and start tasks
- `model.model.py`: the `create_model` function builds the model (representation + output)
- `model.module.representation.py` and `model.module.output.py`: implement the representation and output models
- `model.module.utils.py`: search for `loss_fn` to find the loss function implementations
- `utils.configs.py`: mainly implements how to split the data and generate the train, valid, and test sets (org file)
- `data.Data.py`: implements the dataset (how to build the graph data to be trained from the raw data)
- `data.files`: stores precomputed features for the model
  - use `generate_feature/esm.inference.py` to generate ESM features; provide a csv with `uniprotID` and `sequence` columns; output is stored in `data.files/esm.files`
  - download AF2 structures into `data.files/af2.files`
  - generate MSAs in `data.files/MSA`
- `parse_input_table`: code to preprocess the input data file (multiprocessing is done at the variant level instead of the wt-sequence level, so it can be slower than expected)
- `generate_feature`: code to generate the features needed by the model
- `data`, `model`, `utils`: the main code for the model
- `scripts`: model config files and scripts that help generate the config files
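As a concrete illustration of the feature-generation input, here is a minimal sketch of preparing the csv that `generate_feature/esm.inference.py` expects, with the `uniprotID` and `sequence` columns named above (the ID and sequence values are toy placeholders):

```python
import csv

# Sketch: build the input table expected by generate_feature/esm.inference.py,
# which needs "uniprotID" and "sequence" columns.
# The uniprotID/sequence values below are toy placeholders.
rows = [
    {"uniprotID": "P04637", "sequence": "MEEPQSDPSV"},
]
with open("esm_input.csv", "w", newline="") as fh:
    writer = csv.DictWriter(fh, fieldnames=["uniprotID", "sequence"])
    writer.writeheader()
    writer.writerows(rows)
```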
path to the config file
Can be chosen from:
- `train`
- `continue_train`: continue from the previous checkpoint
- `test`
- `train_and_test`: the default; first train, then run test
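The four modes above could be dispatched as in this hypothetical sketch (not the repo's code; the function and stage names are assumptions):

```python
def run_stages(mode: str) -> list[str]:
    # Hypothetical sketch: map each documented mode to the stages it runs.
    stages = {
        "train": ["train"],
        "continue_train": ["load_checkpoint", "train"],  # resume from checkpoint
        "test": ["test"],
        "train_and_test": ["train", "test"],  # default: first train, then test
    }
    if mode not in stages:
        raise ValueError(f"unknown mode: {mode!r}")
    return stages[mode]
```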
Which GPU to use (only relevant in single-GPU training; with multiple GPUs, training starts from GPU 0). This also specifies the GPU when testing.
Choose from the `build_output_model` function in `model.module.output.py`. For a regression task, change `BinaryClassification` to `Regression`.
DDP stands for DistributedDataParallel; this is needed to define the training data file for each GPU.
Naming format: `prefix.[0-3].csv`
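The per-GPU shard names follow directly from the documented format; a small sketch (the function name is hypothetical):

```python
def ddp_shard_paths(prefix: str, n_gpus: int) -> list[str]:
    # One training csv per GPU rank, following the documented
    # naming format prefix.[0-3].csv.
    return [f"{prefix}.{rank}.csv" for rank in range(n_gpus)]
```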
In `utils.configs.py`, see the functions whose names start with `make_splits_train_val`. Used to split the train and valid datasets; the suffix can be chosen from:
- `"_by_uniprot_id"`: protein-level split
- `""`: random split
- `"_by_anno"`: use an extra column named `split` to specify which rows are train and which are valid
- `"_by_good_batch"`: guarantee that each batch contains both positive and negative samples
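A plain-Python sketch of the protein-level option, under the assumption that `"_by_uniprot_id"` means all variants of one protein land on the same side of the split (the function below is illustrative, not the repo's implementation):

```python
import random

def split_by_uniprot_id(rows, val_fraction=0.2, seed=0):
    """Protein-level split sketch (assumed behavior of "_by_uniprot_id"):
    every variant of a given uniprotID goes to the same side, so the
    train and valid sets share no proteins."""
    ids = sorted({r["uniprotID"] for r in rows})
    rng = random.Random(seed)
    rng.shuffle(ids)
    n_val = max(1, int(len(ids) * val_fraction))
    val_ids = set(ids[:n_val])
    train = [r for r in rows if r["uniprotID"] not in val_ids]
    valid = [r for r in rows if r["uniprotID"] in val_ids]
    return train, valid
```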
In `model.module.utils.py`, search for `loss_fn`:

```python
loss_fn_mapping = {
    "mse_loss": mse_loss,
    "mse_loss_weighted": mse_loss_weighted,
    "l1_loss": l1_loss,
    "binary_cross_entropy": binary_cross_entropy,
    "cross_entropy": cross_entropy,
    "kl_div": kl_div,
    "cosin_contrastive_loss": cosin_contrastive_loss,
    "euclid_contrastive_loss": euclid_contrastive_loss,
    "combined_loss": combined_loss,
    "weighted_combined_loss": WeightedCombinedLoss,
    "weighted_loss": WeightedLoss2,
    "weighted_loss_betabinomial": WeightedLoss3,
    "gaussian_loss": gaussian_loss,
    "weighted_loss_pretrain": WeightedLoss1,
    "regression_weighted_loss": RegressionWeightedLoss,
    "GP_loss": GPLoss,
}
```
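The mapping suggests losses are looked up by their config-file name. A self-contained sketch of that lookup, with plain-Python stand-ins for two of the losses (the real entries are torch functions/classes in `model.module.utils.py`):

```python
import math

# Plain-Python stand-ins for two of the mapped losses (sketches only;
# the repo's versions operate on torch tensors).
def mse_loss(pred, target):
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

def binary_cross_entropy(pred, target, eps=1e-7):
    return -sum(t * math.log(p + eps) + (1 - t) * math.log(1 - p + eps)
                for p, t in zip(pred, target)) / len(pred)

loss_fn_mapping = {"mse_loss": mse_loss,
                   "binary_cross_entropy": binary_cross_entropy}

def get_loss_fn(name):
    # Look up a loss by its config-file name, as the mapping suggests.
    if name not in loss_fn_mapping:
        raise ValueError(f"unknown loss_fn: {name!r}")
    return loss_fn_mapping[name]
```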
Choose from ["PreMode", "PreMode_Star_CON", "PreMode_DIFF", "PreMode_SSP", "PreMode_Mask_Predict", "PreMode_Single"]
Seems to have no impact (it is only used to change the initialization method: if ClinVar, use uniform initialization; otherwise use the default). So here "ClinVar" actually means pretrain.
Forces use of the seq start and seq end fields in the data file; the sequence is cropped accordingly.
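The cropping step can be sketched as below; the 1-based inclusive coordinate convention is an assumption (common for protein positions) and should be verified against the repo's own code:

```python
def crop_sequence(seq: str, seq_start: int, seq_end: int) -> str:
    # Assumes 1-based, inclusive seq start/end coordinates
    # (assumption -- verify against the repo's cropping code).
    return seq[seq_start - 1 : seq_end]
```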
If > 1, train on multiple GPUs; in that case there is no need to specify the gpu-id.
Save every X batches; this also controls the validation frequency.
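The interplay between the save interval and validation can be sketched as a simple schedule (hypothetical helper, not the trainer's actual loop):

```python
def schedule(n_batches: int, save_every: int):
    # Sketch: a checkpoint (and, per the note above, a validation pass)
    # fires every `save_every` training batches.
    events = []
    for step in range(1, n_batches + 1):
        events.append(("train", step))
        if step % save_every == 0:
            events.append(("save_and_validate", step))
    return events
```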