POLYNET is a deep learning project for predicting polyadenylation sites (PAS) in genomic sequences. It uses a convolutional neural network (CNN) to classify candidate nucleotide positions as true PAS or not, based on a fixed-length window of sequence data.
```
POLYNET/
├── src/
│   ├── train.py
│   ├── models/
│   │   └── POLYNET.py
│   ├── data/
│   │   ├── Pas_Dataset.py
│   │   └── processed/
│   │       ├── pos_201_train.fa
│   │       └── ...
│   └── utils/
│       └── encoding.py
├── scripts/
│   └── split_data.py
├── models/
│   └── POLYNET.pt
├── requirements.txt
├── README.md
└── notebooks/
```
- Clone the repository

  ```
  git clone <your-repo-url>
  cd POLYNET
  ```

- Create a virtual environment (recommended)

  ```
  python3 -m venv venv
  source venv/bin/activate
  ```

- Install dependencies

  ```
  pip install -r requirements.txt
  ```
- Place your raw FASTA files (e.g., `pos_201_hg19.fa`, `neg_201_hg19.fa`) in `src/data/`.
  - Note: These files were generated using PolyADB 3.0 and are sets of positive and negative examples for training. They are included as part of the repo for reproducibility purposes.
- Processed files, split into train/test/val sets, are located in `src/data/processed/`.
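The fixed-length windows read from these FASTA files are typically one-hot encoded before being fed to the CNN. A minimal sketch of that step, assuming a standard 4-channel A/C/G/T encoding (the actual logic lives in `src/utils/encoding.py`, and the function name here is hypothetical):

```python
import numpy as np

def one_hot_encode(seq: str) -> np.ndarray:
    """Encode a nucleotide sequence as a (len(seq), 4) one-hot matrix.

    Columns correspond to A, C, G, T; any other character (e.g. N)
    becomes an all-zero row.
    """
    mapping = {"A": 0, "C": 1, "G": 2, "T": 3}
    encoded = np.zeros((len(seq), 4), dtype=np.float32)
    for i, base in enumerate(seq.upper()):
        if base in mapping:
            encoded[i, mapping[base]] = 1.0
    return encoded
```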
Run the training script from the project root using the Python module flag:

```
python -m src.train
```

This will train the model, print training/validation metrics, and save the trained model to `models/POLYNET.pt`.
The training script (src/train.py) includes functionality for random hyperparameter search. By default, it runs multiple experiments with randomly selected values for batch size, learning rate, and number of epochs. For each experiment, the script trains and evaluates the model, and records the results.
- Configuration:
  - The hyperparameter ranges are defined in the `hyper_params` dictionary in `src/train.py`.
  - You can modify the values for `batch_size`, `lr`, and `epochs` to explore different settings.
- Execution:
- When you run the training script, it will perform 10 experiments (by default), each with a different random combination of hyperparameters.
- You can change the number of experiments by modifying the loop in the main block.
- Results:
  - After all experiments, the results (including hyperparameters, training/validation losses, and test metrics) are saved to `models/model_outputs.csv`.
  - Each row in the CSV corresponds to one experiment, with columns for each hyperparameter and metric.
After training, the script will automatically evaluate the model on the test set and print AUROC and AUPRC metrics.
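Both metrics can be computed from the model's predicted probabilities with scikit-learn; a minimal sketch, assuming binary labels and per-position scores (the function and variable names are illustrative, not taken from `src/train.py`):

```python
from sklearn.metrics import average_precision_score, roc_auc_score

def evaluate_predictions(y_true, y_score) -> dict:
    """Compute AUROC and AUPRC from binary labels and predicted probabilities."""
    return {
        "auroc": roc_auc_score(y_true, y_score),
        "auprc": average_precision_score(y_true, y_score),
    }
```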
- Make sure to run all scripts from the project root directory so that relative imports and paths work correctly.
- You can modify hyperparameters (batch size, learning rate, epochs) in `src/train.py`.
See requirements.txt for a full list of dependencies.