DeNoPro - a de novo proteogenomics pipeline to identify clinically relevant novel variants from RNAseq and Proteomics data.
DeNoPro provides a pipeline for the identification of novel peptides from matched RNAseq and MS/MS proteomics data.
The pipeline consists of de novo transcript assembly (Trinity), generation of a protein sequence database of 6-frame translated transcripts, and a combination of search engines (X! Tandem, MS-GF+, Tide) to query the custom database. Identified novel peptides and protein variants are then filtered by confidence and mapped to gene models using ACTG.
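To make the "6-frame translated transcripts" step concrete, here is a minimal Python sketch of six-frame translation using the standard genetic code. It is only an illustration of the idea: in DeNoPro the database is built by the PGA package, and the function names below (`reverse_complement`, `translate`, `six_frame_translation`) are hypothetical helpers, not part of DeNoPro or PGA.

```python
from itertools import product

# Standard genetic code, enumerated in TCAG order; "*" marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def translate(seq):
    """Translate complete codons only; unknown codons (e.g. with N) become 'X'."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frame_translation(transcript):
    """Translate a transcript in three forward and three reverse frames."""
    seq = transcript.upper().replace("U", "T")
    frames = {}
    for strand, strand_seq in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(strand_seq[offset:])
    return frames
```

For example, `six_frame_translation("ATGGCC")` yields one short peptide string per frame, keyed `+1` to `+3` and `-1` to `-3`. A real database builder would additionally split on stop codons and discard very short fragments.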
To install DeNoPro as a python module, open a terminal in the directory containing setup.py, and run
python setup.py install
DeNoPro can be made executable by running chmod u+x denopro.
DeNoPro has been tested with Python 3; Python 2 is not supported at this time. R version 4.0.0 or greater is required to run the PGA package.
We recommend using a conda environment to maintain dependencies, and an environment config file using Python 3.9.6 and R 4.0.5 has been provided. To set up the conda environment, run conda env create -f denopro-env.yml and activate it with conda activate denopro-env.
- Trinity version 2.8.5 - Used during assemble for de novo assembly of RNA transcripts
- PGA (R >= 4.0) - Used in searchdb for creation of 6-frame translated protein database
- PySimpleGUIQt - Used to run the GUI functionality
- SearchGUI version 3.3.17 - Uses the X! Tandem, MS-GF+ and Tide search engines to search the created custom database against mgf spectra files
- PeptideShaker version 1.16.42 - Used to select matching identifications among the three search engines to output a list of confident novel peptides and their corresponding proteins
- ACTG - Used to map identified confident novel peptides to their corresponding genomic locations
- Bamstats - Used to process expression levels of novel peptides
DeNoPro was designed to be modular, to account for long processing times. The modes are:
assemble : de novo assembly of transcript sequences using Trinity
searchdb : produces custom peptide database from assembled transcripts which are mapped against proteomics data
identify : maps potential novel peptides from searchdb to a reference transcriptome, outputting a list of confident novel peptides
novelorf : finds novel ORFs in identified novel peptides
quantify : evaluates expression levels of identified novel peptides in a sample
The standard workflow is
assemble >> searchdb >> identify >> novelorf >> quantify
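Because the modes are independent, the standard workflow can also be driven from a small script. The sketch below assumes that `denopro` is on the PATH and that every mode accepts the `-c` config option described in this README; `build_commands` and `run_workflow` are hypothetical helper names, not part of DeNoPro.

```python
import subprocess

# Mode order taken from the standard workflow described above.
MODES = ["assemble", "searchdb", "identify", "novelorf", "quantify"]


def build_commands(config_file="./denopro.conf", start="assemble", stop="quantify"):
    """Return one command line per mode, from `start` to `stop` inclusive."""
    first, last = MODES.index(start), MODES.index(stop)
    if first > last:
        raise ValueError(f"{start!r} comes after {stop!r} in the workflow")
    return [["denopro", mode, "-c", config_file] for mode in MODES[first:last + 1]]


def run_workflow(config_file="./denopro.conf", start="assemble", stop="quantify"):
    """Run the selected modes in order, stopping at the first failure."""
    for command in build_commands(config_file, start, stop):
        print("Running:", " ".join(command))
        # check=True raises CalledProcessError if a mode exits with an error.
        subprocess.run(command, check=True)
```

Calling `run_workflow(start="identify")` would resume from the identify step, which is useful when assembly and database searching have already finished.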
De novo assembly of transcript sequences using Trinity
denopro assemble [options]
-c/--config_file: Path to the config file to use. Default is ./denopro.conf
--cpu: Maximum number of threads to be used by Trinity
--max_mem: Maximum amount of RAM (in GB) that can be allocated
output_dir: Directory to use as pipeline output
dependency_locations/trinity: Full path to Trinity installation
directory_locations/fastq_for_trinity: Directory containing FASTQ files
Produces custom peptide database from assembled transcripts which are mapped against proteomics data
denopro searchdb [options]
-c/--config_file: Path to the config file to use. Default is ./denopro.conf
output_dir: Directory to use as pipeline output
dependency_locations/searchgui: Full path to SearchGUI .jar file
dependency_locations/peptideshaker: Full path to PeptideShaker .jar file
directory_locations/spectra_files: Directory containing .mgf files for database searching
dependency_locations/hg19: Full path to reference transcriptome (FASTA) of protein coding genes
Maps potential novel peptides from searchdb to a reference transcriptome, outputting a list of confident novel peptides
denopro identify [options]
-c/--config_file: Path to the config file to use. Default is ./denopro.conf
output_dir: Directory to use as pipeline output
dependency_locations/actg: Full path to directory containing ACTG.jar and param.xml files
Note: The transcriptome model and reference genome are only needed if a serialization file has to be constructed. In that case, leave serialization_file blank.
actg_options/transcriptome_gtf: Path to transcriptome model to be used for mapping
actg_options/ref_genome: Path to directory containing reference genome (each file name must be the same as chromosome number written in the GTF files)
actg_options/mapping_method: Mapping method to be used. Options are PV (Mapping [P]rotein database first, then [V]ariant splice graph), PS (Mapping [P]rotein database first, then [S]ix-frame translation), VO (Mapping [V]ariant splice graph [O]nly), SO (Mapping [S]ix-frame translation [O]nly)
protein_database: If mapping_method is PV or PS, path to directory containing protein database
serialization_file: Path to serialization file of a variant splice graph
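The dependencies between the ACTG options can be checked before starting a long run. The following is a minimal sketch, assuming the options have already been read into a plain Python dict keyed by option name; `check_actg_options` is a hypothetical helper and DeNoPro itself may validate its configuration differently.

```python
# The four mapping methods listed above, with a short description of each.
MAPPING_METHODS = {
    "PV": "protein database first, then variant splice graph",
    "PS": "protein database first, then six-frame translation",
    "VO": "variant splice graph only",
    "SO": "six-frame translation only",
}


def check_actg_options(options):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    method = options.get("mapping_method", "")
    if method not in MAPPING_METHODS:
        allowed = ", ".join(sorted(MAPPING_METHODS))
        problems.append(f"mapping_method must be one of: {allowed}")
    # PV and PS both start by mapping against a protein database.
    if method in ("PV", "PS") and not options.get("protein_database"):
        problems.append("protein_database is required for PV and PS")
    # A blank serialization_file means one has to be constructed, which
    # needs the transcriptome model and the reference genome.
    if not options.get("serialization_file"):
        for key in ("transcriptome_gtf", "ref_genome"):
            if not options.get(key):
                problems.append(f"{key} is required when serialization_file is blank")
    return problems
```

For instance, `check_actg_options({"mapping_method": "VO", "serialization_file": "graph.ser"})` returns an empty list, since an existing serialization file makes the GTF and reference genome unnecessary.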
Finds novel ORFs in identified novel peptides
denopro novelorf [options]
-c/--config_file: Path to the config file to use. Default is ./denopro.conf
output_dir: Directory to use as pipeline output
Evaluates expression levels of identified novel peptides
denopro quantify [options]
-c/--config_file: Path to the config file to use. Default is ./denopro.conf
output_dir: Directory to use as pipeline output
quantification_options/bamstats: Full path to bamstats .jar file
quantification_options/bam_files: Full path to directory containing BAM files to be analysed
quantification_options/bed_file: Full path to BED file to be used. Will be created with data from previous steps if left blank
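Since each mode reads a different subset of the config file, a quick pre-flight check can catch missing entries before a mode is started. The sketch below assumes the config has been loaded into a nested dict in which `section/key` names map to `config["section"]["key"]`; how the file is parsed is left open, and treating the listed options as mandatory is an assumption of this example. `REQUIRED_OPTIONS`, `lookup` and `missing_options` are hypothetical names, not part of DeNoPro.

```python
# Options each mode reads, as listed in this README. Options that may be
# left blank (bed_file, serialization_file) are deliberately omitted.
REQUIRED_OPTIONS = {
    "assemble": [
        "output_dir",
        "dependency_locations/trinity",
        "directory_locations/fastq_for_trinity",
    ],
    "searchdb": [
        "output_dir",
        "dependency_locations/searchgui",
        "dependency_locations/peptideshaker",
        "directory_locations/spectra_files",
        "dependency_locations/hg19",
    ],
    "identify": [
        "output_dir",
        "dependency_locations/actg",
        "actg_options/mapping_method",
    ],
    "novelorf": ["output_dir"],
    "quantify": [
        "output_dir",
        "quantification_options/bamstats",
        "quantification_options/bam_files",
    ],
}


def lookup(config, path):
    """Resolve 'section/key' (or a bare 'key') in a nested dict; None if absent."""
    node = config
    for part in path.split("/"):
        if not isinstance(node, dict) or part not in node:
            return None
        node = node[part]
    return node


def missing_options(config, mode):
    """Return the options for `mode` that are absent or blank in `config`."""
    return [path for path in REQUIRED_OPTIONS[mode] if not lookup(config, path)]
```

An empty result from `missing_options(config, "quantify")` means every option listed for that mode has a non-blank value; it does not verify that the paths exist on disk.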
DeNoPro offers a graphical interface to run the pipeline and edit configuration files.

The GUI uses the Qt framework through PySimpleGUIQt, which can be installed with `conda install PySimpleGUIQt`.

