This is a program for identifying regions of high similarity based on k-mer content to some set of query sequences, relative to a null background set of sequences.
check with command
python -V
if < 3.6, easiest solution is to download anaconda python3.6/3.7
Try
which cython
if none
pip install cython
OR
conda install -c anaconda cython
git clone https://github.com/wdlv/mSEEKR.git
cd mSEEKR/
python setup.py build_ext --inplace
- Curate unique fasta files for queries and null model before hand
- Use the following command
python kmers.py --fasta ./fastaFiles/mA.fa -k 2,3,4 --name mouseA --dir ./counts/
python kmers.py --fasta ./fastaFiles/gencode.vM17.lncRNA_transcripts.fa -k 2,3,4 --name mm10Trscpts --dir ./counts/
- --fasta : Path to fasta file. (Required)
- -k : Comma delimited string of possible k-mer values. For example, 3,4,5 or just 4. (Required)
- --name : Desired output name for count file. Default is 'out'
- --dir : Directory to save output count file. Default is './'
- -a : String, Alphabet to generate k-mers (e.g. ATCG). Default is 'ATCG'
Output:
Outputs binary .dict files containing count matrices to directory set by --dir
- Count k-mers for queries and null before proceeding
- Run the following command
python train.py --query ./counts/mouseA.dict --null ./counts/mm10Trscpts.dict -k 2,3,4 --qPrefix mouseA --nPrefix mm10Trscpts --qT .9999 --nT .9999 --dir ./markovModels/
Parameters:
- --query : Path to kmer count file for sequences of interest (e.g. functional regions of a ncRNA). (Required)
- --null : Path to kmer count file that compose null model (e.g. transcriptome, genome, etc). (Required)
- -k : Comma separated list of values of k to train for, must have been calculated prior. (Required)
- --qPrefix : prefix file name for query. (Required)
- --nPrefix : prefix file name for null. (Required)
- --qT : Query to query transition parameter, query to null is 1 - qT. (Required)
- --nT : Null to null transition parameter, null to query is 1 - nT. (Required)
- --dir : Output directory. Default is './'
- -a : String, Alphabet to generate k-mers (e.g. ATCG). Default is 'ATCG'
Output:
Directory containing models for each value of k specified, the directory structure would look like:
| markovModels
|
|--- mouseA_mouseNull
|------- 2
|------------- hmm.dict
|------- 3
|------------- hmm.dict
|------- 4
|------------- hmm.dict
|--- mouseB_mouseNull
.
.
.
This script will take the output .mkv output file from train.py and find an MLE for the transition parameters.
- Run the following command
python bw.py -k 4 --db ./fastaFiles/xist.fa --prior markovModels/mouseA_mm10Trscpts/4/hmm.dict --its 3 -cf
The above command takes the HMM trained on repeat A at k =4, and uses Xist to find an MLE for the transition parameters. More than one sequence can be provided as training, if a multi-entry FASTA file is provided.
Parameters:
- -k : Value of k for k-mers. One integer is requried. (Required)
- --db : Path to fasta file containing training sequences. (Required)
- --prior : Path to binary .dict file output from train.py(e.g. markovModels/D_null/2/hmm.dict). (Required)
- --its : Number of iterations of the baum-welch algorithm to run. Default is 20
- -cf : FLAG, pass this argument to create a new file in the same directory as --prior rather than overwrite
Output:
- Replaces the file passed as --prior with an updated version that contains the MLE transition matrix
- A file containing the trajectory of the transition parameters over each loop of the BW algorithm, check this for convergence
- Run the mSEEKR command
python mSEEKR.py --db ./fastaFiles/mm10kcn.fa --prefix kcn_queryMouseA --model ./markovModels/mouseA_mm10Trscpts/4/hmm_MLE.dict -k 4 --fasta
Parameters
- --db : Path to fasta file with sequences to calculate similarity score. (Required)
- --model : Path to .dict file output from train.py or bw.py. (Required)
- -k : Value of k to use. Must be the same as the k value used in training. (Required)
- --prefix : File name for output, useful to include information about the experiment. (Required)
- Run the getBackgroundMatrixMeanStd.py command
python getBackgroundMatrixMeanStd.py --backgroundFasta ./fastaFiles/gencode.vM17.lncRNA_transcripts.fa -k 4
Parameters
- --backgroundFasta : Path to lncRNA background sequences fasta file; used to calculate mean and standard deviation of k-mer seekr score matrix. (Required)
- -k : The same k value used in the python mSEEKR.py step. (Required)
- --name : Name for the file output background fasta kmer seekr score matrix mean and std. Default is 'bgMatrixMeanStd'
- --outdir : Directory to save output background fasta kmer seekr score matrix mean and std. Default is './'
- -a : String, Alphabet to generate k-mers (e.g. ATCG). Default is 'ATCG'
- Run the getSeekrScore command
python getSeekrScore.py --queryFasta ./fastaFiles/mA.fa --backgroundFasta ./fastaFiles/gencode.vM17.lncRNA_transcripts.fa --backgroundMatrixMeanStd ./bgMatrixMeanStd.txt --mSEEKRdataframeDir ./kcn_queryMouseA_4_viterbi.txt -k 4
Parameters
- --queryFasta : Path to query fasta file. NOTE: All the sequences in query fasta file will be merged to one sequence before processing. (Required)
- --backgroundFasta : Path to lncRNA background sequences fasta file. (Required)
- --backgroundMatrixMeanStd : Path to lncRNA background matrix mean and std generated from getBackgroundMatrixMeanStd.py. Note: backgroundMatrixMeanStd and backgroundFasta are using the same fasta file. The reason there are two commands is that computing mean and std of background fasta kmer seekr score requires too much resources. (Required)
- --mSEEKRdataframeDir : Directory to read in the output dataframe generated from mSEEKR.py. (Required)
- -k : The same k value used in the mSEEKR.py step. (Required)
- --dir : Directory to save output dataframe. Default is './'
- --name : Name for output file. Default is 'output'
- --minSeqLength : The minimum length of sequence found. Default is 0
- -a : String, Alphabet to generate k-mers (e.g. ATCG). Default is 'ATCG'