mSEEKR

This is a program for identifying regions of high similarity based on k-mer content to some set of query sequences, relative to a null background set of sequences.

Dependencies

Python3.6

check with command

python -V

if < 3.6, easiest solution is to download anaconda python3.6/3.7

cython

Try

which cython

if none

pip install cython

OR

conda install -c anaconda cython

Installation

Type following commands in desired directory

	git clone https://github.com/wdlv/mSEEKR.git
	cd mSEEKR/
	python setup.py build_ext --inplace

Ignore warnings (unless further steps don't work)

Counting k-mers

Curate unique fasta files for queries and null model before hand
Use the following command

  python kmers.py --fasta ./fastaFiles/mA.fa -k 2,3,4 --name mouseA --dir ./counts/
  python kmers.py --fasta ./fastaFiles/gencode.vM17.lncRNA_transcripts.fa -k 2,3,4 --name mm10Trscpts --dir ./counts/

Parameters:

--fasta : Path to fasta file. (Required)
-k : Comma delimited string of possible k-mer values. For example, 3,4,5 or just 4. (Required)
--name : Desired output name for count file. Default is 'out'
--dir : Directory to save output count file. Default is './'
-a : String, Alphabet to generate k-mers (e.g. ATCG). Default is 'ATCG'

Output:

Outputs binary .dict files containing count matrices to directory set by --dir

Training markov models

Count k-mers for queries and null before proceeding
Run the following command

  python train.py --query ./counts/mouseA.dict --null ./counts/mm10Trscpts.dict -k 2,3,4 --qPrefix mouseA --nPrefix mm10Trscpts --qT .9999 --nT .9999 --dir ./markovModels/

Parameters:

--query : Path to kmer count file for sequences of interest (e.g. functional regions of a ncRNA). (Required)
--null : Path to kmer count file that compose null model (e.g. transcriptome, genome, etc). (Required)
-k : Comma separated list of values of k to train for, must have been calculated prior. (Required)
--qPrefix : prefix file name for query. (Required)
--nPrefix : prefix file name for null. (Required)
--qT : Query to query transition parameter, query to null is 1 - qT. (Required)
--nT : Null to null transition parameter, null to query is 1 - nT. (Required)
--dir : Output directory. Default is './'
-a : String, Alphabet to generate k-mers (e.g. ATCG). Default is 'ATCG'

Output:

Directory containing models for each value of k specified, the directory structure would look like:

| markovModels
|
|--- mouseA_mouseNull
|------- 2
|------------- hmm.dict
|------- 3
|------------- hmm.dict
|------- 4
|------------- hmm.dict
|--- mouseB_mouseNull
.
.
.

(OPTIONAL) Optimize transition parameters using the B-W Algorithm

This script will take the output .mkv output file from train.py and find an MLE for the transition parameters.

Run the following command

  python bw.py -k 4 --db ./fastaFiles/xist.fa --prior markovModels/mouseA_mm10Trscpts/4/hmm.dict --its 3 -cf

The above command takes the HMM trained on repeat A at k =4, and uses Xist to find an MLE for the transition parameters. More than one sequence can be provided as training, if a multi-entry FASTA file is provided.

Parameters:

-k : Value of k for k-mers. One integer is requried. (Required)
--db : Path to fasta file containing training sequences. (Required)
--prior : Path to binary .dict file output from train.py(e.g. markovModels/D_null/2/hmm.dict). (Required)
--its : Number of iterations of the baum-welch algorithm to run. Default is 20
-cf : FLAG, pass this argument to create a new file in the same directory as --prior rather than overwrite

Output:

Replaces the file passed as --prior with an updated version that contains the MLE transition matrix
A file containing the trajectory of the transition parameters over each loop of the BW algorithm, check this for convergence

Find HMM state path through sequences of interest

Run the mSEEKR command

  python mSEEKR.py --db ./fastaFiles/mm10kcn.fa --prefix kcn_queryMouseA --model ./markovModels/mouseA_mm10Trscpts/4/hmm_MLE.dict -k 4 --fasta

Parameters

--db : Path to fasta file with sequences to calculate similarity score. (Required)
--model : Path to .dict file output from train.py or bw.py. (Required)
-k : Value of k to use. Must be the same as the k value used in training. (Required)
--prefix : File name for output, useful to include information about the experiment. (Required)

Calculate background fasta's seekr pearson correlation score's mean and standard deviation

Run the getBackgroundMatrixMeanStd.py command

python getBackgroundMatrixMeanStd.py --backgroundFasta ./fastaFiles/gencode.vM17.lncRNA_transcripts.fa -k 4

Parameters

--backgroundFasta : Path to lncRNA background sequences fasta file; used to calculate mean and standard deviation of k-mer seekr score matrix. (Required)
-k : The same k value used in the python mSEEKR.py step. (Required)
--name : Name for the file output background fasta kmer seekr score matrix mean and std. Default is 'bgMatrixMeanStd'
--outdir : Directory to save output background fasta kmer seekr score matrix mean and std. Default is './'
-a : String, Alphabet to generate k-mers (e.g. ATCG). Default is 'ATCG'

Calculate seekr pearson correlation score

Run the getSeekrScore command

python getSeekrScore.py --queryFasta ./fastaFiles/mA.fa --backgroundFasta ./fastaFiles/gencode.vM17.lncRNA_transcripts.fa --backgroundMatrixMeanStd ./bgMatrixMeanStd.txt --mSEEKRdataframeDir ./kcn_queryMouseA_4_viterbi.txt -k 4

Parameters

--queryFasta : Path to query fasta file. NOTE: All the sequences in query fasta file will be merged to one sequence before processing. (Required)
--backgroundFasta : Path to lncRNA background sequences fasta file. (Required)
--backgroundMatrixMeanStd : Path to lncRNA background matrix mean and std generated from getBackgroundMatrixMeanStd.py. Note: backgroundMatrixMeanStd and backgroundFasta are using the same fasta file. The reason there are two commands is that computing mean and std of background fasta kmer seekr score requires too much resources. (Required)
--mSEEKRdataframeDir : Directory to read in the output dataframe generated from mSEEKR.py. (Required)
-k : The same k value used in the mSEEKR.py step. (Required)
--dir : Directory to save output dataframe. Default is './'
--name : Name for output file. Default is 'output'
--minSeqLength : The minimum length of sequence found. Default is 0
-a : String, Alphabet to generate k-mers (e.g. ATCG). Default is 'ATCG'

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

mSEEKR

Dependencies

Python3.6

cython

Installation

Type following commands in desired directory

Ignore warnings (unless further steps don't work)

Counting k-mers

Parameters:

Training markov models

(OPTIONAL) Optimize transition parameters using the B-W Algorithm

Find HMM state path through sequences of interest

Calculate background fasta's seekr pearson correlation score's mean and standard deviation

Calculate seekr pearson correlation score

About

Uh oh!

Releases

Packages

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 258 Commits
counts		counts
fastaFiles		fastaFiles
markovModels		markovModels
.gitignore		.gitignore
README.md		README.md
bw.py		bw.py
corefunctions.py		corefunctions.py
getBackgroundMatrixMeanStd.py		getBackgroundMatrixMeanStd.py
getSeekrScore.py		getSeekrScore.py
kmers.py		kmers.py
kmers.pyx		kmers.pyx
mSEEKR.py		mSEEKR.py
setup.py		setup.py
train.py		train.py

wdlv/mSEEKR

Folders and files

Latest commit

History

Repository files navigation

mSEEKR

Dependencies

Python3.6

cython

Installation

Type following commands in desired directory

Ignore warnings (unless further steps don't work)

Counting k-mers

Parameters:

Training markov models

(OPTIONAL) Optimize transition parameters using the B-W Algorithm

Find HMM state path through sequences of interest

Calculate background fasta's seekr pearson correlation score's mean and standard deviation

Calculate seekr pearson correlation score

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages