Code for finding putative enhancers using Hi-C data
python htad-chain.py <config file>
For running the example use
python htad-chain.py examples/himr90.chr20.conf
from the repo directory.
resresolution of the Hi-C file, in bases (40000)wininteraction distance cutoff in bases, usually 2000000chrnamechromosome name, used for genes, size and output. should use 'chr1'..'chrX'chrsizepath to bed file of chromsome lengths (examples/hg19.size.bed)output_prefixoutput prefix for this conf file (hIMR90)output_dirpath in which to store the output files (examples/output)input_matrixpath to input Hi-C matrix for the chromosome (see format specifications below)genes_filepath to bed file describing genes (examples/hg19.genes.bed)
For a functioning example config file consult examples/himr90.chr20.conf
The code was used on a Linux machine. It has scripts in matlab, python and perl, so the minimal requirements would be -
- Matlab
- python2.7
- Perl
- Unix tools - cut, sed, pushd, popd (typically installed by default)
htad-chain.pyMain command line interface for PSYCHICmatlab/Main matlab filesdomaincall_software/Slightly adapted files from Hi-C DomainCallerinsulation/Domain caller from Crane et al. and additional scripts Insulation Score DomainCallerexamples/Example files, contains config, Hi-C matrix, chromsome sizes and gene list
input_matrix should be in a csv or tab-delimitered file, specifying the Hi-C data (for a given chomromsome).
The first column could be either empty or contain the names of each genomic segment (in a fixed stride, as specified in the res variable).
The rest of the matrix should be symmetric, with each cell (< i,j >) containing the number of contacts between the matching segments in the chromosome. These data could be previously normalized to account for various Hi-C biases, and is assumed to be symmetric. See example under examples/hIMR90.chr20.matrix.txt.
The program outputs multiple intermediate files, and final enhancers files.
.enh_p.bedbed file of over-represented pairs with FDR value < p, each line is of the format [chr start end], [gene, distance to enhancer, FDR, p-value, expected # of interactions, observed # of interactions].model.estimated.params.bedthe power-law parameters for the model, each line represents a TAD / merge / Sky Specifically, fields include [ chr start end ], class { TAD / merge / Sky }, parameters for first (or only) power-law segment [ slope, intersect, RMSE ], same for second power-law segment (or NaN), transition point between two models, in log2(bp).
Temporary files
.enh_rand.bedset of random interactions with promoters, used as control for near-promoter enrichment values.domains,domains.txtthe domains found be the specified TAD-calling algorithm (DI/Insulation Score).domains.new.txtrefined domains found by the program.7col,.DI,.HMMused by Directionality Index domain caller.prob.{bg,tad}.matrix.txt,supersum.txtthe probabilstic parameters, based on the given domains.model.estimated.matrix.txtthe model, estimated using the data, in tab deilimeted format.llr.txtthe log-likelihood ratio of observed counts and the model.hierarchy.bedthe constructed hierarchy of the domains.mdump*files are the outputs of executed matlab functions, for debugging..fixed_matrixthe input matrix converted to the desired format
The prefix specified in the configuration file is prepended to the mentioned names output_prefix and chrname