We present the program and the results of our research into a practical algorithm that, given a collection of unorganised reads (strings of the nucleobases A, C, G, T) from a tumour tissue and a healthy tissue of the same specimen, searches for mutations of certain types. Counting and categorizing these mutations, and identifying the precise changes relative to the normal genome, can in turn support better detection, diagnosis and treatment. The algorithm uses the reads to partially assemble the genome with the help of an existing external de novo assembly program. Once the reads have been assembled into longer sequences, called ‘contigs’, the algorithm uses a dictionary-type data structure to reduce the number of comparisons between them.
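The idea of using a dictionary to cut down on contig comparisons can be sketched as follows. This is a minimal illustration, not the project's actual code: it assumes the dictionary maps k-mers to the healthy contigs that contain them, and the names `build_kmer_index` and `candidate_pairs` are hypothetical.

```python
from collections import defaultdict

def build_kmer_index(contigs, k):
    """Map every k-mer to the set of contig ids that contain it."""
    index = defaultdict(set)
    for contig_id, seq in contigs.items():
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].add(contig_id)
    return index

def candidate_pairs(healthy, tumour, k):
    """Return (healthy_id, tumour_id) pairs that share at least one k-mer."""
    index = build_kmer_index(healthy, k)
    pairs = set()
    for tumour_id, seq in tumour.items():
        for i in range(len(seq) - k + 1):
            # .get avoids creating empty entries for unseen k-mers
            for healthy_id in index.get(seq[i:i + k], ()):
                pairs.add((healthy_id, tumour_id))
    return pairs

# Toy data (assumed for illustration only)
healthy = {"h1": "ACGTACGTAA", "h2": "TTTTGGGGCC"}
tumour = {"t1": "ACGTACCTAA", "t2": "CCCCCCCCCC"}
pairs = candidate_pairs(healthy, tumour, k=5)
```

Only the pairs returned here would need a full comparison, instead of every healthy contig against every tumour contig.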
- Minia assembler - a short-read assembler based on a de Bruijn graph; its output is a set of contigs. See more at the Minia page.
The Minia command line we used:
./minia -in reads.fa -kmer-size 24 -abundance-min 3 -out output_prefix
The main parameters are:
- reads.fa – the input file(s)
- kmer-size 24 – k-mer length (integer); the value may vary depending on user choice
- abundance-min 3 – hard cut-off to remove likely erroneous, low-abundance k-mers
- output_prefix – any prefix string to store output contigs as well as temporary files for this assembly
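To make the k-mer length and abundance cut-off concrete, here is a small sketch of k-mer counting with a minimum-abundance filter. It is not Minia's implementation: the function name `solid_kmers` is hypothetical, the cut-off is assumed to be inclusive, and reverse complements are ignored for simplicity.

```python
from collections import Counter

def solid_kmers(reads, k, abundance_min):
    """Count all k-mers in the reads and keep those seen at least abundance_min times."""
    counts = Counter(
        read[i:i + k]
        for read in reads
        for i in range(len(read) - k + 1)
    )
    return {kmer: n for kmer, n in counts.items() if n >= abundance_min}

# Toy reads (assumed for illustration only)
reads = ["ACGTA", "ACGTA", "ACGTT", "ACGAA"]
kept = solid_kmers(reads, k=4, abundance_min=3)
```

In this toy input only `ACGT` occurs three times, so every other k-mer is discarded as a likely sequencing error.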
- edit-distance - Python module for computing edit distances and alignments between sequences. See more at the edit-distance page.
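For readers unfamiliar with edit distances, the following plain-Python sketch shows the kind of result such a computation yields: a distance plus a list of edit operations. It does not use or reproduce the edit-distance module's API; the function `edit_ops` and its output format are assumptions made for illustration.

```python
def edit_ops(ref, hyp):
    """Return the Levenshtein distance and one optimal list of edit operations."""
    n, m = len(ref), len(hyp)
    # dist[i][j] = edit distance between ref[:i] and hyp[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = i
    for j in range(1, m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j - 1] + cost,  # match or replace
                dist[i - 1][j] + 1,         # delete from ref
                dist[i][j - 1] + 1,         # insert into ref
            )
    # Walk back from the bottom-right corner to recover the operations
    ops = []
    i, j = n, m
    while i > 0 or j > 0:
        diag_cost = 1 if i > 0 and j > 0 and ref[i - 1] != hyp[j - 1] else 0
        if i > 0 and j > 0 and dist[i][j] == dist[i - 1][j - 1] + diag_cost:
            if diag_cost:
                ops.append(("replace", ref[i - 1], hyp[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            ops.append(("delete", ref[i - 1], ""))
            i -= 1
        else:
            ops.append(("insert", "", hyp[j - 1]))
            j -= 1
    ops.reverse()
    return dist[n][m], ops

distance, ops = edit_ops("ACGT", "ACCT")
```

Here `ref` plays the role of a healthy contig and `hyp` the role of a tumour contig, so each operation describes one candidate point mutation.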
In the 'main-code' folder, run:
python3.6 run_compare_tissues healthy_file_path tumor_file_path output_prefix(optional) test(optional) test_num(optional)
The parameters are:
- healthy_file_path, tumor_file_path - contig files in FASTA format
- test - (optional) a flag intended to assist the software development process. If the 'test' argument is present, the software processes only up to test_num contigs.
- test_num - int (optional); used only when the 'test' argument is present
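As an illustration of how contig files in FASTA format might be loaded with an optional cap on the number of contigs, consider the sketch below. It is an assumption about the behaviour described above, not the program's actual code; `read_contigs` and its `limit` argument are hypothetical names, and `limit` is expected to be a positive integer or `None`.

```python
import io

def read_contigs(handle, limit=None):
    """Parse FASTA records from an open text handle, stopping after `limit` records."""
    contigs = {}
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record
            if name is not None:
                contigs[name] = "".join(chunks)
                if limit is not None and len(contigs) >= limit:
                    return contigs
            name, chunks = line[1:].strip(), []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        contigs[name] = "".join(chunks)
    return contigs

# Toy FASTA content (assumed for illustration only)
fasta = ">c1\nACGT\nACGT\n>c2\nTTTT\n>c3\nGGGG\n"
first_two = read_contigs(io.StringIO(fasta), limit=2)
all_contigs = read_contigs(io.StringIO(fasta))
```

With a real file, the same function could be called as `read_contigs(open(path), limit=test_num)`.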
The program outputs three reports:
- A mutation_distance object containing the counts of point mutations, broken down by type and nucleotide, together with the percentage of mutations relative to the strings' length.
- Diagrams per mutation type (png files)
- A sampling file (txt file): a collection of already compared strings and the distances between them, illustrating mutations that were found in the genome.
For example, consider the diagram of replacements produced by an experiment on 10,000 contigs. In it, the string 'AC' means that 'A' in the healthy tissue was replaced with 'C' in the tumor tissue, and so on.
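The two-letter keys used in the diagram can be produced by a simple tally. The sketch below assumes two already aligned strings of equal length, which is a simplification of the real comparison; `count_replacements` and `mutation_percentage` are hypothetical helper names.

```python
from collections import Counter

def count_replacements(healthy, tumour):
    """Tally substitutions as two-letter keys: healthy base followed by tumour base."""
    if len(healthy) != len(tumour):
        raise ValueError("sequences must be aligned to the same length")
    return Counter(h + t for h, t in zip(healthy, tumour) if h != t)

def mutation_percentage(healthy, tumour):
    """Percentage of positions that differ, relative to the string length."""
    replaced = sum(count_replacements(healthy, tumour).values())
    return 100.0 * replaced / len(healthy)

# Toy aligned pair (assumed for illustration only)
counts = count_replacements("ACGTAC", "CCGTAG")
percent = mutation_percentage("ACGTACGT", "ACGTACGA")
```

In the first pair, 'A' became 'C' at the first position and 'C' became 'G' at the last, so the tally holds one 'AC' and one 'CG'.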
