suffix-tree

Generalized suffix trees in Python built in linear time using Ukkonen's Algorithm. Implements the linear time LCA preprocessing algorithm with constant time retrieval. Supports a few applications of generalized suffix trees, as well as the ability to graph the tree using graphviz.

Construction

>>> tree = SuffixTree(['GATTACA', 'TAGACCA', 'ATACATA'])

Implemented Algorithms

Find all occurrences of a substring

>>> tree = SuffixTree(['GATTACA', 'TAGACCA', 'ATACATA'])
>>> tree.find('TA')
[(1, 0), (2, 1), (0, 3), (2, 5)]

The numbering here is (string_id, starting_index). For example, the third string (string_id = 2) has the substring 'TA' start at indexes 1 and 5. Note that the results are not sorted.

Longest Common Substring

>>> tree = SuffixTree(['GATTACA', 'TAGACCA', 'ATACATA'])
>>> tree.lcs()
['CA', 'AC', 'TA']

Graphing

>>> graph = tree.create_graph(suffix_link=False)  # Same as tree.create_graph()
>>> graph.render('/tmp/test.gv', view=True) # graph is a graphviz object

Creates a graph of the suffix tree using graphviz, with or without suffix links. The internal nodes are labeled with their 'depth first search' (dfs) numbers. The leaves are labeled as follows: dfs: [(string_id_0, starting_index_0), (string_id_1, starting_index_1), ...] where string_id is the index of the string in the array given to the tree upon construction. For example, in SuffixTree(['GATTACA', 'TAGACCA', 'ATACATA']), 'TAGACCA' would have string_id = 1.
Note that graphviz is a Python module but you must also download the graphviz software. I had a bit of trouble using graphviz in IDLE due to PATH issues (I added a line of code to hopefully fix this for others) but found it to work fine in terminal (on Mac).

Notes

Currently the dollar sign $ is used as a terminal character. You should change self._terminal_character if you are using strings with the $ sign for now.
The generalized suffix tree will use the same terminal character for every string. Some algorithms are typically described by using a different terminal character for each string, so keep this in mind if you are trying to learn from this code/add an algorithm.
This code is tested using problems from Rosalind but is not verified to be correct. If you are looking for speed, then you should use a different library. In addition to using a faster language, you may want to look for a suffix array alternative of an algorithm or use one of the distributed construction algorithms for suffix trees.
The LCA preprocessing algorithm needs some tuning (see Gusfield section 8.10. For the purists: how to avoid bit-level operations) to be truly linear and will probably run into stack overflow issues with too large a dataset due to recursion being used for DFS.

Helpful resources

I used these resources to help implement this

Ukkonen's original paper: https://www.cs.helsinki.fi/u/ukkonen/SuffixT1withFigs.pdf
Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology (Dan Gusfield)
https://stackoverflow.com/questions/9452701/ukkonens-suffix-tree-algorithm-in-plain-english and the code provided from the answer https://stackoverflow.com/a/14580102
https://www.geeksforgeeks.org/ukkonens-suffix-tree-construction-part-1/ (helpful hints relating Gusfield to the stackoverflow answer)

I have attempted to provide a (messy, incomplete) proof of my algorithm's correctness related to Gusfield's algorithm for those interested. It is at the bottom of the file and helped me implement the algorithm. I will attempt to add more suffix tree applications as I complete the Rosalind problems.

Name		Name	Last commit message	Last commit date
Latest commit History 5 Commits
suffix_tree		suffix_tree
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

suffix-tree

Construction

Implemented Algorithms

Find all occurrences of a substring

Longest Common Substring

Graphing

Notes

Helpful resources

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

suffix-tree

Construction

Implemented Algorithms

Find all occurrences of a substring

Longest Common Substring

Graphing

Notes

Helpful resources

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages