Before you consider using the SPLITT C++ library, try to answer the following questions for yourself.
Do you use pre-order or post-order traversal of a tree-like data-structure?
If you don’t understand what I mean, chances are that you simply do not need to deal with SPLITT at that point. The reason is that I wrote SPLITT to solve a very specific problem, which I’ve encountered many times during my doctoral studies: perform the same type of calculation for all nodes in a big tree, where each node is related to some specific data and the calculation for one node uses as input the result from the calculation performed on its daughter nodes. This is what I mean by post-order traversal. A pre-order traversal is the same with only one difference – replace the term daughter nodes with the term parent node in the sentence above.
Do you need the tree-traversal to be fast?
This very much depends on how frequently do you need to perform the tree-traversal. If you only need to do it once or a couple of times with a given dataset, then I don’t see much need of speeding this up. In my specific use-case, I needed to run a tree traversal millions of times and this was by far the heaviest computational operation during the analysis of the biological data at my disposal. It would not be an exaggeration, if I say that, without SPLITT, my PhD would have needed several more decades of computation time on a modern high performance computing cluster. This was reduced to a few months, thanks to the fast C++ implementation, based on SPLITT.
How big are your trees and how complex is your node traversal operation?
When I said big tree I meant a tree of more than 100 tips, sometimes, more than 10000 tips. This is not a hard rule of thumb, but parallel traversal will have a performance benefit mostly on a big tree. This depends on other factors as well, such as the computational and memory complexity of the node traversal operation. If the tree is relatively small and the operation is very simple (e.g. the addition or multiplication of a few double-precision numbers), then it is very likely that using SPLITT in parallel mode would not be faster than using it in serial mode. Yet, using SPLITT might still be a good idea in this case if you currently have an implementation in a slowly interpreted language, such as R, and wish to have a faster C++ version - in this case a serial SPLITT-based implementation is probably going to be between 10 and 200 times faster than the R-version.
Using the SPLITT library consists in three steps:
TraversalSpecification class. In this C++ class, you tell SPLITT what to do with each node in the tree during a traversal.TraversalTask for your TraversalSpecification. The TraversalTask object lives in the computer memory during the execution of your program. It wraps three objects:
TraversalSpecification-class;TraversalAlgorithm class, such as a PreOrderTraversal object or PostOrderTraversal object.TraversalTask. Once you have created a TraversalTask object you call its TraverfseTree() method once or (preferably) multiple times, each time passing specific arguments, hereby, called parameters. The rest is to use the value produced at the end of the traversal for your specific application goal.Here is what you need before you can start working with SPLITT:
A C++11 compiler. So far I’ve been able to compile a SPLITT-based program on OS X and Linux using the following compilers:
Optional: If you compile with OpenMP enabled (compile flag -fopenmp), then you need a copy of the omp shared library on your link path. On an OS X system, I’ve used the following:
Optional: Rcpp and RcppArmadillo packages in R. Installing these packages will be needed if you wish to use SPLITT as a back-end for an R-package of your own.
Being a software library, SPLITT does not provide a high-level end-user interface. Rather, SPLITT is used via its application programming interface (API). In other words, you use the library by putting your application specific code at several locations in your program where SPLITT would expect it to find it. The easiest way to do this is to start from an example and use it as a skeleton for your own code, i.e. replace the code in the example with your application specific data-types, and traversal operations.
The goal of this web-site is to provide examples and a technical reference for SPLITT. This inormation is organized in the following pages:
TraversalSpecification-class. I think that this is the ideal place to start.TraversalSpecification-class defined in the Writing a traversal specification to build a simple console appication reading from the standard input std::cin and writing to the stadard output std::cout.TraversalSpecification-class defined in the Writing a traversal specification to build an Rcpp-module.Examples of using the SPLITT library can be found in the following locations:
Mitov, Venelin, and Tanja Stadler. 2018. “A Practical Guide to Estimating the Heritability of Pathogen Traits.” Molecular Biology and Evolution, msx328. doi:10.1093/molbev/msx328.
Mitov, Venelin, Krzysztof Bartoszek, Georgios Asimomitis, and Tanja Stadler. 2018. “Fast Likelihood Evaluation for Multivariate Phylogenetic Comparative Methods: The PCMBase R Package.” Arxiv Q-Bio.PE, arXiv:1809.09014. https://arxiv.org/abs/1809.09014.