Before you start

Before you consider using the SPLITT C++ library, try to answer the following questions for yourself.

  • Do you use pre-order or post-order traversal of a tree-like data-structure?

    If you don’t understand what I mean, chances are that you simply do not need to deal with SPLITT at that point. The reason is that I wrote SPLITT to solve a very specific problem, which I’ve encountered many times during my doctoral studies: perform the same type of calculation for all nodes in a big tree, where each node is related to some specific data and the calculation for one node uses as input the result from the calculation performed on its daughter nodes. This is what I mean by post-order traversal. A pre-order traversal is the same with only one difference – replace the term daughter nodes with the term parent node in the sentence above.

  • Do you need the tree-traversal to be fast?

    This very much depends on how frequently do you need to perform the tree-traversal. If you only need to do it once or a couple of times with a given dataset, then I don’t see much need of speeding this up. In my specific use-case, I needed to run a tree traversal millions of times and this was by far the heaviest computational operation during the analysis of the biological data at my disposal. It would not be an exaggeration, if I say that, without SPLITT, my PhD would have needed several more decades of computation time on a modern high performance computing cluster. This was reduced to a few months, thanks to the fast C++ implementation, based on SPLITT.

  • How big are your trees and how complex is your node traversal operation?

    When I said big tree I meant a tree of more than 100 tips, sometimes, more than 10000 tips. This is not a hard rule of thumb, but parallel traversal will have a performance benefit mostly on a big tree. This depends on other factors as well, such as the computational and memory complexity of the node traversal operation. If the tree is relatively small and the operation is very simple (e.g. the addition or multiplication of a few double-precision numbers), then it is very likely that using SPLITT in parallel mode would not be faster than using it in serial mode. Yet, using SPLITT might still be a good idea in this case if you currently have an implementation in a slowly interpreted language, such as R, and wish to have a faster C++ version - in this case a serial SPLITT-based implementation is probably going to be between 10 and 200 times faster than the R-version.

The main steps in using SPLITT

Using the SPLITT library consists in three steps:

  • Writing a TraversalSpecification class. In this C++ class, you tell SPLITT what to do with each node in the tree during a traversal.
  • Creating a TraversalTask for your TraversalSpecification. The TraversalTask object lives in the computer memory during the execution of your program. It wraps three objects:
    • Your input tree and data;
    • An object of your TraversalSpecification-class;
    • An object of a TraversalAlgorithm class, such as a PreOrderTraversal object or PostOrderTraversal object.
  • Executing the TraversalTask. Once you have created a TraversalTask object you call its TraverfseTree() method once or (preferably) multiple times, each time passing specific arguments, hereby, called parameters. The rest is to use the value produced at the end of the traversal for your specific application goal.

Prerequisites

Here is what you need before you can start working with SPLITT:

  • A C++11 compiler. So far I’ve been able to compile a SPLITT-based program on OS X and Linux using the following compilers:

    • clang++ version 6.0.0 on OS X. To my knowledge, this is the first version of clang++ that has the omp.h header, so it is possible to compile code using OpenMP. I am not an expert on the topic of C++ compilers, so I don’t know how the clang++ compiler behaves on other systems. On my Mac OS X, I’ve installed this compiler as part of my R toolchain. I’ve found these instructions very helpful. In case the above link has become outdated, or you wish to use clang without R, maybe start by reading the llvm project home-page.
    • Previous clang++ versions on OS X. Before installing clang++ v6.0.0, I’ve been able to compile SPLITT-code with earlier versions but these were not supporting OpenMP, so I was able to run SPLITT only in serial mode.
    • Intel compiler. Previously, I’ve been able to compile SPLITT-based code with the Intel compiler (Mac OS X and Linux). This compiler is likely to generate the fastest binary code running on Intel processors. You can try downloading the Intel compiler, following the instructions on the Intel C++ compilers page.
  • Optional: If you compile with OpenMP enabled (compile flag -fopenmp), then you need a copy of the omp shared library on your link path. On an OS X system, I’ve used the following:

    • Before R 3.5.0 was released, I used the libiomp5.dylib included with the Intel C++ compiler;
    • After R 3.5.0 was released, I’ve switched to the libomp.dylib included in R. On my system, this was available in /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/ as well as in /Library/Frameworks/R.framework/Resources/lib.
  • Optional: Rcpp and RcppArmadillo packages in R. Installing these packages will be needed if you wish to use SPLITT as a back-end for an R-package of your own.

The SPLITT API

Being a software library, SPLITT does not provide a high-level end-user interface. Rather, SPLITT is used via its application programming interface (API). In other words, you use the library by putting your application specific code at several locations in your program where SPLITT would expect it to find it. The easiest way to do this is to start from an example and use it as a skeleton for your own code, i.e. replace the code in the example with your application specific data-types, and traversal operations.

What is in the SPLITT online documetation

The goal of this web-site is to provide examples and a technical reference for SPLITT. This inormation is organized in the following pages:

  • The Writing a traversal specification guide shows how to define the application specific data-type and node-traversal operations for a simple mathematical model of trait evolution called “PMM”. You will not need to understand the biological interpretation of the model. Rahter, this tutorial will teach you what are the data types and methods that need to be defined in a TraversalSpecification-class. I think that this is the ideal place to start.
  • The View from above guide overviews the classes in the SPLITT API using a UML class diagram. If you are familiar with UML class diagrams, and/or prefer a top-down approach to understanding a given software, you can read this guide first, then, read the Writing a traversal specification.
  • The Running a traversal guide shows how to create and run a SPLITT traversal task. This tutorial uses TraversalSpecification-class defined in the Writing a traversal specification to build a simple console appication reading from the standard input std::cin and writing to the stadard output std::cout.
  • The Calling SPLITT from an R-package guide shows how to use Rcpp to wrap SPLITT traversal functionality into an R-package. This tutorial uses the TraversalSpecification-class defined in the Writing a traversal specification to build an Rcpp-module.

Examples of using the SPLITT library can be found in the following locations:

  • PMMUsingSPLITT-console-app is the console application described in Running a traversal.
  • PMMUsingSPLITT is the R-package described in Calling SPLITT from an R-package.
  • ThreePointUsingSPLITT-R-package is an R-package implementation the 3-point structure algorithm for calculating likelihoods of Gaussian phylogenetic models. Specifically, the 3-point structure algorithm is a linear time algorithm for calculating the determinant, \(|V|\), and any quadratic product of the form \(a^{T} V^{-1} b\), where \(V\) is a \(N \times N\) covariance matrix satisfying a 3-point structure, and a is a column vector of size N. This algorithm has been introduced in the article “A linear-time algorithm for Gaussian and non-Gaussian trait evolution models” by Lam si Tung Ho and Cecile Ane published in Systematic Biology in 2014.
  • BianaryPoissonUsingSPLITT-R-package is an R-package implementing a simple binary trait substitution model with constant substitution rates.
  • POUMM - the POUMM R-package, developed originally for the analysis of HIV virulence heritability (Mitov and Stadler 2018).
  • PCMBaseCpp - the PCMBaseCpp R-package, enabling parallel likelihood calculation for multivariate phylogenetic comparative models (Mitov et al. 2018).

References

Mitov, Venelin, and Tanja Stadler. 2018. “A Practical Guide to Estimating the Heritability of Pathogen Traits.” Molecular Biology and Evolution, msx328. doi:10.1093/molbev/msx328.

Mitov, Venelin, Krzysztof Bartoszek, Georgios Asimomitis, and Tanja Stadler. 2018. “Fast Likelihood Evaluation for Multivariate Phylogenetic Comparative Methods: The PCMBase R Package.” Arxiv Q-Bio.PE, arXiv:1809.09014. https://arxiv.org/abs/1809.09014.