In this tutorial, I show step by step how to write a TraversalSpecification-class for calculating the log-likelihood of the PMM model for a given phylogenetic tree, trait data, and model parameters. This model is a popular tool to study the evolution of quantitative traits in the field of comparative biology. You don’t need to understand this model in details, but if you are interested you can read more about it in this article. The example C++ code discussed in this tutorial is contained in two C++ header files:

  • SPLITT.h - this the SPLITT library header file;
  • NumericTrait.h - this file defines the application-specific data-type for the example;
  • AbcPMM.h - this header file includes the above two ones and defines the AbcPMM TraversalSpecification-class.

Two preliminary notes

For the critical minds among you, I’d like to make two notes regarding the design of SPLITT.

A note on generic programming

SPLITT was designed with abstraction in mind. Application-specific data-types and node-traversal operations are treated as abstract entities by the SPLITT code. This is achieved by coding in a style known as “generic programming”. Generic programming makes it possible to apply the same code to different types of data, without the need to write code for detecting the type of an object. A common way to do generic programming in C++ is to use template classes and functions. I recommend reading this tutorial in case you are not familiar with templates in C++.

A note on virtual methods

Why doesn’t SPLITT use virtual methods for the node-traversal operations?

I have been asked this question mostly by programmers in Java or other languages who are not familiar with C++ templates. In one word, the answer is performance. The power of virtual methods is that, at runtime, the correct method definition is called even if the exact class containing this definition was not known during the compilation. This can be very useful, e.g. in cases where the logical workflow of the program depends on the user input during the execution. Technically, this is accomplished through runtime dynamic binding as opposed to the compile-time static binding used in templates. The drawback of dynamic binding is that it disables some tricks done by the compiler to make our code faster. Among these tricks are method inlining and vectorized operations (if the methods is called within a loop). This is why, in designing SPLITT, I preferred using static binding whenever possible and restricting/delay the use of virtual methods only for the situations where this is really needed. Still, it is always possible to have the node-traversal operations call other virtual methods to achieve this form of polymorphism.

Traversal specifications

To use SPLITT, we write a TraversalSpecification-class. Briefly, this is a C++ class implementing several members, such as typedefs and methods with fixed names and argument types (more details to come later on). Once defined this class is passed as a template argument to a TraversalTask class and a TraversalAlgorithm class. In this way, the C++ compiler is told to generate binary code for the methods of the TraversalTask and TraversalAlgorithm classes that is specific for our TraversalSpecification.

The PMM log-likelihood example

This document is a step-by-step guide based on the example of the PMM log-likelihood calculation described in this article. You don’t need to understand the term PMM log-likelihood to read the sections below. Rather, think of it as a mathematical function that is evaluated for a combination of model parameters, \(\mathbf{\Theta}=<g_{M},\sigma,\sigma_{e}>\), and for a given data consisting of a tree and numerical trait values assigned to its tips. The PMM log-likelihood is expressed as the following quadratic polynomial function of \(g_{M}\):

\[\ell\ell(\mathbf{\Theta})=a_{M}g_{M}^2+b_{M}g_{M}+c_{M}\]

We apply post-order traversal to calculate the coefficients \(a_{M}\), \(b_{M}\) and \(c_{M}\). This procedure is illustrated on Fig. 1 in the article.

What is inside a TraversalSpecification class?

A TraversalSpecification class is a C++ class that must have a number of members (i.e. typedefs and methods), in order to be applicable by SPLITT during a tree traversal procedure. Optionally, the class can inherit from the class SPLITT::TraversalSpecification. This, however is not required and not checked during compilation. The following C++ code shows how the class is defined for the PMM-example:

namespace PMMUsingSPLITT {
  // Calculates the quadratic polynomial coefficients a, b, c for the PMM log-likelihood.
  template<class Tree>
  class AbcPMM: public TraversalSpecification<Tree> {
  
  // Typedefs
  // Internal data-fields
  // Constructor
  // Methods
  
  };
} // END namespace PMMUsingSPLITT

The template-parameter <class Tree> allows to define AbcPMM classes for different Tree-types, e.g. for Trees with nodes of type int as well as for Trees with nodes of type std::string. SPLITT expects to find a number of members (typedefs and methods) in a TraversalSpecification class. We call the set of these members “the TraversalSpecification-interface”. In an object-oriented jargon, it would be more precise to call the class AbcPMM an implementation of the TraversalSpecification-interface or a TraversalSpecification-implementation. Because this is too long, below I use the term “the TraversalSpecification-class” whenever I mean TraversalSpecification-implementation, and I use SPLITT::TraversalSpecification whenever I mean the base-class.

Members inherited from the base-class (SPLITT::TraversalSpecification)

The base-class SPLITT::TraversalSpecification<Tree> defines the following entities:

  • a reference to a Tree object which has to be initialized during construction;
  • a constructor receiving a reference to a Tree-object, which is assigned to ref_tree_;
  • empty definitions of the three node-traversal operations (we will come to these later).
namespace PMMUsingSPLITT {

  // A recommended base-class for traversal-specifications
  template<class Tree> class TraversalSpecification {
  protected:
    // A reference to a Tree available for inheriting classes
    Tree const& ref_tree_;
    // A protected constructor that initializes the tree-reference. This constructor
    // must be called explicitly in the initalization list of inheriting class constructors.
    TraversalSpecification(Tree const& tree): ref_tree_(tree) {}
  public:  
    // Node-traversal operations - to be overwritten by inheriting classes
    void InitNode(uint i) {}
    void VisitNode(uint i) {}
    void PruneNode(uint i, uint i_parent) {}
  };
} // END namespace PMMUsingSPLITT

Members defined in the TraversalSpecification-class

Types

The TraversalSpecification-class has to define the following application-specific types:

MyType

This is a synonym for the TraversalSpecification-class’s own type. Defining this is optional but saves some typing of long class names later on.

namespace PMMUsingSPLITT {
  // Calculates the quadratic polynomial coefficients a, b, c for the PMM log-likelihood.
  template<class Tree>
  class AbcPMM: public TraversalSpecification<Tree> {
  public:
    // Typedefs
    typedef AbcPMM<Tree> MyType;
    //...
  };
} // END namespace PMMUsingSPLITT

BaseType

This is a synonym for the TraversalSpecification-class’s base-type. Also optional.

namespace PMMUsingSPLITT {
  // Calculates the quadratic polynomial coefficients a, b, c for the PMM log-likelihood.
  template<class Tree>
  class AbcPMM: public TraversalSpecification<Tree> {
  public:
    // Typedefs
    //...
    typedef TraversalSpecification<Tree> BaseType;
    //...
  };
} // END namespace PMMUsingSPLITT

TreeType

Usually, this should be the class SPLITT::OrderedTree with specified template parameters for the type of node and branch-length in the tree. Because we want AbcPMM to be generic with respect to the node-type (i.e. int or std::string are possible such types), we postpone the specification, leaving Tree as a template parameter:

namespace PMMUsingSPLITT {
  // Calculates the quadratic polynomial coefficients a, b, c for the PMM log-likelihood.
template<class Tree>
  class AbcPMM: public TraversalSpecification<Tree> {
  public:
    // Typedefs
    //...
    typedef Tree TreeType;
    //...
  };
} // END namespace PMMUsingSPLITT

ParameterType

This is the type of parameter passed before each traversal. In the case of AbcPMM, these are the PMM parameters \(\sigma^2\) and \(\sigma_{e}^2\). Since both of these parameters are real numbers, we specify that they are passed as a std::vector<double>. Below, vec is an alias for this type available in the SPLITT namespace:

namespace PMMUsingSPLITT {
  // Calculates the quadratic polynomial coefficients a, b, c for the PMM log-likelihood.
  template<class Tree>
  class AbcPMM: public TraversalSpecification<Tree> {
  public:
    // Typedefs
    //...
    typedef vec ParameterType; 
    //...
  };
} // END namespace PMMUsingSPLITT

DataType

This is the type of data associated with the tree object. While the parameters can be updated before each run of a tree traversal, the data is specified once at the time of constructing the TraversalSpecification object. In the case of the AbcPMM class the data represents a numeric trait measurement for each tip in the tree. Hence, it is an association map between tip-names and double values. We define this relationship in a separate class:

namespace PMMUsingSPLITT {
  // Calculates the quadratic polynomial coefficients a, b, c for the PMM log-likelihood.
  template<class NameType>
  struct NumericTraitData {
    // use const references to avoid copying of long vectors
    
    // The template parameter NameType should match the NodeType in the tree.
    std::vector<NameType> const& names_;
    vec const& x_;
    NumericTraitData(
      std::vector<NameType> const& names,
      vec const& x): names_(names), x_(x) {}
  };
} // END namespace PMMUsingSPLITT

Then, we define this class as the DataType in AbcPMM as follows:

namespace PMMUsingSPLITT {
  // Calculates the quadratic polynomial coefficients a, b, c for the PMM log-likelihood.
  template<class Tree>
  class AbcPMM: public TraversalSpecification<Tree> {
  public:
    // Typedefs
    //...
    typedef NumericTraitData<typename TreeType::NodeType> DataType;
    //...
  };
} // END namespace PMMUsingSPLITT

StateType

This is the type of the state calculated for each node during the traversal. In the case of AbcPMM, these are the polynomial coefficients \(a_{i}\), \(b_{i}\) and \(c_{i}\) for a node \(i\) in the tree. SPLITT uses this type as a return type for the TraverseTree() method in TraversalAlgorithm classes. Again we use SPLITT::vec. Note though, that for the implementation of the node-traversal operations in the AbcPMM-class it will be more convenient to store the states in separate vectors a, b and c (more on that later).

namespace PMMUsingSPLITT {
  // Calculates the quadratic polynomial coefficients a, b, c for the PMM log-likelihood.
  template<class Tree>
  class AbcPMM: public TraversalSpecification<Tree> {
  public:
    // Typedefs
    //...
    typedef vec StateType; 
    //...
  };
} // END namespace PMMUsingSPLITT

AlgorithmType

This is the type of the travefsal algorithm we want - either PostOrderTraversal or PreOrderTraversal:

namespace PMMUsingSPLITT {
  // Calculates the quadratic polynomial coefficients a, b, c for the PMM log-likelihood.
  template<class Tree>
  class AbcPMM: public TraversalSpecification<Tree> {
  public:
    // Typedefs
    //...
    typedef PostOrderTraversal<MyType> AlgorithmType;
    //...
  };
} // END namespace PMMUsingSPLITT

Internal data fields

SPLITT does not impose any requirement on the internal data fields in the TraversalSpecification-class. These are all application-specific variables needed for the execution of the node-traversal operations. In the case of AbcPMM, these internal fields are defined as follows:

namespace PMMUsingSPLITT {
  // Calculates the quadratic polynomial coefficients a, b, c for the PMM log-likelihood.
  template<class Tree>
  class AbcPMM: public TraversalSpecification<Tree> {
  public:
    // Typedefs
    // Internal data-fields
    
    // Model parameters set before each traversal
    double sigmae2, sigma2;
    
    // Trait data copied from the input_data object during construction
    vec x;
    
    // Internal vectors used for storing the node-states, i.e. the coefficients 
    // a[i], b[i], c[i] for each node in the tree:
    vec a, b, c;
    //...
  };
} // END namespace PMMUsingSPLITT

Constructor

SPLITT requires the TraversalSpecification-class to have a constructor taking two arguments:

  • a const reference to a TreeType object;
  • a const reference to a DataType object;

Here is how the constructor is defined for the AbcPMM class:

namespace PMMUsingSPLITT {
  // Calculates the quadratic polynomial coefficients a, b, c for the PMM log-likelihood.
  template<class Tree>
  class AbcPMM: public TraversalSpecification<Tree> {
  public:
    // Typedefs
    // Internal data-fields
    // Constructor
    AbcPMM(TreeType const& tree, DataType const& input_data):
      BaseType(tree) {
      
      // Check that the number of data-points matches the number of tips in the tree.
      if(input_data.x_.size() != this->ref_tree_.num_tips()) {
        std::ostringstream oss;
        oss<<"The vector x must be the same length as the number of tips ("<<
          this->ref_tree_.num_tips()<<"), but were"<<input_data.x_.size()<<".";
        throw std::invalid_argument(oss.str());
      } else {
        
        // IMPORTANT: reorder the data points according to the order in the tree object. 
        uvec ordNodes = this->ref_tree_.OrderNodes(input_data.names_);
        this->x = At(input_data.x_, ordNodes);
        
        // a[i] corresponds to the
        this->a = vec(this->ref_tree_.num_nodes());
        this->b = vec(this->ref_tree_.num_nodes());
        this->c = vec(this->ref_tree_.num_nodes());
      }
    }
    //...
  };
} // END namespace PMMUsingSPLITT

Methods

SPLITT expect the following methods to be defined in a TraversalSpecification-class.

void SetParameter(ParameterType const& par)

This method is called before each traversal to set the values of the parameters. For the AbcPMM class, only \(\sigma^2\) and \(\sigma_{e}^2\) are needed to calculate the coefficients \(a_{M}\), \(b_{M}\) and \(c_{M}\). Hence we define the method as follows:

namespace PMMUsingSPLITT {
  // Calculates the quadratic polynomial coefficients a, b, c for the PMM log-likelihood.
  template<class Tree>
  class AbcPMM: public TraversalSpecification<Tree> {
  public:
    // Typedefs
    // Internal data-fields
    // Constructor
    // Methods
    void SetParameter(ParameterType const& par) {
      if(par.size() != 2) {
        throw std::invalid_argument(
            "The par vector should be of length 2 with \
        elements corresponding to sigma2 and sigmae2.");
      }
      if(par[0] <= 0 || par[1] <= 0) {
        throw std::logic_error("The parameters sigma2 and sigmae2 should be positive.");
      }
      this->sigma2 = par[0];
      this->sigmae2 = par[1];
    }
    //...
  };
} // END namespace PMMUsingSPLITT

void InitNode(uint i)

This method is called on each node in the tree right after SetParameter and before any of the VisitNode and PruneNode methods has been called. There is no predefined order of the calls to InitNode and they may be executed in parallel for all nodes in the tree. Therefore, only node-specific data initialization, including the length of the branch leading to node i, can take place in this method. For the AbcPMM class, this is defined as follows (see also Fig. 1c in the article):

namespace PMMUsingSPLITT {
  // Calculates the quadratic polynomial coefficients a, b, c for the PMM log-likelihood.
  template<class Tree>
  class AbcPMM: public TraversalSpecification<Tree> {
  public:
    // Typedefs
    // Internal data-fields
    // Constructor
    // Methods
    //...
    inline void InitNode(uint i) {
      
      if(i < this->ref_tree_.num_tips()) {
        a[i] = -0.5 / sigmae2;  
        b[i] = x[i] / sigmae2;
        c[i] = -0.5 * (x[i]*b[i] + log(2*G_PI*sigmae2));
      } else {
        a[i] = b[i] = c[i] = 0;
      }
    }
    //...
  };
} // END namespace PMMUsingSPLITT

void VisitNode(uint i)

This method is called on each tip or internal node (EXCLUDING THE ROOT) in the tree after PruneNode has been called on each descendant of i (if in a post-order traversal) or after VisitNode has been called on the parent of i (if in a pre-order traversal). In a post-order traversal, the method is the perfect place to calculate the state of node i using the previously calculated states of its descendants. This method SHOULD NOT BE USED FOR ALTERING THE STATE of i’s parent, because this could enter in conflict with a concurrent execution of VisitNode on a sibling of i (see also PruneNode below). For the AbcPMM class, this is defined as follows:

namespace PMMUsingSPLITT {
  // Calculates the quadratic polynomial coefficients a, b, c for the PMM log-likelihood.
  template<class Tree>
  class AbcPMM: public TraversalSpecification<Tree> {
  public:
    // Typedefs
    // Internal data-fields
    // Constructor
    // Methods
    //...
    inline void VisitNode(uint i) {
      double t = this->ref_tree_.LengthOfBranch(i);
      double d = 1 - 2*a[i]*sigma2*t;
      
      // the order is important here because for c[i] we use the previous values 
      // of a[i] and b[i].
      c[i] = c[i] - 0.5*log(d) + 0.5*b[i]*b[i]*sigma2*t/d;
      a[i] /= d;
      b[i] /= d;
    }
    //...
  };
} // END namespace PMMUsingSPLITT

void PruneNode(uint i, uint j) (post-order traversal only)

This method is called on each tip or internal node (EXCLUDING THE ROOT) after VisitNode(i) and in sync with PruneNode(k, i_parent), for any sibling k of i. Thus, it is safe to use PruneNode to update the state of i_parent. For the AbcPMM class, this is defined as follows:

namespace PMMUsingSPLITT {
  // Calculates the quadratic polynomial coefficients a, b, c for the PMM log-likelihood.
  template<class Tree>
  class AbcPMM: public TraversalSpecification<Tree> {
  public:
    // Typedefs
    // Internal data-fields
    // Constructor
    // Methods
    //...
    inline void PruneNode(uint i, uint j) {
      a[j] = a[j] + a[i];
      b[j] = b[j] + b[i];
      c[j] = c[j] + c[i];
    }
    //...
  };
} // END namespace PMMUsingSPLITT

StateType StateAtRoot() const (post-order traversal only)

This method is called after PruneNode has been called on each direct descendant of the root node, that is after the post-order traversal has been finished. If necessary, VisitNode(i_root) can be called here. For the AbcPMM class, this is defined as follows:

namespace PMMUsingSPLITT {
  // Calculates the quadratic polynomial coefficients a, b, c for the PMM log-likelihood.
  template<class Tree>
  class AbcPMM: public TraversalSpecification<Tree> {
  public:
    // Typedefs
    // Internal data-fields
    // Constructor
    // Methods
    //...
    StateType StateAtRoot() const {
      vec res(3);
      res[0] = a[this->ref_tree_.num_nodes() - 1];
      res[1] = b[this->ref_tree_.num_nodes() - 1];
      res[2] = c[this->ref_tree_.num_nodes() - 1];
      return res;
    };
  };
} // END namespace PMMUsingSPLITT

Next steps

Now, it may be useful to have a look at a class diagram of the SPLITT library. This would provide an overview of all classes, methods and typedefs used by SPLITT. You can such a diagram here.

Once having enjoyed this birth’s view on the library, the next step would be to see how the TraversalSpecification specification can be applied during a traversal on some example input data. This is described in Running a traversal.