Phylogenetic Comparative Methods in the Era of Big Data

Mitov V. 2018.

Ph.D. thesis

Examiner: Stadler T.

Co-examiners: Alizon S., Regös R.

Summary

Phylogenetic comparative methods (PCMs) are used for studying the evolution of various biological species, ranging from micro-organisms to animals and plants. These methods are based on the computer-assisted comparison of phenotype and molecular sequence data in populations of living and/or extinct species or organisms. With the rise of genome sequencing, it has become possible to infer the phylogenetic trees of large populations, such as the entire mammal clade, counting nearly 4000 species, or the transmission trees from large epidemic outbreaks, counting thousands to hundreds of thousands of infections. This has encouraged the transfer of PCMs, originally developed for studying a few quantitative traits in a small phylogeny of living species, to data that is much larger and, sometimes, different in type.

In this thesis, I explore several difficulties encountered in the application of PCMs to the study of big, phylogenetically linked comparative data. These range from technical problems, such as the development of fast algorithms for phylogenetic model inference, to conceptual issues, such as the difference between an epidemic and a population of sexually reproducing organisms in estimating the heritability of a quantitative trait. My approach is a mixture of a top-down and a bottom-up strategy. At a high level, I start from poorly understood biological questions for which comparative data has been available, and I identify particular issues hindering the use of existing PCMs to analyse that data. Then, I develop a prototype solving these issues for the data in question. Finally, I consider the prototype in a broader perspective, searching for possibilities to apply the same solution to a more general class of problems without compromising the computational efficiency. This approach led to the development of several software tools, which, I hope, will prove useful in future studies.

The first chapter gives a general historical background and introduces the main concepts of PCMs. The rest of the thesis is divided into two parts. The part “Publications” (Chapters 2, 3 and 4) includes articles published during my doctoral studies. Chapter 2 introduces the field of phylogenetics and the software tools used for inferring phylogenetic trees based on molecular sequence data. Phylogenetic trees of that kind represent the main input for all methods described in the following chapters. In Chapter 3, I study the effects of within-host pathogen evolution on various estimators of the set-point viral load heritability in HIV patients. Based on simulations and real data from nearly ten thousand HIV patients, I show that neglecting or inaccurately accounting for within-host pathogen evolution has been the main cause of a long-standing discrepancy between different estimates of the set-point viral load heritability. Chapter 4 makes use of these results to estimate the heritability of two additional HIV traits: the CD4 cell decline and the per-parasite pathogenicity. The part “Manuscripts” (Chapters 5, 6 and 7) includes works which, at the time of submitting this thesis, are in revision or in preparation for submission to peer-reviewed journals. In Chapter 5, I develop generic algorithms for parallel traversal of phylogenetic trees, with “traversal” meaning the application of an abstract operation to all nodes in the tree, while respecting their hierarchical order. I implement these algorithms within the C++ library SPLITT, intended as a fast back-end for higher-level packages, such as generic PCM implementations. Chapter 6 describes one such tool, the R package PCMBase, which implements fast likelihood calculation for multi-trait Gaussian phylogenetic models. The poor efficiency of the likelihood calculation is the principal bottleneck in applying Gaussian phylogenetic models to big phylogenetic trees. PCMBase resolves this issue for a very large family of models and all types of phylogenetic trees, including non-ultrametric trees and polytomies. Taking advantage of PCMBase and SPLITT, in Chapter 7, I analyse the biggest published phylogeny of mammal species for which brain and body mass measurements are available. Based on this data, I show that present-day PCMs are unable to model the heterogeneity of the evolutionary process across different mammal clades. As a solution, I propose a new method for jointly inferring a set of different evolutionary models on different parts of the tree.
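
The notion of traversal used in Chapter 5 can be illustrated with a minimal Python sketch. It is not the SPLITT implementation or its API: the function names `postorder_batches` and `traverse`, the parent-array encoding of the tree and the tip-counting operation are all assumptions made for illustration. The sketch groups nodes into batches so that every node is visited only after all of its children, and the nodes within one batch do not depend on each other, which is what would make a parallel visit possible in principle (here the batches are simply processed sequentially).

```python
def postorder_batches(parent):
    """Group node indices into batches such that each node appears in a
    later batch than all of its children.
    parent[i] is the parent index of node i, or -1 for the root
    (hypothetical encoding chosen for this sketch)."""
    n = len(parent)
    pending = [0] * n  # number of children not yet visited, per node
    for p in parent:
        if p >= 0:
            pending[p] += 1
    batch = [i for i in range(n) if pending[i] == 0]  # start from the tips
    batches = []
    while batch:
        batches.append(batch)
        next_batch = []
        for i in batch:
            p = parent[i]
            if p >= 0:
                pending[p] -= 1
                if pending[p] == 0:  # all children of p are done
                    next_batch.append(p)
        batch = next_batch
    return batches


def traverse(parent, visit):
    """Apply visit(node, child_states) to every node, children first.
    The abstract operation `visit` reads only the states of the node's
    children and returns the node's own state."""
    n = len(parent)
    children = [[] for _ in range(n)]
    for i, p in enumerate(parent):
        if p >= 0:
            children[p].append(i)
    state = [None] * n
    for batch in postorder_batches(parent):
        for i in batch:  # nodes in one batch are independent of each other
            state[i] = visit(i, [state[c] for c in children[i]])
    return state


# Example tree: tips 0 and 1 under internal node 3; tip 2 and node 3 under root 4.
example_parent = [3, 3, 4, 4, -1]

# Example operation: count the tips descending from each node.
def count_tips(node, child_states):
    return 1 if not child_states else sum(child_states)

tip_counts = traverse(example_parent, count_tips)
```

Replacing `count_tips` with a different operation, for instance one that combines per-node likelihood terms, leaves the ordering logic unchanged; this separation of the operation from the traversal order is the design idea the sketch is meant to convey.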

Finally, in Chapter 8, I discuss the methods developed in this thesis and suggest directions for future research.