Here we show how to use patherit to estimate the broad-sense heritability, \(H^2\), of a numeric pathogen trait. \(H^2\) is defined as the proportion of the trait variance attributable to genotypic variation in a population of hosts (Mitov and Stadler 2016): \[\begin{equation} H^2=\frac{\text{Var}(g)}{\text{Var}(z)} \end{equation}\]

Thus, the heritability is a number in the interval [0,1]. The closer \(H^2\) is to 1, the higher is the importance of the pathogen genotype versus the host immune system in the formation of the trait. The input data represents a transmission tree connecting a set of infected patients and the patient’s trait-values, measured at the moment of diagnosis. The patherit package evaluates phylogenetic comparative estimators, i.e. \(H_{\bar{t}}^2\), \(H_{\infty}^2\) and \(H_e^2\) from the PMM and POUMM methods, and pair-correlation estimators, i.e. \(r_A\) from the ANOVA-CPP method (Mitov and Stadler 2016). The main reason for using different estimators is the need to gain certainty in the estimates and to eliminate potential sources of bias, (e.g. within-host evolution and wrong model assumptions), which can obfuscate the true value of \(H^2\) (Mitov and Stadler 2016).

For the purpose of this tutorial, we will use the toy-model epidemic simulated in the User guide of the toyepidemic package.

load("epid1.RData")

Extracting phylogenetic and phenotypic data from the simulated epidemic

The code-snippet below extracts the exact transmission tree connecting a set of diagnosed patients during a simulated epidemic outbreak.

tree <- toyepidemic::extractTree(epidemic)

# a data.table containing the host-type (env), the infecting strain (gene) and the
# host specific effect (e) for each infected patient. 
pop <- toyepidemic::extractPop(epidemic, ids=tree$tip.label)

# use the function calcValue from the toyepidemic package to extract the trait
# value for each sampled patient.
pop[, z:=toyepidemic::calcValue(env, gene, e, GEValues=epidemic$GEValues)]

# arrange the trait values in order of the patient-labels in the tree and store
# them in an external vector.
z <- pop[list(tree$tip.label), z]

Visualizing the data

plot(ape::ladderize(tree), show.tip.label = FALSE, type="fan", no.margin = TRUE)
colorFunc <- colorRampPalette(c("blue", "red"))
colors <- colorFunc(length(tree$tip.label))
ape::tiplabels(NULL, tip=1:length(tree$tip.label), pch=20, col=colors[order(z)], cex=0.5)

Qualitatively, the heritability of the trait manifests itself in the similar color for closely related patients on the transmission tree. This is because closely related patients such as closest phylogenetic pairs (CPPs) tend to carry identical pathogen strains.

patherit::boxplotTraitAlongTree(z, tree, nGroups = 8, boxwex=6)

The absence of a noticeable trend and the relative symmetry of the box-whiskers around the medians suggest that the trait distribution is approximately normal and stationary throughout the epidemic. Thus, we should expect that the POUMM estimates \(H_{\bar{t}}^2\) and \(H_{\infty}^2\) should be approximately equal.

Estimating the heritability using ANOVA-CPP, POUMM and the PMM methods.

# start a parallel cluster for parallel execution of MCMC chains in the POUMM
# and PMM fits (explained in the vignette for the POUMM package).
cluster <- parallel::makeCluster(parallel::detectCores(logical = FALSE))
doParallel::registerDoParallel(cluster)

H2Analysis <- patherit::estimateH2(z, tree, 
                                   methods = list(
                                     PP = list(bootstraps=100, verbose=FALSE), 
                                     POUMM = list(nSamplesMCMC = 5e5, 
                                                  verbose = FALSE), 
                                     PMM = list(nSamplesMCMC = 5e5, 
                                                verbose = FALSE)))

Summarizing the estimates:

summary(H2Analysis)[stat%in%c("rA", "H2tMean", "H2e", "H2tInf")]
##     method tauQuantile    N   K    stat  MLE PostMean CI.lower CI.upper tauMean filter
##  1:     PP    V[0%,5%]   28  14      rA 0.67       NA     0.26     0.88     3.3    all
##  2:     PP   D[0%,10%]   56  28      rA 0.55       NA     0.24     0.76     4.4    all
##  3:     PP   O[0%,12%]   70  35      rA 0.55       NA     0.27     0.74     4.9    all
##  4:     PP  qu[0%,20%]  112  56      rA 0.63       NA     0.45     0.77     6.1    all
##  5:     PP   Q[0%,25%]  138  69      rA 0.57       NA     0.39     0.71     6.7    all
##  6:     PP   M[0%,50%]  278 139      rA 0.54       NA     0.41     0.65    10.7    all
##  7:     PP  A[0%,100%]  552 276      rA 0.43       NA     0.33     0.52    19.5    all
##  8:  POUMM          NA 1000  NA H2tMean 0.62     0.62     0.49     0.74      NA     NA
##  9:  POUMM          NA 1000  NA     H2e 0.61     0.60     0.47     0.73      NA     NA
## 10:  POUMM          NA 1000  NA  H2tInf 0.62     0.62     0.49     0.74      NA     NA
## 11:    PMM          NA 1000  NA H2tMean 0.70     0.69     0.56     0.81      NA     NA
## 12:    PMM          NA 1000  NA     H2e 0.35     0.34     0.25     0.44      NA     NA
## 13:    PMM          NA 1000  NA  H2tInf 1.00     1.00     1.00     1.00      NA     NA

Generating a correlation profile

It is useful to observe the correlation between phylogenetic pairs as a function of their phylogenetic distance. In the absence of convergent evolution towards the same pathogen, the sequence similarity between transmission partners is supposed to decrease with the time since the transmission events. This is reflected by a decay in the trait correlaiton between transmission partners and phylogenetic pairs, respectively. The correlation profile represents a suite of estimates of the intra-class correlation (\(r_A\)) in a stratification of the phylogenetic patients by their phylogenetic distance. The patherit package allows to produce such a profile for the original data, and for data simulated under the ML fits of the POUMM and the PMM. This is done with the patherit::corrProfile() function:

corrPr <- patherit::corrProfile(H2Analysis)

Then, we use the function patherit::plotCorrProfile() to plot the correlation values in different stratifications (medians, quartiles, quantiles, deciles, etc…):

patherit::plotCorrProfile(H2Analysis, corrPr)

Notice that the PMM estimates remain high for all phylogenetic distances, while the true data and the simulations under the POUMM ML fit are showing a decline in the correlation between phylogenetic pairs with phylogenetic distance. This shows that the PMM cannot model the decay in correlaiton (Mitov and Stadler 2016).

Comparing to the true value

For the toy-model simulation, we know the true value of \(H^2\). This is given by the direct estimator - the coefficient of determination, \(R_{adj}^2\), calculated over a grouping of the host population by identical infecting strain:

R2adj(data=pop, activeOnly = TRUE)
## [1] 0.55

We see that the true value falls well within the 95% CI of all estimators except, the estimators \(H_e^2\) from the PMM method. This is due to the fact that the PMM has over-estimated the non-heritable variance \(\sigma_e^2\) - another manifestation of the PMM’s inability to model the decay of correlation between phylogenetic pairs.

Randomization test

The randomization test consists in checking that the estimated heritability is insignificant if the trait values have been shuffled randomly between the sampled patients:

pop[, z2:=sample(z)]
zR <- pop[, z2]
H2AnalysisR <- estimateH2(zR, tree, 
                         methods = list(
                           PP = list(bootstraps=100, verbose=FALSE), 
                           POUMM = list(nSamplesMCMC = 5e5, 
                                        verbose = FALSE), 
                           PMM = list(nSamplesMCMC = 5e5, 
                              verbose = FALSE)))

parallel::stopCluster(cluster)

Summarizing the estimates:

summary(H2AnalysisR)[stat%in%c("rA", "H2tMean", "H2e", "H2tInf")]
##     method tauQuantile    N   K    stat     MLE PostMean  CI.lower CI.upper tauMean filter
##  1:     PP    V[0%,5%]   28  14      rA  0.3049       NA -0.232088    0.705     3.3    all
##  2:     PP   D[0%,10%]   56  28      rA -0.0180       NA -0.378504    0.350     4.4    all
##  3:     PP   O[0%,12%]   70  35      rA -0.0559       NA -0.375187    0.277     4.9    all
##  4:     PP  qu[0%,20%]  112  56      rA -0.0227       NA -0.280461    0.239     6.1    all
##  5:     PP   Q[0%,25%]  138  69      rA -0.0872       NA -0.314903    0.150     6.7    all
##  6:     PP   M[0%,50%]  278 139      rA  0.0106       NA -0.155350    0.176    10.7    all
##  7:     PP  A[0%,100%]  552 276      rA  0.0283       NA -0.089757    0.146    19.5    all
##  8:  POUMM          NA 1000  NA H2tMean  0.0000   0.0658  0.000065    0.212      NA     NA
##  9:  POUMM          NA 1000  NA     H2e  0.0019   0.0597 -0.075000    0.231      NA     NA
## 10:  POUMM          NA 1000  NA  H2tInf  0.0000   0.0664  0.000065    0.212      NA     NA
## 11:    PMM          NA 1000  NA H2tMean  0.0000   0.0167  0.000020    0.054      NA     NA
## 12:    PMM          NA 1000  NA     H2e  0.0010   0.0051 -0.077709    0.094      NA     NA
## 13:    PMM          NA 1000  NA  H2tInf     NaN   1.0000  1.000000    1.000      NA     NA

Correlation profile

corrPrR <- corrProfile(H2AnalysisR)
plotCorrProfile(H2AnalysisR, corrPrR)

Comparing to the true value in the randomized data

pop[, z:=z2]

# should be insignificant
R2adj(data=pop, activeOnly = TRUE)
## [1] -0.0017

Packages used

Apart from base R functionality, the patherit package uses a number of 3rd party R-packages:

  • For tree processing: ape v5.0 (Paradis et al. 2016);
  • For reporting: data.table v1.10.4.3 (Dowle and Srinivasan 2016);
  • For the POUMM and PP estimates: POUMM v2.0.0 (Mitov and Stadler 2017), boot v1.3.18 (Canty and Ripley 2016);
  • For testing: testthat v1.0.2 (Wickham 2016).

References

Canty, Angelo, and Brian Ripley. 2016. Boot: Bootstrap R (S-Plus) Functions. R-Package. https://cran.r-project.org/web/packages/boot/index.html.

Dowle, Matt, and Arun Srinivasan. 2016. Data.table: Extension of ‘Data.frame‘. https://CRAN.R-project.org/package=data.table.

Mitov, Venelin, and Tanja Stadler. 2016. “The heritability of pathogen traits - definitions and estimators.” Unpublished Data https://www.biorxiv.org/content/early/2016/06/12/058503.

———. 2017. Fast and Robust Inference of Phylogenetic Ornstein-Uhlenbeck Models Using Parallel Likelihood Calculation. BiorXiv. http://biorxiv.org/content/early/2017/05/30/115089.

Paradis, Emmanuel, Simon Blomberg, Ben Bolker, Julien Claude, Hoa Sien Cuong, Richard Desper, Gilles Didier, et al. 2016. Ape: Analyses of Phylogenetics and Evolution. https://CRAN.R-project.org/package=ape.

Wickham, Hadley. 2016. Testthat: Unit Testing for R. https://CRAN.R-project.org/package=testthat.