Search for an optimal mixed Gaussian phylogenetic model, given a tree, trait measurements at its tips and a score function.

A mixed Gaussian phylogenetic model (MGPM) represents a Gaussian phylogenetic model with shifts in the underlying parameters and, optionally, type of Gaussian stochastic process (e.g. shifts from a BM to an OU model of evolution). The function PCMFitMixed implements a recursive clade partition (RCP) search for an approximate information score optimization. A deteailed description of the algorithm is provided in Appendix A in Mitov et al. 2019a. For this documentation, it is important to note that the algorithm proceeds in three steps as follows:

clade-fits: Model type fits to all clades in the tree bigger than a specified (see argument minCladeSize).
RCP: Recursive clade partition search.
Round-robin: Round-robin search for an optimal model type mapping in the best partitions found during the RCP-step. This step is optional.

PCMFitMixed(
  X,
  tree,
  modelTypes = MGPMDefaultModelTypes(),
  subModels = c(B = "A", C = "A", D = "B", E = "D", F = "E"),
  argsMixedGaussian = Args_MixedGaussian_MGPMDefaultModelTypes(),
  SE = matrix(0, nrow(X), PCMTreeNumTips(tree)),
  generatePCMModelsFun = PCMGenerateModelTypes,
  metaIFun = PCMInfo,
  positiveValueGuard = Inf,
  scoreFun = AIC,
  fitMappingsPrev = NULL,
  tableFitsPrev = fitMappingsPrev$tableFits,
  modelTypesInTableFitsPrev = NULL,
  listPartitions = NULL,
  minCladeSizes = 20L,
  skipNodes = character(),
  maxCladePartitionLevel = if (is.null(listPartitions)) 8L else 1L,
  maxNumNodesPerCladePartition = 1L,
  listAllowedModelTypesIndices = c("best-clade-2", "best-clade", "all"),
  argsConfigOptim1 = DefaultArgsConfigOptim(numCallsOptim = 10),
  argsConfigOptim2 = DefaultArgsConfigOptim(numCallsOptim = 4),
  argsConfigOptim3 = DefaultArgsConfigOptim(numCallsOptim = 10),
  maxNumRoundRobins = 0,
  maxNumPartitionsInRoundRobins = 2,
  listPCMOptions = PCMOptions(),
  skipFitWhenFoundInTableFits = TRUE,
  doParallel = FALSE,
  prefixFiles = "fits_",
  saveTempWorkerResults = TRUE,
  printFitVectorsToConsole = FALSE,
  verbose = TRUE,
  debug = FALSE
)

Arguments

X	a `k x N` numerical matrix with possible `NA` and `NaN` entries. For `i=1,..., N`, the column `i` of X contains the measured trait values for species `i` (the tip with integer identifier equal to `i` in `tree`). Missing values can be either not-available (`NA`) or not existing (`NaN`). These two values are treated differently when calculating likelihoods (see `PCMPresentCoordinates`).
tree	a phylo object with N tips.
modelTypes	a named character string vector. Each such string denotes a MGPM model class. The default setting is `MGPMDefaultModelTypes()`, which corresponds to the model types `A,B,C,D,E,F` as defined in Mitov et al. 2019a: A. BM (H = 0, diagonal $\Sigma$): BM, uncorrelated traits. B. BM (H = 0, symmetric $\Sigma$): BM, correlated traits. C. OU (diagonal H, diagonal $\Sigma$): OU, uncorrelated traits. D. OU (diagonal H, symmetric $\Sigma$): OU, correlated traits, but simple (diagonal) selection strength matrix. E. OU (symmetric H, symmetric $\Sigma$): An OU with nondiagonal symmetric H and nondiagonal symmetric $\Sigma$. F. OU (asymmetric H, symmetric $\Sigma$): An OU with nondiagonal asymmetric H and nondiagonal symmetric $\Sigma$. See also the argument `subModels` and `MGPMDefaultModelTypes`.
subModels	a named character string vector with names and elements among the names of `modelTypes`. This argument specifies a nesting relationship between the model types: a model A is a sub-model of a model B if A's parameters are a subset of B's parameters. During the RCP search, the parameters of a sub-model fit to a clade in the tree are used as a hint for the parameters of its encompassing models. The default setting is `c(B = 'A', C = 'A', D = 'B', E = 'D', F = 'E')`. Note that this setting is valid only for the default setting of the argument `modelTypes`. Setting this argument to `NULL` will turn off the use of nested models as hints.
argsMixedGaussian	a list of arguments passed to `MixedGaussian`. Default: `Args_MixedGaussian_MGPMDefaultModelTypes()`.
SE	a k x N matrix specifying the standard error for each measurement in X. Alternatively, a k x k x N cube specifying an upper triangular k x k factor of the variance covariance matrix for the measurement error for each tip `i=1, ..., N`. When `SE` is a matrix, the k x k measurement error variance matrix for a tip `i` is calculated as `VE[, , i] <- diag(SE[, i] * SE[, i], nrow = k)`. When `SE` is a cube, the way how the measurement variance matrix for a tip `i` is calculated depends on the runtime option `PCMBase.Transpose.Sigma_x` as follows: if `getOption("PCMBase.Transpose.Sigma_x", FALSE) == FALSE` (default): VE[, , i] <- SE[, , i] %% t(SE[, , i]) if `getOption("PCMBase.Transpose.Sigma_x", FALSE) == TRUE`: VE[, , i] <- t(SE[, , i]) %% SE[, , i] Note that the above behavior is consistent with the treatment of the model parameters `Sigma_x`, `Sigmae_x` and `Sigmaj_x`, which are also specified as upper triangular factors. Default: `matrix(0.0, PCMNumTraits(model), PCMTreeNumTips(tree))`.
generatePCMModelsFun	A function generating PCM model types. Specifying this function is only needed if the inference is to be done using #' custom candidate model types. Default setting: `PCMGenerateModelTypes`. Note that this function is called in each worker process and populates the global environment with `PCMParentClasses` and `PCMSpecify` S3 methods. See also `PCMGenerateModelTypes`.
metaIFun	a metaI function needed to generate meta- and cache- objects for the model objects generated during the RCP search. By default this is set to `PCMInfo` from the package PCMBase. For faster execution, it is recommended to use the function `PCMInfoCpp` from the package PCMBaseCpp.
positiveValueGuard	a real number (not necessarily positive) used during the fit as a threshold for highly positive but likely incorrect log-likelihood values. This argument is set to `Inf` by default and should be used only as a last escape. The recommended way to of preventing such numerical errors is by setting more stringent values for the runtime options `PCMBase.Threshold.EV` and `PCMBase.Threshold.SV` (see `PCMOptions`).
scoreFun	a information score function such as AIC or BIC. Default: `AIC`.
fitMappingsPrev	an object of S3 class PCMFitModelMappings, returned during a previous call to `PCMFitMixed`. Default: `NULL`. This can be used to enrich a previous MGPM inference with new model types. Currently, this feature is experimental. See also arguments `tableFitsPrev` and `modelTypesInTableFitsPrev`.
tableFitsPrev	a `data.table` object containing stored MGPM fits from a previous (possibly unfinished) call to `PCMFitMixed`. During a run `PCMFitMixed` stores all visited candidate MGPM models in a file named 'Current_X.RData' where X is the value of the argument `prefixFiles`. This file contains a list object of class 'PCMFitModelMappings' named `'listResults'`. One of the elements in `'listResults'` is a data.table `'tableFits'` that can be passed as a value of this argument. Default: `fitMappingsPrev$tableFits`. If a candidate model is found in tableFitsPrev, the previously inferred parameters will be reused unless the argument `skipFitWhenFoundInTableFits` is set to `FALSE`. See also the argument `skipFitWhenFoundInTableFits`.
modelTypesInTableFitsPrev	model types used for the fits in `'tableFitsPrev'`. Default: `NULL`, which means that the same model types are used as specified in the argument `modelTypes`.
listPartitions, minCladeSizes, skipNodes	arguments controlling the search for an optimal shift point configuration. If not `NULL`, `listPartitions` should be specified as a list of integer vectors with each such vector containing node indices in `tree` corresponding to a shift-point configuration (partition). If `listPartitions` is `NULL` (default) then the shift-point configuration are generated according to the RCP search algorithm and the arguments `minCladeSizes` and `skipNodes`. The argument `minCladeSizes` is an integer or a vector of integers - if a single integer it specifies the the minimum number of tips allowed in a part at al recursion levels of the RCP algorithm; if a vector, then each element specifies the minimal part size at each level of the algorithm (see also argument `maxCladePartitionLevel`). By default this is set to `20`. The argument `skipNodes` is used in combination with `minCladeSizes`. This is an integer or character vector indicating the ids or labels of nodes that should not be used as partition nodes. By default, this is an empty character vector. Default setting for the three arguments: `listPartitions = NULL, minCladeSizes = 20L, skipNodes = character()`. See also `maxCladePartitionLevel`, `maxNumNodesPerCladePartition` and `listAllowedModelTypesIndices`.
maxCladePartitionLevel	An integer limiting the number of recursive levels of the RCP search algorithm. Default setting: `if(is.null(listPartitions)) 8L else 1L`.
maxNumNodesPerCladePartition	An integer controlling the number of partition nodes in subtree-partitions during the RCP search. By default, this is set to `1`, meaning that only single shifts within a parent part will be searched. Setting this argument to a bigger value will make the RCP search algorithm 'less greedy' but will cause a significant slow-down in the search - many more partitions will be visited during the search.
listAllowedModelTypesIndices	Default setting: `c("best-clade-2", "best-clade", "all")`.
argsConfigOptim1, argsConfigOptim2, argsConfigOptim3	Arguments controlling the calls to `PCMFit` during the three steps of the RCP algorith. `argsConfigOptim1 = DefaultArgsConfigOptim(numCallsOptim = 10), argsConfigOptim2 = DefaultArgsConfigOptim(numCallsOptim = 4), argsConfigOptim3 = DefaultArgsConfigOptim(numCallsOptim = 10)`.
maxNumRoundRobins	number of round-robin iterations. By default this is set ot 0, i.e. no round-robin step is performed.
maxNumPartitionsInRoundRobins	maximum number (2 by default) of top partitions tobe included in the round-robin step. This argument only has an impact if `maxNumRoundRobins > 0`.
listPCMOptions	a list of PCM runtime options to be specified prior to running the fit. By default, this is set to `PCMOptions()`.
skipFitWhenFoundInTableFits	logical. Default: `TRUE`.
doParallel	logical indicating if the recursive clade partition search should be executed in parallel. Default: `FALSE`.
prefixFiles	a character string used to name a 'Current_X.RData' file stored during the inference, where X is to be replaced by the value of `prefixFiles`. Note that the name 'prefixFiles' can be misleading, because the actual file-name does not start with it - this is due for compliance with existing code. Default: `"fits_"`.
saveTempWorkerResults, printFitVectorsToConsole, debug	Debugging options controlling log-messages and storing of temporary results. These are mostly for internal use. Default setting: `saveTempWorkerResults, = TRUE, printFitVectorsToConsole = FALSE, debug = FALSE`
verbose	logical indicating if information messages should be printed to the console while running. Default: `TRUE`.

Value

an object of S3 class PCMFitModelMappings.

References

[Mitov et al. 2019a] Mitov, V., Bartoszek, K., & Stadler, T. (2019). Automatic generation of evolutionary hypotheses using mixed Gaussian phylogenetic models. Proceedings of the National Academy of Sciences of the United States of America, 35, 201813823. http://doi.org/10.1073/pnas.1813823116

[Mitov et al. 2019b] Mitov, V., Bartoszek, K., Asimomitis, G., & Stadler, T. (2019). Fast likelihood calculation for multivariate Gaussian phylogenetic models with shifts. Theoretical Population Biology. http://doi.org/10.1016/j.tpb.2019.11.005

Search for an optimal mixed Gaussian phylogenetic model, given a tree, trait measurements at its tips and a score function.

Arguments

Value

References

See also

Contents