A mixed Gaussian phylogenetic model (MGPM) represents a Gaussian phylogenetic model with shifts in the underlying parameters and, optionally, type of Gaussian stochastic process (e.g. shifts from a BM to an OU model of evolution). The function PCMFitMixed implements a recursive clade partition (RCP) search for an approximate information score optimization. A deteailed description of the algorithm is provided in Appendix A in Mitov et al. 2019a. For this documentation, it is important to note that the algorithm proceeds in three steps as follows:

clade-fits

Model type fits to all clades in the tree bigger than a specified (see argument minCladeSize).

RCP

Recursive clade partition search.

Round-robin

Round-robin search for an optimal model type mapping in the best partitions found during the RCP-step. This step is optional.

PCMFitMixed(X, tree, modelTypes = MGPMDefaultModelTypes(),
  subModels = c(B = "A", C = "A", D = "B", E = "D", F = "E"),
  argsMixedGaussian = Args_MixedGaussian_MGPMDefaultModelTypes(),
  SE = matrix(0, nrow(X), PCMTreeNumTips(tree)),
  generatePCMModelsFun = PCMGenerateModelTypes, metaIFun = PCMInfo,
  positiveValueGuard = Inf, scoreFun = AIC, fitMappingsPrev = NULL,
  tableFitsPrev = fitMappingsPrev$tableFits,
  modelTypesInTableFitsPrev = NULL, listPartitions = NULL,
  minCladeSizes = 20L, skipNodes = character(),
  maxCladePartitionLevel = if (is.null(listPartitions)) 8L else 1L,
  maxNumNodesPerCladePartition = 1L,
  listAllowedModelTypesIndices = c("best-clade-2", "best-clade", "all"),
  argsConfigOptim1 = DefaultArgsConfigOptim(numCallsOptim = 10),
  argsConfigOptim2 = DefaultArgsConfigOptim(numCallsOptim = 4),
  argsConfigOptim3 = DefaultArgsConfigOptim(numCallsOptim = 10),
  maxNumRoundRobins = 0, maxNumPartitionsInRoundRobins = 2,
  listPCMOptions = PCMOptions(), skipFitWhenFoundInTableFits = TRUE,
  doParallel = FALSE, prefixFiles = "fits_",
  saveTempWorkerResults = TRUE, printFitVectorsToConsole = FALSE,
  verbose = TRUE, debug = FALSE)

Arguments

X

a k x N numerical matrix with possible NA and NaN entries. For i=1,..., N, the column i of X contains the measured trait values for species i (the tip with integer identifier equal to i in tree). Missing values can be either not-available (NA) or not existing (NaN). These two values are treated differently when calculating likelihoods (see PCMPresentCoordinates).

tree

a phylo object with N tips.

modelTypes

a named character string vector. Each such string denotes a MGPM model class. The default setting is MGPMDefaultModelTypes(), which corresponds to the model types A,B,C,D,E,F as defined in Mitov et al. 2019a:

A.

BM (H = 0, diagonal \(\Sigma\)): BM, uncorrelated traits.

B.

BM (H = 0, symmetric \(\Sigma\)): BM, correlated traits.

C.

OU (diagonal H, diagonal \(\Sigma\)): OU, uncorrelated traits.

D.

OU (diagonal H, symmetric \(\Sigma\)): OU, correlated traits, but simple (diagonal) selection strength matrix.

E.

OU (symmetric H, symmetric \(\Sigma\)): An OU with nondiagonal symmetric H and nondiagonal symmetric \(\Sigma\).

F.

OU (asymmetric H, symmetric \(\Sigma\)): An OU with nondiagonal asymmetric H and nondiagonal symmetric \(\Sigma\).

See also the argument subModels and MGPMDefaultModelTypes.

subModels

a named character string vector with names and elements among the names of modelTypes. This argument specifies a nesting relationship between the model types: a model A is a sub-model of a model B if A's parameters are a subset of B's parameters. During the RCP search, the parameters of a sub-model fit to a clade in the tree are used as a hint for the parameters of its encompassing models. The default setting is c(B = 'A', C = 'A', D = 'B', E = 'D', F = 'E'). Note that this setting is valid only for the default setting of the argument modelTypes. Setting this argument to NULL will turn off the use of nested models as hints.

argsMixedGaussian

a list of arguments passed to MixedGaussian. Default: Args_MixedGaussian_MGPMDefaultModelTypes().

SE

a k x N matrix specifying the standard error for each measurement in X. Alternatively, a k x k x N cube specifying an upper triangular k x k factor of the variance covariance matrix for the measurement error for each tip i=1, ..., N. When SE is a matrix, the k x k measurement error variance matrix for a tip i is calculated as VE[, , i] <- diag(SE[, i] * SE[, i], nrow = k). When SE is a cube, the way how the measurement variance matrix for a tip i is calculated depends on the runtime option PCMBase.Transpose.Sigma_x as follows:

if getOption("PCMBase.Transpose.Sigma_x", FALSE) == FALSE (default):

VE[, , i] <- SE[, , i] %*% t(SE[, , i])

if getOption("PCMBase.Transpose.Sigma_x", FALSE) == TRUE:

VE[, , i] <- t(SE[, , i]) %*% SE[, , i]

Note that the above behavior is consistent with the treatment of the model parameters Sigma_x, Sigmae_x and Sigmaj_x, which are also specified as upper triangular factors. Default: matrix(0.0, PCMNumTraits(model), PCMTreeNumTips(tree)).

generatePCMModelsFun

A function generating PCM model types. Specifying this function is only needed if the inference is to be done using #' custom candidate model types. Default setting: PCMGenerateModelTypes. Note that this function is called in each worker process and populates the global environment with PCMParentClasses and PCMSpecify S3 methods. See also PCMGenerateModelTypes.

metaIFun

a metaI function needed to generate meta- and cache- objects for the model objects generated during the RCP search. By default this is set to PCMInfo from the package PCMBase. For faster execution, it is recommended to use the function PCMInfoCpp from the package PCMBaseCpp.

positiveValueGuard

a real number (not necessarily positive) used during the fit as a threshold for highly positive but likely incorrect log-likelihood values. This argument is set to Inf by default and should be used only as a last escape. The recommended way to of preventing such numerical errors is by setting more stringent values for the runtime options PCMBase.Threshold.EV and PCMBase.Threshold.SV (see PCMOptions).

scoreFun

a information score function such as AIC or BIC. Default: AIC.

fitMappingsPrev

an object of S3 class PCMFitModelMappings, returned during a previous call to PCMFitMixed. Default: NULL. This can be used to enrich a previous MGPM inference with new model types. Currently, this feature is experimental. See also arguments tableFitsPrev and modelTypesInTableFitsPrev.

tableFitsPrev

a data.table object containing stored MGPM fits from a previous (possibly unfinished) call to PCMFitMixed. During a run PCMFitMixed stores all visited candidate MGPM models in a file named 'Current_X.RData' where X is the value of the argument prefixFiles. This file contains a list object of class 'PCMFitModelMappings' named 'listResults'. One of the elements in 'listResults' is a data.table 'tableFits' that can be passed as a value of this argument. Default: fitMappingsPrev$tableFits. If a candidate model is found in tableFitsPrev, the previously inferred parameters will be reused unless the argument skipFitWhenFoundInTableFits is set to FALSE. See also the argument skipFitWhenFoundInTableFits.

modelTypesInTableFitsPrev

model types used for the fits in 'tableFitsPrev'. Default: NULL, which means that the same model types are used as specified in the argument modelTypes.

listPartitions, minCladeSizes, skipNodes

arguments controlling the search for an optimal shift point configuration. If not NULL, listPartitions should be specified as a list of integer vectors with each such vector containing node indices in tree corresponding to a shift-point configuration (partition). If listPartitions is NULL (default) then the shift-point configuration are generated according to the RCP search algorithm and the arguments minCladeSizes and skipNodes. The argument minCladeSizes is an integer or a vector of integers - if a single integer it specifies the the minimum number of tips allowed in a part at al recursion levels of the RCP algorithm; if a vector, then each element specifies the minimal part size at each level of the algorithm (see also argument maxCladePartitionLevel). By default this is set to 20. The argument skipNodes is used in combination with minCladeSizes. This is an integer or character vector indicating the ids or labels of nodes that should not be used as partition nodes. By default, this is an empty character vector. Default setting for the three arguments: listPartitions = NULL, minCladeSizes = 20L, skipNodes = character(). See also maxCladePartitionLevel, maxNumNodesPerCladePartition and listAllowedModelTypesIndices.

maxCladePartitionLevel

An integer limiting the number of recursive levels of the RCP search algorithm. Default setting: if(is.null(listPartitions)) 8L else 1L.

maxNumNodesPerCladePartition

An integer controlling the number of partition nodes in subtree-partitions during the RCP search. By default, this is set to 1, meaning that only single shifts within a parent part will be searched. Setting this argument to a bigger value will make the RCP search algorithm 'less greedy' but will cause a significant slow-down in the search - many more partitions will be visited during the search.

listAllowedModelTypesIndices

Default setting: c("best-clade-2", "best-clade", "all").

argsConfigOptim1, argsConfigOptim2, argsConfigOptim3

Arguments controlling the calls to PCMFit during the three steps of the RCP algorith. argsConfigOptim1 = DefaultArgsConfigOptim(numCallsOptim = 10), argsConfigOptim2 = DefaultArgsConfigOptim(numCallsOptim = 4), argsConfigOptim3 = DefaultArgsConfigOptim(numCallsOptim = 10) .

maxNumRoundRobins

number of round-robin iterations. By default this is set ot 0, i.e. no round-robin step is performed.

maxNumPartitionsInRoundRobins

maximum number (2 by default) of top partitions tobe included in the round-robin step. This argument only has an impact if maxNumRoundRobins > 0.

listPCMOptions

a list of PCM runtime options to be specified prior to running the fit. By default, this is set to PCMOptions().

skipFitWhenFoundInTableFits

logical. Default: TRUE.

doParallel

logical indicating if the recursive clade partition search should be executed in parallel. Default: FALSE.

prefixFiles

a character string used to name a 'Current_X.RData' file stored during the inference, where X is to be replaced by the value of prefixFiles. Note that the name 'prefixFiles' can be misleading, because the actual file-name does not start with it - this is due for compliance with existing code. Default: "fits_".

saveTempWorkerResults, printFitVectorsToConsole, debug

Debugging options controlling log-messages and storing of temporary results. These are mostly for internal use. Default setting: saveTempWorkerResults, = TRUE, printFitVectorsToConsole = FALSE, debug = FALSE

verbose

logical indicating if information messages should be printed to the console while running. Default: TRUE.

Value

an object of S3 class PCMFitModelMappings.

References

[Mitov et al. 2019a] Mitov, V., Bartoszek, K., & Stadler, T. (2019). Automatic generation of evolutionary hypotheses using mixed Gaussian phylogenetic models. Proceedings of the National Academy of Sciences of the United States of America, 35, 201813823. http://doi.org/10.1073/pnas.1813823116

[Mitov et al. 2019b] Mitov, V., Bartoszek, K., Asimomitis, G., & Stadler, T. (2019). Fast likelihood calculation for multivariate Gaussian phylogenetic models with shifts. Theoretical Population Biology. http://doi.org/10.1016/j.tpb.2019.11.005

See also

PCMFitMixed PCMOptions