A mixed Gaussian phylogenetic model (MGPM) represents a
Gaussian phylogenetic model with shifts in the underlying parameters and,
optionally, type of Gaussian stochastic process (e.g. shifts from a BM to an
OU model of evolution). The function PCMFitMixed
implements a
recursive clade partition (RCP) search for an approximate information score
optimization. A deteailed description of the algorithm is provided in
Appendix A in Mitov et al. 2019a. For this documentation, it is important to
note that the algorithm proceeds in three steps as follows:
- clade-fits
Model type fits to all clades in the tree bigger than a specified
(see argument minCladeSize
).
- RCP
Recursive clade partition search.
- Round-robin
Round-robin search for an optimal model type mapping in
the best partitions found during the RCP-step. This step is optional.
PCMFitMixed(
X,
tree,
modelTypes = MGPMDefaultModelTypes(),
subModels = c(B = "A", C = "A", D = "B", E = "D", F = "E"),
argsMixedGaussian = Args_MixedGaussian_MGPMDefaultModelTypes(),
SE = matrix(0, nrow(X), PCMTreeNumTips(tree)),
generatePCMModelsFun = PCMGenerateModelTypes,
metaIFun = PCMInfo,
positiveValueGuard = Inf,
scoreFun = AIC,
fitMappingsPrev = NULL,
tableFitsPrev = fitMappingsPrev$tableFits,
modelTypesInTableFitsPrev = NULL,
listPartitions = NULL,
minCladeSizes = 20L,
skipNodes = character(),
maxCladePartitionLevel = if (is.null(listPartitions)) 8L else 1L,
maxNumNodesPerCladePartition = 1L,
listAllowedModelTypesIndices = c("best-clade-2", "best-clade", "all"),
argsConfigOptim1 = DefaultArgsConfigOptim(numCallsOptim = 10),
argsConfigOptim2 = DefaultArgsConfigOptim(numCallsOptim = 4),
argsConfigOptim3 = DefaultArgsConfigOptim(numCallsOptim = 10),
maxNumRoundRobins = 0,
maxNumPartitionsInRoundRobins = 2,
listPCMOptions = PCMOptions(),
skipFitWhenFoundInTableFits = TRUE,
doParallel = FALSE,
prefixFiles = "fits_",
saveTempWorkerResults = TRUE,
printFitVectorsToConsole = FALSE,
verbose = TRUE,
debug = FALSE
)
Arguments
X |
a k x N numerical matrix with possible NA and
NaN entries. For i=1,..., N , the column i of X contains
the measured trait values for species i (the tip with integer
identifier equal to i in tree ). Missing values can be either
not-available (NA ) or not existing (NaN ). These two values are
treated differently when calculating likelihoods (see
PCMPresentCoordinates ). |
tree |
a phylo object with N tips. |
modelTypes |
a named character string vector. Each such
string denotes a MGPM model class. The default setting is
MGPMDefaultModelTypes() , which corresponds to the model types
A,B,C,D,E,F as defined in Mitov et al. 2019a:
- A.
BM (H = 0, diagonal \(\Sigma\)): BM, uncorrelated traits.
- B.
BM (H = 0, symmetric \(\Sigma\)): BM, correlated traits.
- C.
OU (diagonal H, diagonal \(\Sigma\)): OU, uncorrelated traits.
- D.
OU (diagonal H, symmetric \(\Sigma\)): OU, correlated traits, but simple
(diagonal) selection strength matrix.
- E.
OU (symmetric H, symmetric \(\Sigma\)): An OU with nondiagonal symmetric H
and nondiagonal symmetric \(\Sigma\).
- F.
OU (asymmetric H, symmetric \(\Sigma\)): An OU with nondiagonal asymmetric
H and nondiagonal symmetric \(\Sigma\).
See also the argument subModels and
MGPMDefaultModelTypes . |
subModels |
a named character string vector with names and elements
among the names of modelTypes . This argument specifies a nesting
relationship between the model types: a model A is a sub-model of a
model B if A's parameters are a subset of B's parameters. During the RCP
search, the parameters of a sub-model fit to a clade in the tree are used as
a hint for the parameters of its encompassing models. The default setting is
c(B = 'A', C = 'A', D = 'B', E = 'D', F = 'E') . Note that this
setting is valid only for the default setting of the argument
modelTypes . Setting this argument to NULL will turn off the
use of nested models as hints. |
argsMixedGaussian |
a list of arguments passed to
MixedGaussian . Default:
Args_MixedGaussian_MGPMDefaultModelTypes() . |
SE |
a k x N matrix specifying the standard error for each measurement in
X. Alternatively, a k x k x N cube specifying an upper triangular k x k
factor of the variance covariance matrix for the measurement error
for each tip i=1, ..., N . When SE is a matrix, the k x k
measurement error variance matrix for a tip i is calculated as
VE[, , i] <- diag(SE[, i] * SE[, i], nrow = k) . When SE is a
cube, the way how the measurement variance matrix for a tip i is
calculated depends on the runtime option PCMBase.Transpose.Sigma_x
as follows:
- if
getOption("PCMBase.Transpose.Sigma_x", FALSE) == FALSE (default): VE[, , i] <- SE[, , i] %*% t(SE[, , i])
- if
getOption("PCMBase.Transpose.Sigma_x", FALSE) == TRUE : VE[, , i] <- t(SE[, , i]) %*% SE[, , i]
Note that the above behavior is consistent with the treatment of the model
parameters Sigma_x , Sigmae_x and Sigmaj_x , which are
also specified as upper triangular factors.
Default: matrix(0.0, PCMNumTraits(model), PCMTreeNumTips(tree)) . |
generatePCMModelsFun |
A function generating PCM model types.
Specifying this function is only needed if the inference is to be done using #' custom candidate model types. Default setting:
PCMGenerateModelTypes . Note that this function is called in
each worker process and populates the global environment with
PCMParentClasses and PCMSpecify S3 methods. See
also PCMGenerateModelTypes . |
metaIFun |
a metaI function needed to generate meta- and cache- objects
for the model objects generated during the RCP search. By default this is
set to PCMInfo from the package PCMBase. For faster execution,
it is recommended to use the function PCMInfoCpp from the package
PCMBaseCpp. |
positiveValueGuard |
a real number (not necessarily positive) used during
the fit as a threshold for highly positive but likely incorrect log-likelihood
values. This argument is set to Inf by default and should be used only
as a last escape. The recommended way to of preventing such numerical errors
is by setting more stringent values for the runtime options
PCMBase.Threshold.EV and PCMBase.Threshold.SV (see
PCMOptions ). |
scoreFun |
a information score function such as AIC or BIC.
Default: AIC . |
fitMappingsPrev |
an object of S3 class PCMFitModelMappings, returned
during a previous call to PCMFitMixed . Default: NULL . This can
be used to enrich a previous MGPM inference with new model types. Currently,
this feature is experimental. See also arguments tableFitsPrev and
modelTypesInTableFitsPrev . |
tableFitsPrev |
a data.table object containing stored
MGPM fits from a previous (possibly unfinished) call to PCMFitMixed .
During a run PCMFitMixed stores all visited candidate MGPM models in
a file named 'Current_X.RData' where X is the value of the argument
prefixFiles . This file contains a list object of class
'PCMFitModelMappings' named 'listResults' . One of the elements in
'listResults' is a data.table 'tableFits' that can be passed
as a value of this argument. Default: fitMappingsPrev$tableFits .
If a candidate model is found in tableFitsPrev, the previously inferred
parameters will be reused unless the argument
skipFitWhenFoundInTableFits is set to FALSE . See
also the argument skipFitWhenFoundInTableFits . |
modelTypesInTableFitsPrev |
model types used for the fits in
'tableFitsPrev' . Default: NULL , which means that the same
model types are used as specified in the argument modelTypes . |
listPartitions, minCladeSizes, skipNodes |
arguments controlling the
search for an optimal shift point configuration. If not NULL ,
listPartitions should be specified as a list of integer vectors with
each such vector containing node indices in tree corresponding to a
shift-point configuration (partition). If listPartitions is
NULL (default) then the shift-point configuration are generated
according to the RCP search algorithm and the arguments minCladeSizes
and skipNodes . The argument minCladeSizes is an integer or a
vector of integers - if a single integer it specifies the the minimum number
of tips allowed in a part at al recursion levels of the RCP algorithm; if a
vector, then each element specifies the minimal part size at each level of
the algorithm (see also argument maxCladePartitionLevel ). By default
this is set to 20 . The argument skipNodes is used in
combination with minCladeSizes . This is an integer or character
vector indicating the ids or labels of nodes that should not be used as
partition nodes. By default, this is an empty character vector. Default
setting for the three arguments:
listPartitions = NULL,
minCladeSizes = 20L,
skipNodes = character() . See also maxCladePartitionLevel ,
maxNumNodesPerCladePartition and
listAllowedModelTypesIndices . |
maxCladePartitionLevel |
An integer limiting the number of recursive
levels of the RCP search algorithm. Default setting:
if(is.null(listPartitions)) 8L else 1L . |
maxNumNodesPerCladePartition |
An integer controlling the number of
partition nodes in subtree-partitions during the RCP search. By default,
this is set to 1 , meaning that only single shifts within a parent part
will be searched. Setting this argument to a bigger value will make the
RCP search algorithm 'less greedy' but will cause a significant slow-down
in the search - many more partitions will be visited during the search. |
listAllowedModelTypesIndices |
Default setting: c("best-clade-2", "best-clade", "all") . |
argsConfigOptim1, argsConfigOptim2, argsConfigOptim3 |
Arguments
controlling the calls to PCMFit during the three steps of the
RCP algorith.
argsConfigOptim1 = DefaultArgsConfigOptim(numCallsOptim = 10),
argsConfigOptim2 = DefaultArgsConfigOptim(numCallsOptim = 4),
argsConfigOptim3 = DefaultArgsConfigOptim(numCallsOptim = 10)
. |
maxNumRoundRobins |
number of round-robin iterations. By default this is
set ot 0, i.e. no round-robin step is performed. |
maxNumPartitionsInRoundRobins |
maximum number (2 by default) of top
partitions tobe included in the round-robin step. This argument only has an
impact if maxNumRoundRobins > 0 . |
listPCMOptions |
a list of PCM runtime options to be specified prior
to running the fit. By default, this is set to PCMOptions() . |
skipFitWhenFoundInTableFits |
logical. Default: TRUE . |
doParallel |
logical indicating if the recursive clade partition search
should be executed in parallel. Default: FALSE . |
prefixFiles |
a character string used to name a 'Current_X.RData' file
stored during the inference, where X is to be replaced by the value of
prefixFiles . Note that the name
'prefixFiles' can be misleading, because the actual file-name does not start
with it - this is due for compliance with existing code. Default:
"fits_" . |
saveTempWorkerResults, printFitVectorsToConsole, debug |
Debugging options controlling log-messages and storing of temporary results.
These are mostly for internal use. Default setting:
saveTempWorkerResults, = TRUE, printFitVectorsToConsole = FALSE,
debug = FALSE |
verbose |
logical indicating if information messages should be printed
to the console while running. Default: TRUE . |
Value
an object of S3 class PCMFitModelMappings.
References
[Mitov et al. 2019a] Mitov, V., Bartoszek, K., & Stadler, T. (2019).
Automatic generation of evolutionary hypotheses using mixed Gaussian
phylogenetic models. Proceedings of the National Academy of Sciences of the
United States of America, 35, 201813823.
http://doi.org/10.1073/pnas.1813823116
[Mitov et al. 2019b] Mitov, V., Bartoszek, K., Asimomitis, G., & Stadler, T. (2019). Fast
likelihood calculation for multivariate Gaussian phylogenetic models with
shifts. Theoretical Population Biology.
http://doi.org/10.1016/j.tpb.2019.11.005
See also