project.character {Infusion} | R Documentation
Learn a projection method for statistics and apply it
Description
project is a generic function with two methods. If the first argument is a parameter name, project.character (alias: get_projector) defines a projection function from several statistics to an output statistic predicting this parameter. project.default (alias: get_projection) produces a vector of projected statistics using such a projection. project is particularly useful for reducing a large number of summary statistics to a vector of projected summary statistics, with as many elements as there are parameters to infer. This dimension reduction can substantially speed up subsequent computations.
The concept implemented in project is to fit a parameter to the various statistics available, using machine-learning or mixed-model prediction methods. All such methods can be seen as nonlinear projections to a one-dimensional space. project.character is an interface that allows different projection methods to be used, provided they return an object of a class that has a defined predict method with a newdata argument (as expected, see predict).
plot_proj is a hastily written convenience function that draws a diagnostic plot for a projection, from an object of class SLik_j.
Usage
project(x,...)
## S3 method for building the projection
## S3 method for class 'character'
project(x, stats, data,
trainingsize= eval(Infusion.getOption("trainingsize")),
train_cP_size= eval(Infusion.getOption("train_cP_size")),
method, methodArgs=list(), verbose=TRUE,...)
get_projector(...) # alias for project.character
## S3 method for applying the projection
## Default S3 method:
project(x, projectors, use_oob=Infusion.getOption("use_oob"),
is_trainset=FALSE, methodArgs=list(), ...)
get_projection(...) # alias for project.default
plot_proj(object, parm, proj, xlab=parm, ylab=proj, ...)
Arguments
x
The name of the parameter to be predicted (project.character method), or a vector/matrix/list of matrices of summary statistics (default method).

stats
Statistics from which the parameter is to be predicted.

use_oob
Boolean: whether to use out-of-bag predictions for data used in the training set, when such predictions are available (i.e., for random forest methods). The default, controlled by the same-named package option, is TRUE. This by default involves a costly check on each row of the input; see the is_trainset argument.

is_trainset
Boolean. Set it to TRUE if the input data are the training set of the projection, so that out-of-bag predictions can be used without checking each row of the input.

data
A list of simulated empirical distributions from which the projection is learned (see the workflow examples linked in the Note).

trainingsize, train_cP_size
Integers. For most projection methods (excluding "REML"), only trainingsize is taken into account: it gives the size of the random subset of the data used to train the projector. For method="REML", train_cP_size gives the size of the subset used to estimate the smoothing parameters, and trainingsize the size of the subset defining the predictor given these parameters (see Details).

method
Character string: the projection method, one of "ranger" (the default if that package is installed), "randomForest", "neuralNet", "fastai", "keras", "REML", or "GCV" (see Details).

methodArgs
A list of arguments for the projection method; which elements are meaningful depends on the method used (see Details).

projectors
A list with elements of the form <name>=<projection>, where each <projection> is a return value of project.character and <name> is the name given to the resulting projected statistic.

verbose
Whether to print some information or not.

object
An object of class SLik_j.

parm
Character string: a parameter name.

proj
Character string: name of a projected statistic.

xlab, ylab
Axis labels, passed to the plotting function.

...
Further arguments passed to or from other functions.
Details
The preferred project method is non-parametric regression by (variants of) the random forest method as implemented in the ranger package. It is the default method if that package is installed. Alternative methods have been interfaced as detailed below, but the functionality of most of these interfaces is infrequently tested.
By default, the ranger call through project uses the split rule "extratrees", with some other controls also differing from the ranger package defaults. If the split rule "variance" is used, the default value of mtry used in the call also differs from the ranger default, but is consistent with Breiman (2001) for regression tasks.
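For instance, such controls can be overridden through the methodArgs argument. The following is a hedged sketch, assuming that methodArgs elements are forwarded to the ranger call; splitrule and num.trees are ranger arguments, and the data and statistics names are purely illustrative.

## Learn a projection with non-default ranger controls:
proj_theta <- project("theta", stats = paste0("s", 1:20), data = simuls,
                      method = "ranger",
                      methodArgs = list(splitrule = "variance", # ranger split rule
                                        num.trees = 1000))      # number of trees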
Machine learning methods such as random forests overfit, except if out-of-bag predictions are used. When they are not, the bias is manifest in the fact that using the same simulation table for learning the projectors and for other steps of the analyses tends to lead to too narrow confidence regions. This bias disappears over iterations of refine when the projectors are kept constant. Infusion avoids this bias by using out-of-bag predictions, when relevant, when ranger or randomForest is used. But it provides no code handling that problem for other machine-learning methods. In that case, users should cope with that problem themselves, and at a minimum should not update the projectors in every iteration (the “Gentle Introduction to Infusion” may contain further information about this problem).
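In practice, out-of-bag predictions matter when the training data themselves are projected, as in this sketch (object names are illustrative; proj_theta is assumed to be the return value of a project.character call using a random-forest method):

## Projecting the training reference table itself: out-of-bag predictions
## are used (use_oob=TRUE by default), and is_trainset=TRUE skips the
## costly row-by-row check of training-set membership.
proj_simuls <- project(simuls, projectors = list(THETA = proj_theta),
                       use_oob = TRUE, is_trainset = TRUE)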
Prediction can be based on a linear mixed model (LMM) with autocorrelated random effects, internally calling the corrHLfit function with formula <parameter> ~ 1 + Matern(1|<stat1>+...+<statn>). This approach can in principle produce arbitrarily complex predictors (given sufficient input) and avoids overfitting in the same way as restricted likelihood methods avoid overfitting in LMMs: REML methods are then used by default to estimate the smoothing parameters. However, faster methods are generally required.
To keep REML computation reasonably fast, the train_cP_size and trainingsize arguments determine, respectively, the size of the subset used to estimate the smoothing parameters and the size of the subset defining the predictor given the smoothing parameters. REML fitting is already slow for data sets of this size (particularly as the number of predictor variables increases).
If method="GCV"
, a generalized cross-validation procedure (Golub et al. 1979) is used to estimate the smoothing parameters. This is faster but still slow, so a random subset of size trainingsize
is still used to estimate the smoothing parameters and generate the predictor.
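A hedged illustration of both options (the subset sizes are arbitrary, and the data and statistics names purely illustrative):

## REML: train_cP_size controls the subset used to estimate the smoothing
## parameters, trainingsize the subset defining the predictor given them.
proj_reml <- project("theta", stats = paste0("s", 1:20), data = simuls,
                     method = "REML",
                     train_cP_size = 200, trainingsize = 1000)
## GCV: smoothing parameters estimated by generalized cross-validation.
proj_gcv <- project("theta", stats = paste0("s", 1:20), data = simuls,
                    method = "GCV", trainingsize = 1000)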
Alternatively, various machine-learning methods can be used (see e.g. Hastie et al., 2009, for an introduction). A random subset of size trainingsize is again used, with a larger default value on the assumption that these methods are faster. Predefined methods include:
- "ranger", the default, a computationally efficient implementation of random forest;
- "randomForest", the older default, probably obsolete now;
- "neuralNet", a neural network method, using the train function from the caret package (probably obsolete too);
- "fastai", deep learning using the fastai package;
- "keras", deep learning using the keras package.
The last two interfaces may yet offer limited or undocumented control: using deep learning seems attractive, but the benefits over "ranger" are not clear (notably, the latter provides out-of-bag predictions that avoid overfitting).
In principle, any object suitable for prediction could be used as one of the projectors, and Infusion implements their usage so that, in principle, unforeseen projectors could be used. That is, if predictions of a parameter can be performed using an object MyProjector of class MyProjectorClass, MyProjector could be used in place of a project result if predict.MyProjectorClass(object, newdata, ...) is defined. However, there is no guarantee that this will work on unforeseen projection methods, as each method tends to have some syntactic idiosyncrasies. For example, if the learning method that generated the projector used a formula-data syntax, then its predict method is likely to request names for its newdata, which need to be provided through attr(MyProjector,"stats") (these names cannot be assumed to be in the newdata when predict is called through optim).
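For illustration, a minimal sketch of such a custom projector; all names (MyProjectorClass, the lm-based fit, the statistics s1 and s2) are hypothetical, and it is assumed, as described above, that project.default calls predict on each projector:

## Any object whose class has a predict method with a 'newdata' argument
## can in principle serve as a projector:
MyProjector <- lm(theta ~ s1 + s2, data = simuls)
class(MyProjector) <- c("MyProjectorClass", class(MyProjector))
predict.MyProjectorClass <- function(object, newdata, ...) {
  ## delegate to the lm predict method, coercing 'newdata' (possibly a
  ## matrix) to the data frame that predict.lm expects:
  stats::predict.lm(object, newdata = as.data.frame(newdata), ...)
}
## record the statistics' names so that unnamed 'newdata' can be matched:
attr(MyProjector, "stats") <- c("s1", "s2")
proj_obs <- project(obs_stats, projectors = list(THETA = MyProjector))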
Value
project.character returns an object of the class returned by the projection method (methods "REML" and "GCV" call corrHLfit, which returns an object of class spaMM).
project.default returns an object of the same class and structure as the input x, containing the projected statistics inferred from the input summary statistics.
Note
See workflow examples in example_reftable and example_raw_proj.
References
Breiman, L. (2001) Random forests. Machine Learning 45: 5-32. doi:10.1023/A:1010933404324.
Golub, G. H., Heath, M. and Wahba, G. (1979) Generalized cross-validation as a method for choosing a good ridge parameter. Technometrics 21: 215-223.
Hastie, T., Tibshirani, R. and Friedman, J. (2009) The Elements of Statistical Learning: Data Mining, Inference, and Prediction. 2nd edition. Springer, New York.
Examples
## see Note for links to examples.
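A hedged sketch of the two-step workflow (all object names are illustrative: simuls stands for a simulated reference table containing the parameter theta and statistics s1 to s20, obs_stats for the observed summary statistics, and slik for an SLik_j object produced by later inference steps):

## Step 1: learn a projection predicting parameter 'theta':
proj_theta <- project("theta", stats = paste0("s", 1:20), data = simuls)
## Step 2: apply it to the observed statistics and the reference table:
proj_obs <- project(obs_stats, projectors = list(THETA = proj_theta))
proj_simuls <- project(simuls, projectors = list(THETA = proj_theta),
                       is_trainset = TRUE)
## Diagnostic plot for the projection:
plot_proj(slik, parm = "theta", proj = "THETA")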