fill {BAT}    R Documentation

Filling missing data.

Description

Estimation of missing trait values (NA) based on different methods.

Usage

fill(trait, method = "regression", group = NULL, weight = NULL, step = TRUE)

Arguments

trait

A species x traits matrix (a species or individual for each row and traits as columns).

method

Method for imputing missing data. One of "mean" (mean value of the trait), "median" (median value of the trait), "similar" (value taken from the closest species), "regression" (linear regression), "w_regression" (regression weighted by species distance), or "PCA" (Principal Component Analysis).

group

A vector (character, factor, etc.) whose values indicate the group each species belongs to. Only species in the same group as the species with missing data are used to estimate its missing values. If NULL, all species are used.

weight

An hclust, phylo or dist object used to calculate the distances between species, which are then used as weights. Note that the order of tip labels in trees, or of species in the distance matrix, should be the same as the order of species in trait.

step

A boolean (T/F) indicating whether a stepwise regression model based on AIC should be performed. Ignored if regression is not used.
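The ordering requirement on weight can be checked before calling fill. The following is a minimal sketch using only base R (stats); the species names "sp1" to "sp10" are hypothetical and serve only to make the labels visible.

```r
# Small trait matrix with explicit (hypothetical) species names.
trait <- iris[1:10, -5]
rownames(trait) <- paste0("sp", 1:10)

# Two of the object types accepted by the weight argument.
d <- dist(trait)      # object of class "dist"
tree <- hclust(d)     # object of class "hclust"

# Both objects should list species in the same order as the rows of trait.
same_order_dist <- identical(labels(d), rownames(trait))
same_order_tree <- identical(tree$labels, rownames(trait))
```

If either check is FALSE, reordering the rows of trait (or rebuilding the distance object from the reordered matrix) before calling fill avoids mismatched weights.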

Details

Imputes missing data in the trait matrix based on different methods (see Taugourdeau et al. 2014; Johnson et al. 2021 for comparisons among the performance of different methods). The simplest approach is average imputation ("mean" or "median"), which replaces each missing value with the mean/median of all non-missing observations of that trait. It has the advantage of keeping the same mean and the same sample size, but many disadvantages. The "similar" method imputes a systematically chosen value from the closest species, that is, the species with the most similar values for the other variables. The default method is linear regression ("regression"), where the predicted value is obtained by regressing the variable with missing data on the other variables. This preserves relationships among the variables involved in the imputation model, but not the variability around predicted values (i.e., it may lead to extrapolations). The "w_regression" method takes into account the relative distance among species when imputing missing traits, based on the phylogenetic or functional distance between species with and without missing data. The "PCA" method performs PCA with incomplete data sensu Podani et al. (2021). Note that for the PCA and regression methods the performance of the prediction increases as the number of collinear traits increases.
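The ideas behind average and regression imputation can be illustrated with a short base R sketch. This is not the code used by fill; it assumes a single trait (Petal.Length) with missing values and complete data for all other traits, and the object names (filled_mean, filled_reg, model) are illustrative only.

```r
# Illustrative sketch only; not the implementation used by BAT::fill().
set.seed(1)
trait <- iris[, -5]
trait[sample(nrow(trait), 5), "Petal.Length"] <- NA
missing <- is.na(trait$Petal.Length)

# Mean imputation: replace each NA with the mean of the observed values.
observed_mean <- mean(trait$Petal.Length, na.rm = TRUE)
filled_mean <- trait
filled_mean$Petal.Length[missing] <- observed_mean

# Regression imputation: fit a linear model on the complete rows,
# then predict the missing values from the other traits.
model <- lm(Petal.Length ~ ., data = trait[!missing, ])
filled_reg <- trait
filled_reg$Petal.Length[missing] <- predict(model, newdata = trait[missing, ])
```

Mean imputation leaves the trait mean unchanged but assigns the same value to every missing cell, whereas the regression predictions vary with the other traits of each species.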

Value

A trait matrix with missing data (NA) filled with predicted values. If method = "PCA" the function returns the standard output of a principal component analysis as a list with:

Eigenvalues
Positive eigenvalues
Positive eigenvalues as percent
Square root of eigenvalues
Eigenvectors
Component scores
Variable scores
Object scores in a biplot
Variable scores in a biplot

References

Johnson, T.F., Isaac, N.J., Paviolo, A. & Gonzalez-Suarez, M. (2021). Handling missing values in trait data. Global Ecology and Biogeography, 30: 51-62.

Podani, J., Kalapos, T., Barta, B. & Schmera, D. (2021). Principal component analysis of incomplete data. A simple solution to an old problem. Ecological Informatics, 101235.

Taugourdeau, S., Villerd, J., Plantureux, S., Huguenin-Elie, O. & Amiaud, B. (2014). Filling the gap in functional trait databases: use of ecological hypotheses to replace missing data. Ecology and Evolution, 4: 944-958.

Examples

## Not run: 
trait <- iris[,-5]
group <- iris[,5]

# Generate some random missing data
for (i in 1:10) {
  trait[sample(nrow(trait), 1), sample(ncol(trait), 1)] <- NA
}

#Estimating the missing data with different methods
fill(trait, "mean")
fill(trait, "mean", group)
fill(trait, "median")
fill(trait, "median", group)
fill(trait, "similar")
fill(trait, "similar", group)
fill(trait, "regression", step = FALSE)
fill(trait, "regression", group, step = TRUE)
fill(trait, "w_regression", step = TRUE)
fill(trait, "w_regression", weight = dist(trait), step = TRUE)
fill(trait, "PCA")

## End(Not run)

[Package BAT version 2.9.6 Index]