tclustreg {fsdaR}R Documentation

Computes robust linear grouping analysis

Description

Performs robust linear grouping analysis.

Usage

tclustreg(
  y,
  x,
  k,
  alphaLik,
  alphaX,
  restrfactor = 12,
  intercept = TRUE,
  plot = FALSE,
  nsamp,
  refsteps = 10,
  reftol = 1e-13,
  equalweights = FALSE,
  mixt = 0,
  wtrim = 0,
  we,
  msg = TRUE,
  RandNumbForNini,
  trace = FALSE,
  ...
)

Arguments

y

Response variable. A vector with n elements that contains the response variable.

x

An n x p data matrix (n observations and p variables). Rows of x represent observations, and columns represent variables.

Missing values (NA's) and infinite values (Inf's) are allowed, since observations (rows) with missing or infinite values will automatically be excluded from the computations.

k

Number of groups.

alphaLik

Trimming level, a scalar between 0 and 0.5 or an integer specifying the number of observations which have to be trimmed. If alphaLik=0, there is no trimming. More in detail, if 0 < alphaLik < 1 clustering is based on h = floor(n * (1 - alphaLik)) observations. If alphaLik is an integer greater than 1 clustering is based on h = n - floor(alphaLik). More in detail, likelihood contributions are sorted and the units associated with the smallest n - h contributions are trimmed.

alphaX

Second-level trimming or constrained weighted model for x.

restrfactor

Restriction factor for regression residuals and covariance matrices of the explanatory variables. Scalar or vector with two elements. If restrfactor is a scalar it controls the differences among group scatters of the residuals. The value 1 is the strongest restriction. If restrfactor is a vector with two elements the first element controls the differences among group scatters of the residuals and the second the differences among covariance matrices of the explanatory variables. Note that restrfactor[2] is used just if alphaX=1, that is if constrained weighted model for x is assumed.

intercept

wheather to use constant term (default is intercept=TRUE

plot

If plot=FALSE (default) or plot=0 no plot is produced. If plot=TRUE a plot with the final allocation is shown (using the spmplot function). If X is 2-dimensional, the lines associated to the groups are shown too.

nsamp

If a scalar, it contains the number of subsamples which will be extracted. If nsamp = 0 all subsets will be extracted. Remark - if the number of all possible subset is greater than 300 the default is to extract all subsets, otherwise just 300. If nsamp is a matrix it contains in the rows the indexes of the subsets which have to be extracted. nsamp in this case can be conveniently generated by function subsets(). nsamp must have k * p columns. The first p columns are used to estimate the regression coefficient of group 1, ..., the last p columns are used to estimate the regression coefficient of group k.

refsteps

Number of refining iterations in each subsample. Default is refsteps=10. refsteps = 0 means "raw-subsampling" without iterations.

reftol

Tolerance of the refining steps. The default value is 1e-14

equalweights

A logical specifying wheather cluster weights in the concentration and assignment steps shall be considered. If equalweights=TRUE we are (ideally) assuming equally sized groups, else if equalweights = false (default) we allow for different group weights. Please, check in the given references which functions are maximized in both cases.

mixt

Specifies whether mixture modelling or crisp assignment approach to model based clustering must be used. In the case of mixture modelling parameter mixt also controls which is the criterion to find the untrimmed units in each step of the maximization. If mixt>=1 mixture modelling is assumed else crisp assignment. The default value is mixt=0, i.e. crisp assignment. Please see for details the provided references. The parameter mixt also controls the criterion to select the units to trim. If mixt = 2 the h units are those which give the largest contribution to the likelihood, else if mixt=1 the criterion to select the h units is exactly the same as the one which is used in crisp assignment.

wtrim

How to apply the weights on the observations - a flag taking values in c(0, 1, 2, 3, 4).

  • If wtrim==0 (no weights), the algorithm reduces to the standard tclustreg algorithm.

  • If wtrim==1, trimming is done by weighting the observations using values specified in vector we. In this case, vector we must be supplied by the user.

  • If wtrim==2, trimming is again done by weighting the observations using values specified in vector we. In this case, vector we is computed from the data as a function of the density estimate pdfe. Specifically, the weight of each observation is the probability of retaining the observation, computed as

    pretain_{ig} = 1-pdfe_{ig}/max_{ig}(pdfe_{ig})

  • If wtrim==3, trimming is again done by weighting the observations using values specified in vector we. In this case, each element wei of vector we is a Bernoulli random variable with probability of success pdfe_{ig}. In the clustering framework this is done under the constraint that no group is empty.

  • If wtrim==4, trimming is done with the tandem approach of Cerioli and Perrotta (2014).

we

Weights. A vector of size n-by-1 containing application-specific weights Default is a vector of ones.

msg

Controls whether to display or not messages on the screen If msg==TRUE (default) messages are displayed on the screen. If msg=2, detailed messages are displayed, for example the information at iteration level.

RandNumbForNini

pre-extracted random numbers to initialize proportions. Matrix of size k-by-nrow(nsamp) containing the random numbers which are used to initialize the proportions of the groups. This option is effective only if nsamp is a matrix which contains pre-extracted subsamples. The purpose of this option is to enable the user to replicate the results when the function tclustreg() is called using a parfor instruction (as it happens for example in routine IC, where tclustreg() is called through a parfor for different values of the restriction factor). The default is that RandNumbForNini is empty - then uniform random numbers are used.

trace

Whether to print intermediate results. Default is trace=FALSE.

...

potential further arguments passed to lower level functions.

Value

An S3 object of class tclustreg.object

Author(s)

FSDA team, valentin.todorov@chello.at

References

Mayo-Iscar A. (2016). The joint role of trimming and constraints in robust estimation for mixtures of gaussian factor analyzers, Computational Statistics and Data Analysis", Vol. 99, pp. 131-147.

Garcia-Escudero, L.A., Gordaliza, A., Greselin, F., Ingrassia, S. and Mayo-Iscar, A. (2017), Robust estimation of mixtures of regressions with random covariates, via trimming and constraints, Statistics and Computing, Vol. 27, pp. 377-402.

Garcia-Escudero, L.A., Gordaliza A., Mayo-Iscar A., and San Martin R. (2010). Robust clusterwise linear regression through trimming, Computational Statistics and Data Analysis, Vol. 54, pp.3057-3069.

Cerioli, A. and Perrotta, D. (2014). Robust Clustering Around Regression Lines with High Density Regions. Advances in Data Analysis and Classification, Vol. 8, pp. 5-26.

Torti F., Perrotta D., Riani, M. and Cerioli A. (2019). Assessing Robust Methodologies for Clustering Linear Regression Data, Advances in Data Analysis and Classification, Vol. 13, pp 227-257.

Examples

 ## Not run: 
 ## The X data have been introduced by Gordaliza, Garcia-Escudero & Mayo-Iscar (2013).
 ## The dataset presents two parallel components without contamination.

 data(X)
 y1 = X[, ncol(X)]
 X1 = X[,-ncol(X), drop=FALSE]

 (out <- tclustreg(y1, X1, k=2, alphaLik=0.05, alphaX=0.01, restrfactor=5, plot=TRUE, trace=TRUE))

 (out <- tclustreg(y1, X1, k=2, alphaLik=0.05, alphaX=0.01, restrfactor=2,
         mixt=2, plot=TRUE, trace=TRUE))

 ##  Examples with fishery data

 data(fishery)
 X <- fishery

 ## some jittering is necessary because duplicated units are not treated:
 ## this needs to be addressed
 X <- X + 10^(-8) * abs(matrix(rnorm(nrow(X)*ncol(X)), ncol=2))

 y1 <- X[, ncol(X)]
 X1 <- X[, -ncol(X), drop=FALSE]

 (out <- tclustreg(y1, X1, k=3, restrfact=50, alphaLik=0.04, alphaX=0.01, trace=TRUE))
 ## Example 2:

 ## Define some arbitrary weightssome arbitrary weights for the units
     we <- sqrt(X1)/sum(sqrt(X1))

 ##  tclustreg required parameters
     k <- 2; restrfact <- 50; alpha1 <- 0.04; alpha2 <- 0.01

 ##  Now tclust is run on each combination of mixt and wtrim options

     cat("\nmixt=0; wtrim=0",
         "\nStandard tclustreg, with classification likelihood and without thinning\n")
     (out <- tclustreg(y1, X1, k=k, restrfact=restrfact, alphaLik=alpha1, alphaX=alpha2,
             mixt=0, wtrim=0, trace=TRUE))

     cat("\nmixt=2; wtrim=0",
         "\nMixture likelihood, no thinning\n")
     (out <- tclustreg(y1, X1, k=k, restrfact=restrfact, alphaLik=alpha1, alphaX=alpha2,
             mixt=2, wtrim=0, trace=TRUE))

     cat("\nmixt=0; wtrim=1",
         "\nClassification likelihood, thinning based on user weights\n")
     (out <- tclustreg(y1, X1, k=k, restrfact=restrfact, alphaLik=alpha1, alphaX=alpha2,
             mixt=0, we=we, wtrim=1, trace=TRUE))

     cat("\nmixt=2; wtrim=1",
         "\nMixture likelihood, thinning based on user weights\n")
     (out <- tclustreg(y1, X1, k=k, restrfact=restrfact, alphaLik=alpha1, alphaX=alpha2,
             mixt=2, we=we, wtrim=1, trace=TRUE))

     cat("\nmixt=0; wtrim=2",
         "\nClassification likelihood, thinning based on retention probabilities\n")
     (out <- tclustreg(y1, X1, k=k, restrfact=restrfact, alphaLik=alpha1, alphaX=alpha2,
             mixt=0, wtrim=2, trace=TRUE))

     cat("\nmixt=2; wtrim=2",
         "\nMixture likelihood, thinning based on retention probabilities\n")
     (out <- tclustreg(y1, X1, k=k, restrfact=restrfact, alphaLik=alpha1, alphaX=alpha2,
             mixt=2, wtrim=2, trace=TRUE))

     cat("\nmixt=0; wtrim=3",
         "\nClassification likelihood, thinning based on bernoulli weights\n")
     (out <- tclustreg(y1, X1, k=k, restrfact=restrfact, alphaLik=alpha1, alphaX=alpha2,
             mixt=0, wtrim=3, trace=TRUE))

     cat("\nmixt=2; wtrim=3",
         "\nMixture likelihood, thinning based on bernoulli weights\n")
     (out <- tclustreg(y1, X1, k=k, restrfact=restrfact, alphaLik=alpha1, alphaX=alpha2,
             mixt=2, wtrim=3, trace=TRUE))

     cat("\nmixt=0; wtrim=4",
         "\nClassification likelihood, tandem thinning based on bernoulli weights\n")
     (out <- tclustreg(y1, X1, k=k, restrfact=restrfact, alphaLik=alpha1, alphaX=alpha2,
             mixt=0, wtrim=4, trace=TRUE))

     cat("\nmixt=2; wtrim=4",
         "\nMixture likelihood, tandem thinning based on bernoulli weights\n")
     (out <- tclustreg(y1, X1, k=k, restrfact=restrfact, alphaLik=alpha1, alphaX=alpha2,
             mixt=2, wtrim=4, trace=TRUE))

 
## End(Not run)

[Package fsdaR version 0.9-0 Index]