| archetypal-package {archetypal} | R Documentation | 
Finds the Archetypal Analysis of a Data Frame
Description
Performs archetypal analysis by using Principal Convex Hull Analysis (PCHA) with full control over all algorithmic parameters. It contains a set of functions for determining the initial solution, the optimal algorithmic parameters and the optimal number of archetypes. Post-run tools are also available for assessing the derived solution.
Compute Archetypal Analysis (AA)
The main function is archetypal, a variant of the PCHA algorithm (see [1], [2]) adapted to the R language. It gives control over the entire set of parameters involved and has two main options:
- initialrows = NULL: the initial solution is found by one of the methods "projected_convexhull", "convexhull", "partitioned_convexhull", "furthestsum", "outmost", "random"
- initialrows = (a vector of kappas rows): the given rows form the initial solution for AA
This is the main function of the package, but extensive trials have shown that:
- AA may be very difficult to run if a random initial solution has been chosen
- for the same data set, the final Sum of Squared Errors (SSE) may be much smaller if the initial solution is close to the final one
- even the quality of the resulting AA is affected by the starting point
This is why we have developed a whole set of methods for choosing an initial solution for the PCHA algorithm.
Find a time-efficient initial approximation for AA
There are three functions that work with the Convex Hull (CH) of the data set.
- find_outmost_convexhull_points computes the CH of all points
- find_outmost_projected_convexhull_points computes the CH for all possible combinations of variables taken by npr (default=2)
- find_outmost_partitioned_convexhull_points makes np partitions of the data frame (default=10), then computes the CH for each partition and finally gives the CH of the overall union
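The projected variant can be illustrated with a minimal base R sketch. It is not the package's implementation: the data, the value of kappas and the rule "keep the rows that appear on the most hulls" are assumptions made for illustration, and only grDevices::chull from base R is used.

```r
# Sketch only: base R, not the code of find_outmost_projected_convexhull_points.
set.seed(1)
df <- data.frame(x = rnorm(50), y = rnorm(50), z = rnorm(50))  # made-up data
kappas <- 4                                                    # hypothetical number of initial rows

# All combinations of variables taken 2 at a time (2D projections).
pairs <- combn(ncol(df), 2)

# Row positions on the convex hull of each projection, pooled together.
hull_rows <- unlist(lapply(seq_len(ncol(pairs)), function(j) {
  grDevices::chull(as.matrix(df[, pairs[, j]]))
}))

# Assumed selection rule: rows lying on the hull of many projections come first.
freq <- sort(table(hull_rows), decreasing = TRUE)
initial_rows <- as.integer(names(freq))[seq_len(kappas)]
initial_rows
```

The resulting integer vector has the same shape as what would be passed as initialrows.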
The simplest method for estimating an initial solution is find_outmost_points,
where we just compute the outermost points, i.e. those that are most frequently the outermost ones across all
available points.
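One possible reading of "most frequent outermost" is sketched below in base R: for every point, find the point furthest from it, then keep the rows that are named most often. This reading, the data and the value of kappas are assumptions, and the code is not taken from find_outmost_points.

```r
# Sketch only: one interpretation of the "outmost" idea, in base R.
set.seed(2)
df <- data.frame(x = rnorm(40), y = rnorm(40))  # made-up data
kappas <- 3                                     # hypothetical number of initial rows

d <- as.matrix(dist(df))             # pairwise Euclidean distances
furthest <- apply(d, 1, which.max)   # for each point, the row furthest from it

# Rows most frequently chosen as "furthest" are taken as the outermost ones.
freq <- sort(table(furthest), decreasing = TRUE)
outmost_rows <- as.integer(names(freq))[seq_len(min(kappas, length(freq)))]
outmost_rows
```

The min() guard covers the case where fewer than kappas distinct rows are ever the furthest point.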
The default method "FurthestSum" (FS) of PCHA (see [1], [2]) is used by find_furthestsum_points, which applies
FS nfurthest times (default=10) and then finds the most frequent points.
The "random" method is also available for comparison purposes; it gives a random set of kappas points as the initial solution.
All methods return the row numbers of the input data frame as integers. Take care if your data frame
has row names which are integers but not identical to 1:dim(df)[1].
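The pitfall can be seen in a small base R example. It assumes that the returned integers are row positions; the data frame and the vector rows are made up for illustration.

```r
# Sketch only: integer row names that differ from 1:dim(df)[1].
df <- data.frame(x = c(0.1, 0.9, 0.5, 0.3),
                 row.names = c(10L, 20L, 30L, 40L))  # made-up data
rows <- c(2L, 4L)  # hypothetical output of one of the methods above

by_position <- df[rows, , drop = FALSE]  # integer index: 2nd and 4th rows
picked_names <- rownames(df)[rows]       # their row names, not 2 and 4

# TRUE only when row names coincide with positions.
names_match <- identical(rownames(df), as.character(seq_len(dim(df)[1])))
```

When names_match is FALSE, indexing by position and looking up by row name refer to different rows, so it is worth checking before using the result.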
Find the optimal number of archetypes
The function find_optimal_kappas is available for this task. For each kappas from 1 to maxkappas (default=15)
it runs AA ntrials (default=10) times,
stores the SSE and VarianceExplained from each run, and then computes the knee or elbow point by using the UIK method, see [3].
Determining the optimal updating parameters
Extensive trials have shown us that choosing proper values for the algorithmic updating parameters
(muAup, muAdown, muBup, muBdown) can speed up the process remarkably. That is the task of
find_pcha_optimal_parameters, which conducts a grid search over different values
of these parameters and returns the values which minimize the SSE after a fixed number of iterations (testing_iters, default=10).
Evaluate the quality of Archetypal Analysis
By using the function check_Bmatrix we can evaluate the overall quality of
the applied method and algorithm. Quality can be considered high:
- if every archetype is created from a small number of data points
- if the relevant weights are not numerically insignificant
Of course we must take into account the SSE and VarianceExplained, but if we have to compare two solutions with similar termination status, then we must choose the one with the simplest B matrix.
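The two criteria can be checked by hand on a small example. The sketch below is base R and does not reproduce check_Bmatrix. It assumes a B matrix with one row per archetype and one column per data point, with rows summing to 1; the matrix and the threshold tol are made up.

```r
# Sketch only: manual inspection of a made-up B matrix.
# 2 archetypes (rows) expressed as weights over 5 data points (columns).
B <- rbind(c(0.7, 0.3, 0,   0,        0),
           c(0,   0,   0.5, 0.499999, 0.000001))
tol <- 1e-3  # hypothetical threshold for a numerically insignificant weight

# Each archetype should be a convex combination of data points.
rows_sum_to_one <- all(abs(rowSums(B) - 1) < 1e-12)

n_used <- rowSums(B > 0)            # data points contributing to each archetype
n_tiny <- rowSums(B > 0 & B < tol)  # contributors with negligible weight

data.frame(archetype = seq_len(nrow(B)), n_used, n_tiny)
```

Here the first archetype uses two points with substantial weights, while the second has one contributor below tol, which would count against it under the second criterion.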
Resampling
The package includes a function for resampling (grouped_resample) which may be used for standard bootstrapping or for subsampling. 
This function allows samples to be drawn with or without replacement, by groups and with or without Dirichlet weights. 
This provides a variety of options for researchers who wish to correct sample biases, estimate empirical confidence intervals, 
and/or subsample large data sets. 
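The idea of resampling by groups can be sketched in base R as follows. This is not the code of grouped_resample: the data are made up, and attaching normalised Gamma draws as Dirichlet weights to the resampled rows is an assumption about how such weights might be used.

```r
# Sketch only: bootstrap within groups, then attach Dirichlet(1, ..., 1) weights.
set.seed(3)
df <- data.frame(grp = rep(c("a", "b"), times = c(6, 4)),
                 x = rnorm(10))  # made-up data

# Sample row positions with replacement inside each group,
# so that group sizes are preserved.
idx <- unlist(lapply(split(seq_len(nrow(df)), df$grp), function(rows) {
  rows[sample.int(length(rows), replace = TRUE)]
}))
boot <- df[idx, ]

# Normalised Gamma(shape = 1) draws follow a Dirichlet(1, ..., 1) distribution.
gam <- rgamma(nrow(boot), shape = 1)
boot$w <- gam / sum(gam)
```

Setting replace = FALSE and sampling fewer rows per group would turn the same pattern into subsampling.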
Post-run tools
Besides check_Bmatrix, the following functions are available for checking the convergence process itself and
for examining the local neighborhood of the archetypes:
- The function study_AAconvergence analyzes the history of iterations done and produces a multi-panel plot showing the steps and quality of the convergence to the final archetypes.
- By setting the desired number npoints as an argument in the function find_closer_points, we can find the data points that are in the local neighborhood of each archetype. This allows us to study the properties of the solution or manually choose an initial approximation to search for a better fit.
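The notion of a local neighborhood can be illustrated with a few lines of base R. The sketch does not call find_closer_points. The data matrix Y, the archetype matrix A and the use of Euclidean distance are assumptions made for illustration.

```r
# Sketch only: the npoints data points nearest to each archetype.
set.seed(4)
Y <- matrix(rnorm(60), ncol = 2)  # made-up data, 30 points in 2D
A <- rbind(c(2, 2), c(-2, 0))     # made-up archetypes, one per row
npoints <- 3

closest <- lapply(seq_len(nrow(A)), function(k) {
  # Euclidean distance of every data point to archetype k.
  d <- sqrt(rowSums(sweep(Y, 2, A[k, ])^2))
  order(d)[seq_len(npoints)]      # row positions, nearest first
})
closest
```

Each list element holds row positions of Y, so the rows nearest to a given archetype could be inspected directly or reused as a manual initial approximation.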
Note
Bug reports and feature requests can be sent to
dchristop@econ.uoa.gr or 
 dem.christop@gmail.com.
Author(s)
Maintainer: Demetris Christopoulos dchristop@econ.uoa.gr
Other contributors:
- David Midgley david.midgley@insead.edu [contributor] 
- Sunil Venaik s.venaik@business.uq.edu.au [contributor] 
- INSEAD Fontainebleau France [funder] 
References
[1] M Morup and LK Hansen, "Archetypal analysis for machine learning and data mining", Neurocomputing (Elsevier, 2012). https://doi.org/10.1016/j.neucom.2011.06.033.
[2] Source: https://mortenmorup.dk/?page_id=2, last accessed 2024-03-09
[3] Christopoulos, Demetris T., Introducing Unit Invariant Knee (UIK) As an Objective Choice for Elbow Point in Multivariate Data Analysis Techniques (March 1, 2016). Available at SSRN: https://ssrn.com/abstract=3043076 or http://dx.doi.org/10.2139/ssrn.3043076