CST_EnsClustering {CSTools}    R Documentation

Ensemble clustering

Description

This function performs a clustering on members/starting dates and returns a number of scenarios, with representative members for each of them. The clustering is performed in a reduced EOF space.

Motivation: Ensemble forecasts give a probabilistic insight into average weather conditions on extended timescales, i.e. from sub-seasonal to seasonal and beyond. With large ensembles, it is often an advantage to be able to group members according to similar characteristics and to select the most representative member for each cluster. This can be useful to characterize the most probable forecast scenarios in a multi-model (or single model) ensemble prediction. This approach, applied at a regional level, can also be used to identify the subset of ensemble members that best represent the full range of possible solutions for downscaling applications. The choice of the ensemble members is flexible in order to meet the requirements of specific (regional) climate information products, which can be tailored to different regions and user needs.

Description of the tool: EnsClustering is a cluster analysis tool, based on the k-means algorithm, for ensemble predictions. The aim is to group ensemble members according to similar characteristics and to select the most representative member for each cluster. The user chooses which feature of the data is used to group the ensemble members by clustering: time mean, maximum, a certain percentile (e.g., the 75th) or standard deviation over the time period. For each ensemble member this value is computed at each grid point, obtaining N lat-lon maps, where N is the number of ensemble members. The anomaly is computed by subtracting the ensemble mean of these maps from each of the single maps. The anomaly is therefore computed with respect to the ensemble members (and not with respect to time), and the Empirical Orthogonal Function (EOF) analysis is applied to these anomaly maps. Regarding the EOF analysis, the user can choose either how many Principal Components (PCs) to retain or the percentage of explained variance to keep. After reducing dimensionality via EOF analysis, k-means analysis is applied using the desired subset of PCs.
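The steps described above can be sketched in base R. This is a minimal illustration under stated assumptions, not the package's implementation: the toy array, its dimension sizes and all variable names are hypothetical, prcomp() stands in for the EOF analysis, and the area weighting of grid points often applied in EOF analysis is omitted.

```r
set.seed(1)

# Hypothetical toy ensemble: 12 members, 10 time steps, 4 x 5 lat-lon grid.
n_mem <- 12; n_time <- 10; n_lat <- 4; n_lon <- 5
dat <- array(rnorm(n_mem * n_time * n_lat * n_lon),
             dim = c(member = n_mem, time = n_time, lat = n_lat, lon = n_lon))

# 1. One map per member: here the time mean at each grid point.
maps <- apply(dat, c(1, 3, 4), mean)            # member x lat x lon

# 2. Anomalies with respect to the ensemble mean (not the time mean).
ens_mean <- apply(maps, c(2, 3), mean)          # lat x lon
anom <- sweep(maps, c(2, 3), ens_mean)          # sweep() subtracts by default

# 3. EOF analysis: flatten each map to a row and run a PCA.
#    Columns are already centred because the ensemble mean was removed.
X <- matrix(anom, nrow = n_mem)                 # member x grid point
pca <- prcomp(X, center = FALSE)

# 4. Retain the leading PCs that explain at least 80% of the variance.
expl <- cumsum(pca$sdev^2) / sum(pca$sdev^2) * 100
n_pcs <- which(expl >= 80)[1]
pcs <- pca$x[, seq_len(n_pcs), drop = FALSE]

# 5. k-means in the reduced PC space (3 clusters chosen arbitrarily).
km <- kmeans(pcs, centers = 3, nstart = 20)
km$cluster
```

Fixing the number of PCs directly, instead of a variance threshold, would amount to replacing step 4 with a chosen value of n_pcs.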

The main outputs are the classification in clusters, i.e. which member belongs to which cluster (in k-means analysis the number k of clusters must be defined prior to the analysis), and the most representative member for each cluster, which is the member closest to the cluster centroid. Other outputs describe the statistics of the clustering in the PC space: the minimum and maximum distance between a member in a cluster and the cluster centroid (i.e. the closest and the furthest member), and the intra-cluster standard deviation for each cluster (i.e. how compact the cluster is).
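As an illustration of these outputs, the following base R sketch derives the representative member and the distance statistics from a k-means result. The PC coordinates are randomly generated and hypothetical, the column names of the summary table are chosen for this example only, and the intra-cluster standard deviation is assumed here to be the root-mean-square distance to the centroid, which may differ from the package's definition.

```r
set.seed(2)

# Hypothetical PC coordinates: 12 members described by 3 retained PCs.
pcs <- matrix(rnorm(12 * 3), nrow = 12,
              dimnames = list(paste0("member", 1:12), paste0("PC", 1:3)))
km <- kmeans(pcs, centers = 3, nstart = 20)

# Euclidean distance of every member to the centroid of its own cluster.
own_centroid <- km$centers[km$cluster, , drop = FALSE]
dist_to_centroid <- sqrt(rowSums((pcs - own_centroid)^2))

# Per-cluster summary in the PC space.
stats <- do.call(rbind, lapply(seq_len(nrow(km$centers)), function(k) {
  idx <- which(km$cluster == k)
  d <- dist_to_centroid[idx]
  data.frame(
    cluster        = k,
    freq           = length(idx) / nrow(pcs),    # expressed as a fraction
    closest_member = unname(idx[which.min(d)]),  # representative member
    min_dist       = min(d),
    max_dist       = max(d),
    intra_sd       = sqrt(mean(d^2))             # assumed definition (RMS)
  )
}))
stats
```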

Usage

CST_EnsClustering(
  exp,
  time_moment = "mean",
  numclus = NULL,
  lon_lim = NULL,
  lat_lim = NULL,
  variance_explained = 80,
  numpcs = NULL,
  time_dim = NULL,
  time_percentile = 90,
  cluster_dim = "member",
  verbose = F
)

Arguments

exp

An object of the class 's2dv_cube', containing the variables to be analysed. The element 'data' in the 's2dv_cube' object must have, at least, spatial and temporal dimensions. Latitudinal dimension accepted names: 'lat', 'lats', 'latitude', 'y', 'j', 'nav_lat'. Longitudinal dimension accepted names: 'lon', 'lons', 'longitude', 'x', 'i', 'nav_lon'.

time_moment

Decides the moment to be applied to the time dimension. Can be either 'mean' (time mean), 'sd' (standard deviation along time) or 'perc' (a selected percentile on time). If 'perc', the argument 'time_percentile' is also used.

numclus

Number of clusters (scenarios) to be calculated. If set to NULL the number of ensemble members divided by 10 is used, with a minimum of 2 and a maximum of 8.

lon_lim

List with the two longitude margins in 'c(-180,180)' format.

lat_lim

List with the two latitude margins.

variance_explained

Variance (percentage) to be explained by the set of EOFs. Defaults to 80. Not used if numpcs is specified.

numpcs

Number of EOFs retained in the analysis (optional).

time_dim

String or character array with name(s) of dimension(s) over which to compute statistics. If omitted c("ftime", "sdate", "time") are searched in this order.

time_percentile

Set the percentile in time you want to analyse (used for time_moment = "perc").

cluster_dim

Dimension along which to cluster. Typically "member" or "sdate". This can also be a character vector like c("member", "sdate").

verbose

Logical for verbose output.
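The behaviour documented for time_moment, time_percentile and the default of numclus can be illustrated with two small helpers. Both time_stat() and default_numclus() are hypothetical names introduced for this sketch and are not part of CSTools; in particular, the use of integer division for the default number of clusters is an assumption, and the rounding applied by the package may differ.

```r
# Hypothetical helper, not part of CSTools: reduce a time series to one value
# according to the documented choices for 'time_moment'.
time_stat <- function(x, time_moment = "mean", time_percentile = 90) {
  switch(time_moment,
         mean = mean(x),
         sd   = sd(x),
         perc = unname(quantile(x, probs = time_percentile / 100)),
         stop("unknown time_moment: ", time_moment))
}

# Hypothetical helper: default number of clusters when numclus is NULL,
# i.e. members divided by 10, bounded between 2 and 8.
# Assumption: integer division; the package may round differently.
default_numclus <- function(n_members) {
  min(8, max(2, n_members %/% 10))
}

x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11)
time_stat(x, "mean")
time_stat(x, "sd")
time_stat(x, "perc", time_percentile = 90)
sapply(c(4, 25, 51, 200), default_numclus)
```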

Value

A list with elements $cluster (cluster assigned for each member), $freq (relative frequency of each cluster), $closest_member (representative member for each cluster), $repr_field (list of fields for each representative member), $composites (list of mean fields for each cluster), $lon (selected longitudes of output fields), $lat (selected latitudes of output fields).

Author(s)

Federico Fabiano - ISAC-CNR, f.fabiano@isac.cnr.it

Ignazio Giuntoli - ISAC-CNR, i.giuntoli@isac.cnr.it

Danila Volpi - ISAC-CNR, d.volpi@isac.cnr.it

Paolo Davini - ISAC-CNR, p.davini@isac.cnr.it

Jost von Hardenberg - ISAC-CNR, j.vonhardenberg@isac.cnr.it

Examples

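# Synthetic data: 1 dataset, 4 members, 6 start dates, 3 forecast times, 4 x 4 grid
dat_exp <- array(abs(rnorm(1152))*275, dim = c(dataset = 1, member = 4,
                                              sdate = 6, ftime = 3,
                                              lat = 4, lon = 4))
lon <- seq(0, 3)
lat <- seq(48, 45)
coords <- list(lon = lon, lat = lat)
# Build a minimal 's2dv_cube' object by hand
exp <- list(data = dat_exp, coords = coords)
attr(exp, 'class') <- 's2dv_cube'
# Group the 6 start dates into 3 clusters
res <- CST_EnsClustering(exp = exp, numclus = 3,
                        cluster_dim = c("sdate"))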


[Package CSTools version 5.2.0 Index]