summarize_corr2 {sparklyr.flint}    R Documentation

Pairwise correlation summarizer

Description

Compute pairwise correlations for all possible pairs of columns such that the first column of each pair is one of 'xcolumns' and the second column of each pair is one of 'ycolumns'. Results are stored in new columns named with the following pattern: '<column1>_<column2>_correlation' and '<column1>_<column2>_correlationTStat' for each pair of columns (column1, column2).
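The naming pattern above can be illustrated in base R without a Spark connection. This is a minimal sketch: 'corr2_column_names' is a hypothetical helper, not part of sparklyr.flint, and the order in which it lists the names is an assumption that may differ from the column order of the actual result.

```r
# Hypothetical helper (not part of sparklyr.flint): lists the result column
# names implied by the naming pattern in the description.
corr2_column_names <- function(xcolumns, ycolumns) {
  # Enumerate every (x, y) pair; expand.grid varies its first argument fastest,
  # so the x column changes slowest here.
  pairs <- expand.grid(y = ycolumns, x = xcolumns, stringsAsFactors = FALSE)
  prefix <- paste(pairs$x, pairs$y, sep = "_")
  # Each pair contributes a correlation column and a t-statistic column.
  c(rbind(paste0(prefix, "_correlation"), paste0(prefix, "_correlationTStat")))
}

corr2_names <- corr2_column_names(c("x1", "x2"), c("y1", "y2"))
print(corr2_names)
```

With two x columns and two y columns, four pairs are formed, so eight result columns are expected.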

Usage

summarize_corr2(
  ts_rdd,
  xcolumns,
  ycolumns,
  key_columns = list(),
  incremental = FALSE
)

Arguments

ts_rdd

Timeseries RDD being summarized

xcolumns

A list of column names

ycolumns

A list of column names disjoint from xcolumns

key_columns

Optional list of columns that will form an equivalence relation associating each record with the time series it belongs to (i.e., any 2 records having equal values in those columns will be associated with the same time series, and any 2 records having differing values in those columns are considered to be from 2 separate time series and will therefore be summarized separately). By default, 'key_columns' is empty and all records are considered to be part of a single time series.
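The grouping behavior described for 'key_columns' can be mimicked on a plain data frame. The sketch below uses only base R; the data frame 'records' and its 'id' column are made up for illustration, and the code shows the intended semantics rather than how Flint computes the result.

```r
# A plain data frame standing in for a time series RDD (illustrative data);
# `id` plays the role of a key column.
records <- data.frame(
  t  = c(1, 2, 3, 4, 5, 6),
  id = c("a", "b", "a", "b", "a", "b"),
  x  = c(1, 2, 2, 4, 3, 6),
  y  = c(2, 1, 4, 2, 6, 3)
)

# Empty key columns: all records form one time series, giving one correlation.
overall <- cor(records$x, records$y)

# `id` as key column: records are split by key value and each group is
# summarized separately, giving one correlation per group.
by_key <- sapply(split(records, records$id), function(g) cor(g$x, g$y))

print(overall)
print(by_key)
```

In this made-up data the two groups are each perfectly correlated, while the pooled records are only weakly correlated, which is why the choice of key columns can change the result substantially.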

incremental

If FALSE and 'key_columns' is empty, then apply the summarizer to all records of 'ts_rdd'. If FALSE and 'key_columns' is non-empty, then apply the summarizer to all records within each group determined by 'key_columns'. If TRUE and 'key_columns' is empty, then for each record in 'ts_rdd', the summarizer is applied to that record and all records preceding it, and the summarized result is associated with the timestamp of that record. If TRUE and 'key_columns' is non-empty, then for each record within a group of records determined by 1 or more key columns, the summarizer is applied to that record and all records preceding it within its group, and the summarized result is associated with the timestamp of that record.
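The difference between the two 'incremental' settings with empty 'key_columns' can be sketched in base R as a single correlation versus a running correlation. The vectors below are made up for illustration. The value for the first record is left as NA here because a correlation needs at least two observations; what Flint reports for that record is not covered by this sketch.

```r
# Illustrative values, assumed to be already ordered by timestamp.
x <- c(1, 2, 4, 3, 5)
y <- c(2, 1, 3, 5, 4)

# incremental = FALSE analogue: one correlation over all records.
full <- cor(x, y)

# incremental = TRUE analogue: one result per record, each computed from that
# record and all records preceding it. The first entry is padded with NA.
running <- c(
  NA_real_,
  sapply(2:length(x), function(i) cor(x[1:i], y[1:i]))
)

print(full)
print(running)
```

The last running value equals the non-incremental result, since by then every record has been included.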

Value

A TimeSeriesRDD containing the summarized result

See Also

Other summarizers: ols_regression(), summarize_avg(), summarize_corr(), summarize_count(), summarize_covar(), summarize_dot_product(), summarize_ema_half_life(), summarize_ewma(), summarize_geometric_mean(), summarize_kurtosis(), summarize_max(), summarize_min(), summarize_nth_central_moment(), summarize_nth_moment(), summarize_product(), summarize_quantile(), summarize_skewness(), summarize_stddev(), summarize_sum(), summarize_var(), summarize_weighted_avg(), summarize_weighted_corr(), summarize_weighted_covar(), summarize_z_score()

Examples


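library(sparklyr)
library(sparklyr.flint)

# Attempt a local Spark connection; `sc` is NULL if none can be established
sc <- try_spark_connect(master = "local")

if (!is.null(sc)) {
  # Copy a small data frame with a time column and four value columns to Spark
  sdf <- copy_to(
    sc,
    tibble::tibble(t = seq(10), x1 = rnorm(10), x2 = rnorm(10), y1 = rnorm(10), y2 = rnorm(10))
  )
  # Build a time series RDD using `t` as the time column, in seconds
  ts <- fromSDF(sdf, is_sorted = TRUE, time_unit = "SECONDS", time_column = "t")
  # Correlate each of x1, x2 with each of y1, y2
  ts_corr2 <- summarize_corr2(ts, xcolumns = c("x1", "x2"), ycolumns = c("y1", "y2"))
} else {
  message("Unable to establish a Spark connection!")
}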


[Package sparklyr.flint version 0.2.2 Index]