R: Generate Data of Varying Complexity

generate_data {SPLICE}

R Documentation

Generate Data of Varying Complexity

Description

Generates datasets under 5 scenarios of different levels of complexity (here "complexity" means the level of difficulty of analysis).

Usage

generate_data(
  n_claims_per_period,
  n_periods = 40,
  complexity = c(1:5),
  data_type = c("claims", "payments", "incurred"),
  random_seed = NULL,
  verbose = TRUE,
  covariates_obj = NULL
)

Arguments

`n_claims_per_period`	expected number of claims per period (equals the total expected number of claims divided by `n_periods`).
`n_periods`	number of accident periods considered (equals number of claims development periods considered); default 40.
`complexity`	integer from 1 (simplest) to 5 (most complex); see Details.
`data_type`	a character vector specifying output data types. By default the function will output all 3 datasets (claims, payments, incurred), but the user may choose to output only a subset.
`random_seed`	optional seed for random number generation for reproducibility.
`verbose`	logical; if `TRUE` print a message about the data generated.
`covariates_obj`	a SynthETIC `covariates` object (requires `SynthETIC >= 1.1.0`). Defaults to `NULL`.
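As a rough illustration of how `n_claims_per_period` and `n_periods` relate to the expected total number of claims, the sketch below compares the implied expected total with the number of rows in a simulated claim dataset. It assumes the SPLICE package is installed. The argument values are illustrative only, and `expected_total` and `observed_total` are names introduced here, not part of the package.

```r
library(SPLICE)

n_claims_per_period <- 50
n_periods <- 40

# Expected total number of claims implied by the two arguments
expected_total <- n_claims_per_period * n_periods  # 50 * 40 = 2000

result <- generate_data(
  n_claims_per_period = n_claims_per_period,
  n_periods = n_periods,
  complexity = 1,
  data_type = "claims",
  random_seed = 42,
  verbose = FALSE
)

# Each row of claim_dataset is one claim; the simulated count is random,
# so it will generally differ somewhat from the expected total.
observed_total <- nrow(result$claim_dataset)
c(expected = expected_total, observed = observed_total)
```

The observed count is a random draw, so it should only be expected to be close to, not equal to, the expected total.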

Details

generate_data() produces datasets of varying levels of complexity, where 1 represents the simplest, and 5 represents the most complex:

1 – simple, homogeneous claims experience, with zero inflation.
2 – slightly more complex than 1, with dependence of notification delay and settlement delay on claim size, and 2% p.a. base inflation.
3 – steady increase in claim processing speed over occurrence periods (i.e. steady decline in settlement delays).
4 – inflation shock at time 30 (from 0% to 10% p.a.).
5 – default distributional models, with complex dependence structures (e.g. dependence of settlement delay on claim occurrence period).

We remark that these five scenarios by no means define the limits of the complexity that can be generated with SPLICE. This function is provided for the convenience of users who wish to generate a dataset (or a collection of datasets) under some representative scenarios. If more complex features are required, the user is free to modify the distributional assumptions to achieve their purposes, although this requires more thought and coding.
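One possible way to build such a collection is to loop over the five complexity levels and keep one claim dataset per scenario. This is a minimal sketch, assuming the SPLICE package is installed. The names `complexity_levels` and `datasets`, and the list names `complexity_1` to `complexity_5`, are introduced here for illustration.

```r
library(SPLICE)

complexity_levels <- 1:5

# Generate one claim dataset per complexity level, using the same seed
# and volume settings so that the scenarios are comparable.
datasets <- lapply(complexity_levels, function(k) {
  generate_data(
    n_claims_per_period = 50,
    complexity = k,
    data_type = "claims",
    random_seed = 42,
    verbose = FALSE
  )$claim_dataset
})
names(datasets) <- paste0("complexity_", complexity_levels)

# Number of simulated claims under each scenario
sapply(datasets, nrow)
```

Only the claim datasets are requested here to keep the run short; the other data types can be added through `data_type` if needed.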

Value

A named list of dataframes:

`claim_dataset`	A dataset of claim records that takes the same structure as `test_claim_dataset`, with each row representing a unique claim.
`payment_dataset`	A dataset of partial payment records that takes the same structure as `test_transaction_dataset`, with each row representing a unique payment.
`incurred_dataset`	A dataset of transaction records that tracks how the case estimates change over time. Takes the same structure as `test_incurred_dataset`, with each row representing a transaction (any of claim notification, settlement, a payment, or a case estimate revision).
`covariates_data`	A SynthETIC `covariates_data` object. Returned only if `covariates_obj` is not `NULL`.

Examples

# Generate datasets of full complexity
result <- generate_data(
  n_claims_per_period = 50, data_type = c('claims', 'payments'),
  complexity = 5, random_seed = 42)

# Save individual datasets
claims <- result$claim_dataset
payments <- result$payment_dataset

# Generate chain-ladder compatible dataset
CL_simple <- generate_data(
  n_claims_per_period = 50, data_type = 'claims', complexity = 1, random_seed = 42)

# Suppress message output (same complexity and seed as CL_simple)
CL_simple_2 <- generate_data(
  n_claims_per_period = 50, data_type = 'claims', complexity = 1,
  verbose = FALSE, random_seed = 42)

# Output is reproducible with the same random_seed value
all.equal(CL_simple$claim_dataset, CL_simple_2$claim_dataset)

[Package SPLICE version 1.1.2 Index]