create.reco.train.test {recometrics}    R Documentation

Create Train-Test Splits of Implicit-Feedback Data

Description

Creates train-test splits of implicit-feedback data (recorded user-item interactions) for fitting and evaluating models for recommender systems.

These splits choose "test users" and "items for a given user" separately, offering three modes of splitting the data, selected through 'split_type': 'all', 'separated', and 'joined'.

Usage

create.reco.train.test(
  X,
  split_type = "separated",
  users_test_fraction = 0.1,
  max_test_users = 10000L,
  items_test_fraction = 0.3,
  min_items_pool = 2L,
  min_pos_test = 1L,
  consider_cold_start = FALSE,
  seed = 1L
)
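
A minimal sketch of a call using the signature above, assuming the 'Matrix' and 'recometrics' packages are installed. The random matrix, its dimensions, and the variable names are made up for illustration, and the call is guarded so that the snippet still runs when 'recometrics' is not available.

```r
library(Matrix)

# Illustrative data only: 500 users (rows) x 60 items (columns),
# with roughly 10% of the entries set to 1 (an observed interaction).
set.seed(123)
X <- rsparsematrix(500, 60, density = 0.1, rand.x = function(n) rep(1, n))
X <- as(X, "RsparseMatrix")  # CSR storage, class 'dgRMatrix'

if (requireNamespace("recometrics", quietly = TRUE)) {
  res <- recometrics::create.reco.train.test(
    X,
    split_type = "separated",
    users_test_fraction = 0.1,
    items_test_fraction = 0.3,
    seed = 1L
  )
  # Inspect which elements were returned for this split type.
  print(names(res))
}
```

Passing an explicit 'seed' keeps the split reproducible across runs.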

Arguments

X

The implicit feedback data to split into training-testing-remainder for evaluating recommender systems. Should be passed as a sparse CSR matrix from the 'Matrix' package (class 'dgRMatrix'). Users should correspond to rows, items to columns, and non-zero values to observed user-item interactions.
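
As an illustration of the expected input, the sketch below builds a small 'dgRMatrix' from a log of interactions using the 'Matrix' package. The data frame and its column names are hypothetical, and this is only one of several ways to obtain a CSR matrix.

```r
library(Matrix)

# Hypothetical interaction log: one row per observed user-item interaction.
interactions <- data.frame(
  user = c(1, 1, 2, 3, 3, 3),
  item = c(2, 4, 1, 1, 2, 3)
)

# Rows index users, columns index items; each non-zero entry marks
# an observed interaction.
X <- sparseMatrix(
  i = interactions$user,
  j = interactions$item,
  x = rep(1, nrow(interactions)),
  dims = c(3, 4)
)

# sparseMatrix() returns CSC storage by default; convert to CSR.
X <- as(X, "RsparseMatrix")

class(X)
```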

split_type

Type of data split to generate. Allowed values are: 'all', 'separated', 'joined' (see the function description above for more details).

users_test_fraction

Target fraction of the users to set as test (see the function documentation for more details). If the number represented by this fraction exceeds the number set by 'max_test_users', then the actual number will be set to 'max_test_users'. Note however that the end result might end up containing fewer users if there are not enough users in the data meeting the minimum desired criteria.

If passing 'NULL', will not take a fraction, but will instead take the number that is passed for 'max_test_users'.

Ignored when passing 'split_type="all"'.

max_test_users

Maximum number of users to set as test. Note that this is only used as a cap: the target number of test users is the minimum between this and 'nrow(X)*users_test_fraction' (users correspond to rows of 'X'), while the actual number might end up being lower if there are not enough users meeting the desired minimum conditions.

If passing 'NULL' for 'users_test_fraction', will interpret this as the number of test users to take.

Ignored when passing 'split_type="all"'.
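
The interplay between 'users_test_fraction' and 'max_test_users' can be sketched in base R as follows. The helper name 'target_test_users' is hypothetical, and both the use of 'floor' and the cap at the total number of users are assumptions, since the rounding used by the package is not stated here. The result is only a target: fewer users may be selected if too few meet the eligibility criteria.

```r
target_test_users <- function(n_users, users_test_fraction = 0.1,
                              max_test_users = 10000L) {
  # With no fraction, 'max_test_users' is taken as the requested number
  # (capping at the number of users is an assumption of this sketch).
  if (is.null(users_test_fraction)) {
    return(min(max_test_users, n_users))
  }
  # Otherwise the fraction applies, capped at 'max_test_users'.
  # floor() is an assumption; the package may round differently.
  min(max_test_users, floor(n_users * users_test_fraction))
}

target_test_users(50000)        # the fraction gives 5000, below the cap
target_test_users(500000)       # the fraction gives 50000, capped at 10000
target_test_users(50000, NULL)  # no fraction: takes 'max_test_users'
```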

items_test_fraction

Target fraction of the data (items) to set for test for each user. Should be a number between zero and one (non-inclusive). The actual number of test entries for each user will be determined as 'round(n_entries_user*items_test_fraction)', thus in a long-tailed distribution (typical for recommender systems), the actual fraction that will be obtained is likely to be slightly lower than what is passed here.

Note that items are sampled independently for each user, thus the items that are in the test set for some users might be in the training set for different users.
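
The effect of the rounding rule can be checked with a small worked example in base R. The per-user counts below are hypothetical, chosen to mimic a long-tailed distribution, and the calculation only applies the formula quoted above rather than calling the package.

```r
items_test_fraction <- 0.3

# Hypothetical number of non-zero entries per user: many users with a
# single interaction and a few heavy users (a long-tailed distribution).
n_entries_user <- c(1, 1, 1, 1, 2, 3, 10, 50)

# Documented rule for the number of test entries per user.
n_test_user <- round(n_entries_user * items_test_fraction)
n_test_user

# Overall fraction of entries that would end up in the test sets.
sum(n_test_user) / sum(n_entries_user)
```

Users with a single entry contribute no test entries at all under this rule, which is what pulls the overall fraction below the requested 0.3 in this example.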

min_items_pool

Minimum number of items (sum of positive and negative items) that a user must have in order to be eligible as a test user.

min_pos_test

Minimum number of positive entries (non-zero entries in the test set) that a user must have in order to be eligible as a test user.

consider_cold_start

Whether to still consider a user eligible for test when that user would have test data but no positive (non-zero) entries in the training data. This can only happen when passing 'items_test_fraction>=0.5'.
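
Why cold-start cases need a test fraction of at least one half can be seen from the rounding rule documented for 'items_test_fraction'. The sketch below is plain arithmetic in base R with a hypothetical helper 'n_train_left'; it assumes the package follows that rule exactly and does not call the package.

```r
# Training entries left for a user with 'n' non-zero entries, following
# the documented rule for the number of test entries.
n_train_left <- function(n, items_test_fraction) {
  n - round(n * items_test_fraction)
}

n_entries_user <- 1:5

n_train_left(n_entries_user, 0.3)  # every user keeps at least one entry
n_train_left(n_entries_user, 0.6)  # a user with a single entry keeps none
```

With a fraction below 0.5, the rounded number of test entries is always smaller than the number of entries the user has, so at least one entry stays in the training data.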

seed

Seed to use for random number generation.

Value

Returns a list with two to four elements, depending on the requested split type.

The training and testing items for each user will not intersect, and each item in the original 'X' data for a given test user will be assigned to either the training or the testing sets.


[Package recometrics version 0.1.6-3 Index]